This post lists the latest papers retrieved from Arxiv.org on 2025-06-24. It is updated automatically every day and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Note: Paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.
Table of Contents
Overview (2025-06-24)
938 papers were updated today, including:
- Natural Language Processing: 125 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 302 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 222 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 257 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
【Quick Read】: This paper addresses the problem of unifying representations and enabling efficient matching in multimodal information retrieval, particularly the performance bottleneck in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. The key to the solution is the jina-embeddings-v4 model, which adopts a novel architecture supporting both single-vector and multi-vector embeddings in the late-interaction style, and introduces task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios.
Link: https://arxiv.org/abs/2506.18902
Authors: Michael Günther,Saba Sturua,Mohammad Kalim Akram,Isabelle Mohr,Andrei Ungureanu,Sedigheh Eslami,Scott Martens,Bo Wang,Nan Wang,Han Xiao
Institutions: Jina AI GmbH
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 22 pages, 1-10 main, 14-22 experimental results, benchmark tables
Abstract:We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
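For readers unfamiliar with the late-interaction style mentioned above, here is a minimal sketch of multi-vector MaxSim scoring in the spirit of ColBERT-style late interaction; the function name and tensor shapes are illustrative assumptions, not jina-embeddings-v4's actual API.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """MaxSim scoring: query_vecs (n_q, d), doc_vecs (n_d, d) -> scalar relevance."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    sim = q @ d.T                        # (n_q, n_d) token-to-token cosine similarities
    return sim.max(dim=-1).values.sum()  # best document match per query token, summed
```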
[NLP-1] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
【Quick Read】: This paper tackles the difficulty of unifying visual understanding and generation in multimodal tasks, where the core challenge is achieving efficient cross-modal alignment and interaction under a shared discrete semantic representation. The key to the solution is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary space, unifying vision and text in an expanded vocabulary. This design allows the multimodal LLM Tar to perform cross-modal input and output through a shared interface, without modality-specific designs.
Link: https://arxiv.org/abs/2506.18898
Authors: Jiaming Han,Hao Chen,Yang Zhao,Hanyu Wang,Qi Zhao,Ziyan Yang,Hao He,Xiangyu Yue,Lu Jiang
Institutions: CUHK MMLab (The Chinese University of Hong Kong Multimedia Laboratory); ByteDance Seed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Project page: this https URL
Abstract:This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model’s (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at this https URL
[NLP-2] ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
【Quick Read】: This paper addresses the shortcomings of conventional Process Reward Models (PRMs) in evaluating the intermediate reasoning trajectories of large language models (LLMs), in particular their inability to robustly assess the trajectory-response outputs produced by frontier reasoning models such as Deepseek-R1. The key to the solution is ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate trajectory-response reasoning traces; it combines step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data.
Link: https://arxiv.org/abs/2506.18896
Authors: Jiaru Zou,Ling Yang,Jingwen Gu,Jiahao Qiu,Ke Shen,Jingrui He,Mengdi Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Codes and Models: this https URL
Abstract:Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Projects: this https URL
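One of the use cases listed in the abstract, reward-guided Best-of-N test-time scaling, reduces to a few lines; this is a hedged sketch in which `generate` and `prm_score` are hypothetical stand-ins for a sampler and a trained PRM, not the paper's API.

```python
def best_of_n(prompt, generate, prm_score, n=8):
    """Sample n candidate reasoning traces and keep the one the PRM scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda trace: prm_score(prompt, trace))
```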
[NLP-3] OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
【Quick Read】: This paper targets the limitations of current large language models (LLMs) on mathematical problems that demand creative thinking, especially their weakness in out-of-distribution generalization. The key to the solution is the OMEGA benchmark, a controlled yet diverse evaluation framework inspired by Boden's typology of creativity, with three generalization axes corresponding to exploratory, compositional, and transformative generalization. By systematically evaluating these dimensions, OMEGA can identify and quantify model failures at different levels, laying a foundation for advancing genuine mathematical creativity in LLMs.
Link: https://arxiv.org/abs/2506.18880
Authors: Yiyou Sun,Shawn Hu,Georgia Zhou,Ken Zheng,Hannaneh Hajishirzi,Nouha Dziri,Dawn Song
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning-such as DeepSeek-R1-have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA-Out-of-distribution Math Problems Evaluation with 3 Generalization Axes-a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden’s typology of creativity: (1) Exploratory-applying known problem solving skills to more complex instances within the same problem domain; (2) Compositional-combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative-adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
[NLP-4] CommVQ: Commutative Vector Quantization for KV Cache Compression ICML2025
【Quick Read】: This paper addresses the GPU memory bottleneck caused by the key-value (KV) cache when large language models (LLMs) serve applications requiring long context lengths. The key to the solution is Commutative Vector Quantization (CommVQ), which compresses the KV cache via additive quantization with a lightweight encoder and codebook, and designs the codebook to be commutative with Rotary Position Embedding (RoPE), trained with an Expectation-Maximization (EM) algorithm, enabling efficient integration of decoding into the self-attention mechanism.
Link: https://arxiv.org/abs/2506.18879
Authors: Junyan Li,Yang Zhang,Muhammad Yusuf Hassan,Talha Chafekar,Tianle Cai,Zhile Ren,Pengsheng Guo,Foroozan Karimzadeh,Colorado Reed,Chong Wang,Chuang Gan
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICML 2025 poster
Abstract:Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: this https URL.
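As a rough illustration of the "decoded via simple matrix multiplication" property of additive quantization, here is a sketch under assumed shapes (M sub-codebooks of K codewords each); it is not the paper's implementation and omits the lightweight encoder, the EM training, and the RoPE-commutativity constraint.

```python
import torch

def decode_kv(codes: torch.Tensor, codebooks: torch.Tensor) -> torch.Tensor:
    """codes: (seq, M, K) one-hot or soft codeword assignments; codebooks: (M, K, d).
    Additive quantization reconstructs each vector as a sum of one codeword per sub-codebook."""
    return torch.einsum("smk,mkd->sd", codes, codebooks)
```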
[NLP-5] OmniGen2: Exploration to Advanced Multimodal Generation
【Quick Read】: This paper addresses the trade-off between model flexibility and performance in multimodal generation tasks, particularly text-to-image generation, image editing, and in-context generation. The key to the solution is designing two separate decoding pathways for the text and image modalities, with unshared parameters and a decoupled image tokenizer, which lets the model inherit the strengths of existing multimodal understanding models and preserve their original text-generation capabilities without re-adapting VAE inputs. In addition, comprehensive dataset construction pipelines and a reflection mechanism tailored to image generation further improve generation quality and consistency.
Link: https://arxiv.org/abs/2506.18871
Authors: Chenyuan Wu,Pengfei Zheng,Ruiran Yan,Shitao Xiao,Xin Luo,Yueze Wang,Wanli Li,Xiyan Jiang,Yexin Liu,Junjie Zhou,Ze Liu,Ziyi Xia,Chaofan Li,Haoge Deng,Jiahao Wang,Kun Luo,Bo Zhang,Defu Lian,Xinlong Wang,Zhongyuan Wang,Tiejun Huang,Zheng Liu
Institutions: Beijing Academy of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: this https URL GitHub Link: this https URL
[NLP-6] Mechanistic Interpretability Needs Philosophy
【Quick Read】: This paper addresses the philosophical questions implicit in mechanistic interpretability (MI) research: how a philosophical perspective can clarify its concepts, refine its methods, and assess the epistemic and ethical stakes of interpreting AI systems. The key to the solution is treating philosophy as an ongoing partner in MI research rather than an afterthought, deepening the understanding of MI's theoretical foundations and practical applications through interdisciplinary dialogue.
Link: https://arxiv.org/abs/2506.18852
Authors: Iwan Williams,Ninell Oldenburg,Ruchira Dhar,Joshua Hatherley,Constanza Fierro,Nina Rajcic,Sandrine R. Schiller,Filippos Stamatiou,Anders Søgaard
Institutions: University of Copenhagen, Department of Philosophy; University of Copenhagen, Department of Computer Science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
[NLP-7] USAD: Universal Speech and Audio Representation via Distillation
【Quick Read】: This paper addresses the domain specificity of audio representation learning: existing models typically focus on either speech or non-speech tasks and struggle to handle multiple audio types in a unified way. The key to the solution is Universal Speech and Audio Distillation (USAD), a unified approach that trains a single model on diverse audio types - speech, sound, and music - via efficient layer-to-layer distillation from domain-specific self-supervised learning (SSL) models.
Link: https://arxiv.org/abs/2506.18843
Authors: Heng-Jui Chang,Saurabhchand Bhati,James Glass,Alexander H. Liu
Institutions: MIT CSAIL
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Preprint
Abstract:Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.
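The layer-to-layer distillation objective can be pictured as regressing student layer outputs onto teacher layer outputs; a minimal sketch, assuming the layers have already been paired and projected to a common dimension (the pairing and projection are assumptions, not USAD's exact recipe).

```python
import torch.nn.functional as F

def layer_to_layer_distill_loss(student_layers, teacher_layers):
    """Sum of per-layer regression losses between paired student/teacher hidden states."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_layers, teacher_layers))
```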
[NLP-8] LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
【Quick Read】: This paper addresses two obstacles to ultra-long text generation with large language models (LLMs): the maximum generation length limit and the overall quality degradation as sequence length grows. Conventional approaches such as LongWriter rely on supervised fine-tuning (SFT) over synthetic long-form outputs, but that strategy depends heavily on synthetic data that is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. The key to the proposed solution is an incentivization-based approach trained from scratch without any annotated or synthetic data: reinforcement learning (RL) guides the model to reason, plan, and refine during writing, while specialized reward models improve length control, writing quality, and structural formatting, yielding high-quality ultra-long generation.
Link: https://arxiv.org/abs/2506.18841
Authors: Yuhao Wu,Yushi Bai,Zhiqiang Hu,Roy Ka-Wei Lee,Juanzi Li
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ‘‘teaching’’, which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under this https URL
[NLP-9] STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning
【Quick Read】: This paper addresses the overthinking problem of large language models with extended chain-of-thought (CoT) reasoning: they generate excessive, redundant reasoning steps that raise computational cost and can degrade performance. The key to the solution is STUPID (Steering Token Usage via PID controller), a training-free method that uses a PID controller to dynamically modulate activation-steering strength during inference; it combines a chunk-level classifier that detects redundant reasoning patterns with adaptive adjustment of steering intensity based on the predicted redundancy probability, balancing reasoning quality against computational efficiency.
Link: https://arxiv.org/abs/2506.18831
Authors: Aryasomayajula Ram Bharadwaj
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models employing extended chain-of-thought (CoT) reasoning often suffer from the overthinking phenomenon, generating excessive and redundant reasoning steps that increase computational costs while potentially degrading performance. While recent work has explored static steering approaches to mitigate this issue, they lack the adaptability to dynamically adjust intervention strength based on real-time reasoning quality. We propose STUPID (Steering Token Usage via PID controller), a novel training-free method that employs a PID controller to dynamically modulate activation steering strength during inference. Our approach combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on the predicted redundancy probability. Experimental evaluation on GSM8K demonstrates that STUPID achieves a 6% improvement in accuracy while reducing token usage by 32%, outperforming static steering baselines. Our method provides a principled framework for dynamic reasoning calibration that maintains reasoning quality while significantly improving computational efficiency.
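The control loop at the heart of the method is a textbook PID update driven by the classifier's predicted redundancy probability; a minimal sketch, where the gains, setpoint, and the mapping to steering strength are illustrative assumptions rather than the paper's tuned values.

```python
class PIDController:
    def __init__(self, kp=1.0, ki=0.1, kd=0.05, setpoint=0.2):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target acceptable redundancy probability
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, redundancy_prob: float) -> float:
        error = redundancy_prob - self.setpoint
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# per reasoning chunk: increase steering strength when redundancy is predicted to be high
pid = PIDController()
steering_strength = max(0.0, pid.update(0.45))
```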
[NLP-10] MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation task
【Quick Read】: This paper addresses the challenges of real-time translation of long-form speech, in particular keeping latency low without sacrificing translation quality. The key to the solution is a modular cascade system that adapts strong pre-trained models (Whisper Large-V3-Turbo for ASR, NLLB-3.3B for MT) to the streaming setting with lightweight adaptation techniques, rather than training end-to-end models from scratch. Document-level adaptation and prefix training improve the MT model's handling of incomplete inputs; adaptive emission policies (a wait-k strategy and RALCP) manage the translation stream; and dedicated buffer management and segmentation strategies keep translations coherent over long audio sequences.
Link: https://arxiv.org/abs/2506.18828
Authors: Jorge Iranzo-Sánchez,Javier Iranzo-Sánchez,Adrià Giménez,Jorge Civera,Alfons Juan
Institutions: Universitat Politècnica de València; Universitat de València
Subjects: Computation and Language (cs.CL)
Comments: IWSLT 2025 System Description
Abstract:This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system that adapts strong pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight adaptation techniques rather than training new end-to-end models from scratch. Our approach employs document-level adaptation with prefix training to enhance the MT model’s ability to handle incomplete inputs, while incorporating adaptive emission policies including a wait-k strategy and RALCP for managing the translation stream. Specialized buffer management techniques and segmentation strategies ensure coherent translations across long audio sequences. Experimental results on the ACL60/60 dataset demonstrate that our system achieves a favorable balance between translation quality and latency, with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of 2.94 seconds. Our final model achieves a preliminary score on the official test set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully adapted pre-trained components can create effective simultaneous translation systems for long-form content without requiring extensive in-domain parallel data or specialized end-to-end training.
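Among the emission policies mentioned, wait-k is simple enough to sketch: read k source words before emitting, then emit one target token per additional source word. Here `translate_prefix` is a hypothetical incremental MT call, and the final flush of remaining target tokens after the source ends is omitted for brevity.

```python
def wait_k_stream(source_stream, translate_prefix, k=3):
    """Emit the first target token once k source words are read, then one per new word."""
    source, target = [], []
    for word in source_stream:
        source.append(word)
        if len(source) >= k:
            target.append(translate_prefix(source, target))  # one decoding step
    return target
```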
[NLP-11] RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
【Quick Read】: This paper addresses how to evaluate the performance of Large Language Models (LLMs) at summarizing real-world evidence (RWE), a task for which the structured outputs of RWE studies have lacked targeted evaluation. The key to the solution is RWESummary, a proposed addition to the MedHELM framework for benchmarking LLMs on RWE summarization. RWESummary comprises one scenario and three evaluations covering the major error types observed in summaries of medical research studies, and was developed using Atropos Health proprietary data, providing a systematic evaluation tool for the RWE summarization task.
Link: https://arxiv.org/abs/2506.18819
Authors: Arjun Mukerji,Michael L. Jackson,Jason Jones,Neil Sanghavi
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 2 figures
Abstract:Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
[NLP-12] ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation
【Quick Read】: This paper addresses the inefficiency of large reasoning models (LRMs), which tend to produce excessively verbose reasoning on complex tasks. The key to the solution is ConciseHint, a framework that injects textual hints (manually designed or trained on concise data) during token generation of the reasoning process to continuously encourage concise expression, while adaptively adjusting hint intensity according to query complexity, thereby shortening reasoning without harming model performance.
Link: https://arxiv.org/abs/2506.18810
Authors: Siao Tang,Xinyin Ma,Gongfan Fang,Xinchao Wang
Institutions: National University of Singapore
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Codes are available at this https URL
Abstract:Recent advancements in large reasoning models (LRMs) like DeepSeek-R1 and OpenAI o1 series have achieved notable performance enhancements on complex reasoning tasks by scaling up the generation length by Chain-of-Thought (CoT). However, an emerging issue is their inclination to produce excessively verbose reasoning processes, leading to the inefficiency problem. Existing literature on improving efficiency mainly adheres to the before-reasoning paradigms such as prompting and reasoning or fine-tuning and reasoning, but ignores the promising direction of directly encouraging the model to speak concisely by intervening during the generation of reasoning. In order to fill the blank, we propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely by injecting the textual hint (manually designed or trained on the concise data) during the token generation of the reasoning process. Besides, ConciseHint is adaptive to the complexity of the query by adaptively adjusting the hint intensity, which ensures it will not undermine model performance. Experiments on the state-of-the-art LRMs, including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning processes while maintaining performance well. For instance, we achieve a reduction ratio of 65% for the reasoning length on GSM8K benchmark with Qwen-3 4B with nearly no accuracy loss.
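A highly simplified sketch of the injection idea: periodically splice hint tokens into the sequence during decoding. Everything here (the callables, the fixed interval) is an assumption for illustration; the actual method adapts hint intensity to query complexity rather than using a fixed schedule.

```python
def generate_with_hint(step_fn, prompt_ids, hint_ids, interval=64, max_new_tokens=512):
    """step_fn(ids) -> next token id (one decoding step of the reasoning model)."""
    ids = list(prompt_ids)
    for step in range(max_new_tokens):
        if step > 0 and step % interval == 0:
            ids += list(hint_ids)   # inject the concise-writing hint mid-generation
        ids.append(step_fn(ids))
    return ids
```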
[NLP-13] Existing LLMs Are Not Self-Consistent For Simple Tasks
【Quick Read】: This paper addresses the lack of self-consistency in the decision making of Large Language Models (LLMs), i.e., contradictions within a model's internal reasoning. The study finds that even small models, as well as state-of-the-art models such as DeepSeek-R1 and GPT-o4-mini, are highly inconsistent on simple tasks. To quantify and mitigate these inconsistencies, the paper introduces inconsistency metrics and proposes two automated remedies - a graph-based and an energy-based approach. The key lies in systematic evaluation and correction mechanisms that improve reasoning consistency, making AI systems more reliable and interpretable.
Link: https://arxiv.org/abs/2506.18781
Authors: Zhenru Lin,Jiawen Tao,Yang Yuan,Andrew Chi-Chih Yao
Institutions: Tsinghua University; Shanghai AI Laboratory; Shanghai Qizhi Institute
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 6 figures
Abstract:Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency – no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods – a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at this https URL.
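To make the notion of (in)consistency concrete, here is a sketch of one possible graph-style metric for the "points on a line" setting: query all pairwise orderings and count transitivity violations. The querying function is a hypothetical wrapper around the LLM, and this is not necessarily the paper's exact metric.

```python
from itertools import permutations

def transitivity_violation_rate(items, model_says_less_than):
    """model_says_less_than(a, b) -> bool; one LLM query per ordered pair."""
    lt = {(a, b): model_says_less_than(a, b) for a, b in permutations(items, 2)}
    triples = list(permutations(items, 3))
    # a < b and b < c should imply a < c; count the triples where it does not
    bad = sum(lt[(a, b)] and lt[(b, c)] and not lt[(a, c)] for a, b, c in triples)
    return bad / len(triples) if triples else 0.0
```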
[NLP-14] Programming by Backprop: LLM s Acquire Reusable Algorithmic Abstractions During Code Training
【Quick Read】: This paper asks why training large language models (LLMs) on source code improves their general reasoning abilities, a form of generalization whose mechanism remains poorly understood. The key proposal is Programming by Backprop (PBB): teaching a model to evaluate a program for inputs by training on its source code alone, without ever seeing input-output examples. The core finding is that models can implicitly execute programs within the forward pass, and do so more reliably when stepping through the program in-context via chain-of-thought, enabling reliable evaluation of programs that lack input-output examples.
Link: https://arxiv.org/abs/2506.18777
Authors: Jonathan Cook,Silvia Sapora,Arash Ahmadian,Akbir Khan,Tim Rocktaschel,Jakob Foerster,Laura Ruis
Institutions: FLAIR, University of Oxford; Cohere & Cohere Labs; Anthropic; UCL AI Centre
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Training large language models (LLMs) on source code significantly enhances their general-purpose reasoning abilities, but the mechanisms underlying this generalisation are poorly understood. In this paper, we propose Programming by Backprop (PBB) as a potential driver of this effect - teaching a model to evaluate a program for inputs by training on its source code alone, without ever seeing I/O examples. To explore this idea, we finetune LLMs on two sets of programs representing simple maths problems and algorithms: one with source code and I/O examples (w/ IO), the other with source code only (w/o IO). We find evidence that LLMs have some ability to evaluate w/o IO programs for inputs in a range of experimental settings, and make several observations. Firstly, PBB works significantly better when programs are provided as code rather than semantically equivalent language descriptions. Secondly, LLMs can produce outputs for w/o IO programs directly, by implicitly evaluating the program within the forward pass, and more reliably when stepping through the program in-context via chain-of-thought. We further show that PBB leads to more robust evaluation of programs across inputs than training on I/O pairs drawn from a distribution that mirrors naturally occurring data. Our findings suggest a mechanism for enhanced reasoning through code training: it allows LLMs to internalise reusable algorithmic abstractions. Significant scope remains for future work to enable LLMs to more effectively learn from symbolic procedures, and progress in this direction opens other avenues like model alignment by training on formal constitutional principles.
[NLP-15] ASP2LJ: An Adversarial Self-Play Lawyer Augmented Legal Judgment Framework
【Quick Read】: This paper targets two key problems in Legal Judgment Prediction (LJP): the long-tail distribution problem and the neglected role of lawyers. The long-tail problem stems from the high annotation cost and imbalanced distribution of datasets collected from authentic cases, which degrades model performance; the neglect of lawyers means existing systems focus on improving judges' decision making while overlooking lawyers' critical role in refining arguments, limiting overall judicial accuracy. The key to the solution is the Adversarial Self-Play Lawyer Augmented Legal Judgment Framework (ASP2LJ), which integrates a case generation module to counter long-tailed data distributions and an adversarial self-play mechanism to strengthen lawyers' argumentation skills, allowing a judge to reference evolved lawyers' arguments and improving the objectivity, fairness, and rationality of judicial decisions.
Link: https://arxiv.org/abs/2506.18768
Authors: Ao Chang,Tong Zhou,Yubo Chen,Delai Qiu,Shengping Liu,Kang Liu,Jun Zhao
Institutions: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Unisound
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Legal Judgment Prediction (LJP) aims to predict judicial outcomes, including relevant legal charge, terms, and fines, which is a crucial process in Large Language Model(LLM). However, LJP faces two key challenges: (1)Long Tail Distribution: Current datasets, derived from authentic cases, suffer from high human annotation costs and imbalanced distributions, leading to model performance degradation. (2)Lawyer’s Improvement: Existing systems focus on enhancing judges’ decision-making but neglect the critical role of lawyers in refining arguments, which limits overall judicial accuracy. To address these issues, we propose an Adversarial Self-Play Lawyer Augmented Legal Judgment Framework, called ASP2LJ, which integrates a case generation module to tackle long-tailed data distributions and an adversarial self-play mechanism to enhance lawyers’ argumentation skills. Our framework enables a judge to reference evolved lawyers’ arguments, improving the objectivity, fairness, and rationality of judicial decisions. Besides, We also introduce RareCases, a dataset for rare legal cases in China, which contains 120 tail-end cases. We demonstrate the effectiveness of our approach on the SimuCourt dataset and our RareCases dataset. Experimental results show our framework brings improvements, indicating its utilization. Our contributions include an integrated framework, a rare-case dataset, and publicly releasing datasets and code to support further research in automated judicial systems.
[NLP-16] Neural Total Variation Distance Estimators for Changepoint Detection in News Data
【Quick Read】: This paper addresses changepoint detection - detecting when public discourse shifts in response to major events - which is challenging because real-world data is high-dimensional, sparse, and noisy. The key to the solution is training neural-network classifiers to distinguish articles from different time periods and using the resulting classification accuracy to estimate the total variation distance between the underlying content distributions, with large distances flagging changepoints. The method builds on the learning-by-confusion scheme, can autonomously discover significant shifts in public discourse, and yields a quantitative measure of change in content.
Link: https://arxiv.org/abs/2506.18764
Authors: Csaba Zsolnai,Niels Lörch,Julian Arnold
Institutions: University of Basel
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 16 pages, 3 figures
Abstract:Detecting when public discourse shifts in response to major events is crucial for understanding societal dynamics. Real-world data is high-dimensional, sparse, and noisy, making changepoint detection in this domain a challenging endeavor. In this paper, we leverage neural networks for changepoint detection in news data, introducing a method based on the so-called learning-by-confusion scheme, which was originally developed for detecting phase transitions in physical systems. We train classifiers to distinguish between articles from different time periods. The resulting classification accuracy is used to estimate the total variation distance between underlying content distributions, where significant distances highlight changepoints. We demonstrate the effectiveness of this method on both synthetic datasets and real-world data from The Guardian newspaper, successfully identifying major historical events including 9/11, the COVID-19 pandemic, and presidential elections. Our approach requires minimal domain knowledge, can autonomously discover significant shifts in public discourse, and yields a quantitative measure of change in content, making it valuable for journalism, policy analysis, and crisis monitoring.
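The link between classifier accuracy and total variation (TV) distance is standard: for a balanced two-class problem, the Bayes-optimal accuracy equals (1 + TV)/2, so a trained classifier's held-out accuracy yields a lower-bound TV estimate. A sketch of the scan over candidate changepoints; `train_and_eval` is a hypothetical helper that trains the before/after article classifier for one split and returns its test accuracy.

```python
def tv_estimate(accuracy: float) -> float:
    return max(0.0, 2.0 * accuracy - 1.0)   # acc = (1 + TV) / 2  =>  TV = 2*acc - 1

def scan_changepoints(candidate_dates, train_and_eval):
    """Higher TV between 'before' and 'after' distributions flags a changepoint."""
    return {date: tv_estimate(train_and_eval(date)) for date in candidate_dates}
```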
[NLP-17] Semantic-Preserving Adversarial Attacks on LLM s: An Adaptive Greedy Binary Search Approach
【Quick Read】: This paper addresses the unintended misinterpretations that arise when Large Language Models (LLMs) rely on automatic prompt engineering in graphical user interfaces (GUIs): because user requirements are diverse, automated optimization can distort the original intent and produce erroneous outputs. The key to the solution is the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability and dynamically evaluates the impact of such strategies on LLM performance, enabling robust adversarial sample generation.
Link: https://arxiv.org/abs/2506.18756
Authors: Chong Zhang,Xiang Li,Jia Wang,Shan Liang,Haochen Xue,Xiaobo Jin
Institutions: Xi’an Jiaotong-Liverpool University; The Chinese University of Hong Kong; University of Liverpool
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 19 pages, 8 figures
Abstract:Large Language Models (LLMs) increasingly rely on automatic prompt engineering in graphical user interfaces (GUIs) to refine user inputs and enhance response accuracy. However, the diversity of user requirements often leads to unintended misinterpretations, where automated optimizations distort original intentions and produce erroneous outputs. To address this challenge, we propose the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability. Our approach dynamically evaluates the impact of such strategies on LLM performance, enabling robust adversarial sample generation. Through extensive experiments on open and closed-source LLMs, we demonstrate AGBS’s effectiveness in balancing semantic consistency and attack efficacy. Our findings offer actionable insights for designing more reliable prompt optimization systems. Code is available at: this https URL
[NLP-18] Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation IJCAI2025
【Quick Read】: This paper addresses how to produce efficient, modality-specific utterance representations for Emotion Recognition in Conversation (ERC). Existing methods integrate features from different modality-specific encoders but neglect the modalities' varying contributions to the task and introduce high complexity by aligning modalities at the frame level. The key to the solution is the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD): prompt learning enhances textual representations, knowledge distillation strengthens the representations of weaker modalities, and a multi-modal anchor gated transformer effectively integrates utterance-level representations across modalities.
Link: https://arxiv.org/abs/2506.18716
Authors: Jie Li,Shifei Ding,Lili Guo,Xuan Li
Institutions: China University of Mining and Technology; Mine Digitization Engineering Research Center of Ministry of Education, China University of Mining and Technology
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: This paper has been accepted by IJCAI2025
Abstract:Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: this https URL.
[NLP-19] Benchmarking the Pedagogical Knowledge of Large Language Models
【Quick Read】: This paper addresses the fact that existing benchmarks (such as MMLU) focus predominantly on content knowledge while neglecting the assessment of pedagogy - the method and practice of teaching. The key to the solution is The Pedagogy Benchmark, which evaluates large language models on Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge, using carefully curated questions sourced from professional development exams for teachers.
Link: https://arxiv.org/abs/2506.18710
Authors: Maxime Lelièvre,Amy Waldock,Meng Liu,Natalia Valdés Aspillaga,Alasdair Mackintosh,María José Ogando Portelo,Jared Lee,Paul Atherton,Robin A. A. Ince,Oliver G. B. Garrod
Institutions: Fab Inc; AI-for-Education.org
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI’s knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models’ understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at this https URL which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models’ capacities to understand pedagogical concepts, respond appropriately to learners’ needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.
[NLP-20] Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
【Quick Read】: This paper addresses the failure of neural sequence-to-sequence speech recognizers on words unseen during training, such as named entities, acronyms, or domain-specific terms. The key to the solution is a method that lets users add corrections on the fly during inference to fix substitution errors, improving recognition accuracy on such challenging words. Experiments show a relative improvement in biased word error rate of up to 11% while maintaining a competitive overall word error rate.
Link: https://arxiv.org/abs/2506.18703
Authors: Christian Huber,Alexander Waibel
Institutions: Karlsruhe Institute of Technology; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principal open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method which allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
[NLP-21] Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
【Quick Read】: This paper addresses the high computational and energy costs of Large Language Models (LLMs), focusing on the inefficiency of tokenization in the chatbot setting. The key to the solution is optimizing tokenizers for chat conversations: redesigning tokenizer vocabularies using a publicly available corpus of chatbot conversations reduces the number of tokens in dialogues, which can yield meaningful energy savings while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
Link: https://arxiv.org/abs/2506.18674
Authors: Raquel Ferrando,Javier Conde,Gonzalo Martínez,Pedro Reviriego
Institutions: ETSI de Telecomunicación; Universidad Politécnica de Madrid
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The computational and energy costs of Large Language Models (LLMs) have increased exponentially driven by the growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. Therefore, the tokenizer plays an important role in the efficiency of a model, and they are carefully optimized to minimize the number of tokens for the text in their training corpus. One of the most popular applications of LLMs are chatbots that interact with users. A key observation is that, for those chatbots, what is important is the performance of the tokenizer in the user text input and the chatbot responses. Those are most likely different from the text in the training corpus. So, a question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings, in the range of 5% to 10% while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
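The experiment is easy to reproduce in spirit with the Hugging Face `tokenizers` library: train a BPE vocabulary on a chat corpus and compare token counts against a stock tokenizer. A sketch, assuming a local `chat_corpus.txt` and illustrative hyperparameters; the paper's exact setup may differ.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["chat_corpus.txt"], trainer=trainer)  # conversation-optimized vocab

n_tokens = len(tokenizer.encode("could you summarize this paragraph for me?").ids)
print(n_tokens)  # compare with the same text under the model's original tokenizer
```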
[NLP-22] ByteSpan: Information-Driven Subword Tokenisation
【Quick Read】: This paper addresses the limitations of conventional subword tokenization in producing effective, semantically structured vocabularies, particularly regarding morphological alignment and compression efficiency in English and other languages. The key to the solution is ByteSpan, an information-driven subword tokenizer that uses an external byte-level language model (LM) during training to identify contiguous predictable byte sequences and group them into subwords, instead of relying on conventional pooling strategies. Experiments show higher morphological alignment scores than BPE for English, and similar compression and Rényi efficiency across 25 languages.
Link: https://arxiv.org/abs/2506.18639
Authors: Zébulon Goriely,Suchir Salhan,Pietro Lesci,Julius Cheng,Paula Buttery
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to TokShop 2025 (Non-archival)
Abstract:Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model’s prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
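The grouping rule can be sketched directly from the description: extend the current subword while the external byte LM finds the next byte predictable, and start a new one at a surprisal spike. The threshold value and the source of the surprisal scores are assumptions for illustration.

```python
def byte_spans(byte_values, surprisals, threshold=3.0):
    """byte_values: non-empty list[int]; surprisals[i] = -log p(byte_i | prefix)
    from an external byte-level LM. Returns a list of subword byte strings."""
    spans, current = [], [byte_values[0]]
    for b, s in zip(byte_values[1:], surprisals[1:]):
        if s < threshold:
            current.append(b)             # predictable byte: extend the current subword
        else:
            spans.append(bytes(current))  # surprisal spike: close span, start a new one
            current = [b]
    spans.append(bytes(current))
    return spans
```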
[NLP-23] ReDit: Reward Dithering for Improved LLM Policy Optimization
【Quick Read】: This paper addresses the gradient anomalies, unstable optimization, and slow convergence caused by the discrete rewards of rule-based reward systems when training Large Language Models. The key to the solution is ReDit (Reward Dithering), which adds simple random noise to the discrete reward signal; the perturbed reward provides exploratory gradients throughout learning, yielding smoother gradient updates and faster convergence, while the injected stochasticity in flat reward regions encourages the model to explore novel policies and escape local optima.
Link: https://arxiv.org/abs/2506.18631
Authors: Chenxing Wei,Jiarui Yu,Ying Tiffany He,Hande Dong,Yao Shu,Fei Yu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 15 figures
Abstract:DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it’s a ‘‘perfect’’ reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomaly, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
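The core operation is a one-liner; a sketch with Gaussian noise, where the noise scale is an illustrative assumption (the paper's theoretical analysis of why this helps is not reflected here).

```python
import random

def dithered_reward(rule_based_reward: float, sigma: float = 0.05) -> float:
    """Perturb a discrete rule-based reward so flat regions still carry gradient signal."""
    return rule_based_reward + random.gauss(0.0, sigma)
```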
[NLP-24] AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs ICCS2025
【Quick Read】: This paper addresses the hallucinations that Large Language Models (LLMs) produce in real-world applications, even in Retrieval-Augmented Generation (RAG) settings, which pose a significant obstacle to deployment. The key to the solution is AggTruth, a method for online detection of contextual hallucinations that analyzes the distribution of internal attention scores over the provided context (passage). Four variants of the method are proposed, each using a different technique to aggregate the attention scores; experiments show stable performance in both same-task and cross-task setups, outperforming the current state of the art in multiple scenarios.
Link: https://arxiv.org/abs/2506.18628
Authors: Piotr Matys,Jan Eliasz,Konrad Kiełczyński,Mikołaj Langner,Teddy Ferdinan,Jan Kocoń,Przemysław Kazienko
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICCS 2025 Workshops
Abstract:In real-world applications, Large Language Models (LLMs) often hallucinate, even in Retrieval-Augmented Generation (RAG) settings, which poses a significant challenge to their deployment. In this paper, we introduce AggTruth, a method for online detection of contextual hallucinations by analyzing the distribution of internal attention scores in the provided context (passage). Specifically, we propose four different variants of the method, each varying in the aggregation technique used to calculate attention scores. Across all LLMs examined, AggTruth demonstrated stable performance in both same-task and cross-task setups, outperforming the current SOTA in multiple scenarios. Furthermore, we conducted an in-depth analysis of feature selection techniques and examined how the number of selected attention heads impacts detection performance, demonstrating that careful selection of heads is essential to achieve optimal results.
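A minimal sketch of the kind of feature an AggTruth-style detector builds: aggregate each head's attention mass over the passage tokens and feed the resulting vector to a lightweight classifier. The shapes and the particular aggregation (a mean here) are assumptions; the paper compares four aggregation variants.

```python
import torch

def context_attention_features(attn: torch.Tensor, ctx_mask: torch.Tensor) -> torch.Tensor:
    """attn: (heads, gen_len, src_len) attention weights; ctx_mask: (src_len,) bool
    marking passage tokens. Returns one aggregated attention-mass feature per head."""
    mass_on_context = attn[..., ctx_mask].sum(dim=-1)  # (heads, gen_len)
    return mass_on_context.mean(dim=-1)                # (heads,) features for a classifier
```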
[NLP-25] he Anatomy of Speech Persuasion: Linguistic Shifts in LLM -Modified Speeches
【速读】: 该论文试图解决大型语言模型如何理解公共演讲中说服力(persuasiveness)的概念问题,其解决方案的关键在于提出一种新颖的方法论和一个可解释的文本特征集,该特征集整合了修辞手法(rhetorical devices)和话语标记(discourse markers)。通过调整“Ma These en 180 Secondes”竞赛中博士生演讲稿,并利用3MT法国数据集,研究者使用GPT-4o对演讲内容进行说服力增强或削弱,并分析原始与生成文本在新特征上的语言变化,结果显示GPT-4o主要通过系统性风格修改而非人类类似的方式影响说服力。
链接: https://arxiv.org/abs/2506.18621
作者: Alisa Barkar,Mathieu Chollet,Matthieu Labeau,Beatrice Biancardi,Chloe Clavel
机构: Telecom Paris (电信巴黎); University of Glasgow (格拉斯哥大学); IMT Atlantique (IMT大西洋); CESI LINEACT (CESI LINEACT); INRIA ALMAnaCH (INRIA ALMAnaCH)
类目: Computation and Language (cs.CL)
备注: Under submission to ICNLSP 2025. 9 pages, 2 tables
Abstract:This study examines how large language models understand the concept of persuasiveness in public speaking by modifying speech transcripts from PhD candidates in the “Ma These en 180 Secondes” competition, using the 3MT French dataset. Our contributions include a novel methodology and an interpretable textual feature set integrating rhetorical devices and discourse markers. We prompt GPT-4o to enhance or diminish persuasiveness and analyze linguistic shifts between original and generated speech in terms of the new features. Results indicate that GPT-4o applies systematic stylistic modifications rather than optimizing persuasiveness in a human-like manner. Notably, it manipulates emotional lexicon and syntactic structures (such as interrogative and exclamatory clauses) to amplify rhetorical impact.
[NLP-26] Semantic similarity estimation for domain specific data using BERT and other techniques
【Quick Read】: This paper addresses semantic similarity estimation, an important research problem in natural language processing and natural language understanding with broad applications in downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation, and machine translation. The work estimates semantic similarity with several state-of-the-art techniques, including USE (Universal Sentence Encoder), InferSent, and the more recent BERT (Bidirectional Encoder Representations from Transformers). Experiments show that BERT performs markedly better than the other methods, mainly owing to the fine-tuning procedure in its training process, which allows it to learn patterns from the training data. The work demonstrates BERT's applicability to domain-specific datasets and concludes that BERT is the best of the compared techniques for domain-specific data.
Link: https://arxiv.org/abs/2506.18602
Authors: R. Prashanth
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Applications (stat.AP)
Comments: This is a preprint version of an article accepted for publication in the proceedings of Machine Learning and Data Mining 2019
Abstract:Estimation of semantic similarity is an important research problem both in natural language processing and the natural language understanding, and that has tremendous application on various downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation and machine translation. In this work, we carry out the estimation of semantic similarity using different state-of-the-art techniques including the USE (Universal Sentence Encoder), InferSent and the most recent BERT, or Bidirectional Encoder Representations from Transformers, models. We use two question pairs datasets for the analysis, one is a domain specific in-house dataset and the other is a public dataset which is the Quora’s question pairs dataset. We observe that the BERT model gave much superior performance as compared to the other methods. This should be because of the fine-tuning procedure that is involved in its training process, allowing it to learn patterns based on the training data that is used. This works demonstrates the applicability of BERT on domain specific datasets. We infer from the analysis that BERT is the best technique to use in the case of domain specific data.
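For a concrete picture of embedding-based semantic similarity, the whole pipeline is a few lines with the sentence-transformers library; this is a generic illustration rather than the paper's exact fine-tuned BERT setup, and the model name and example sentences are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any BERT-family sentence encoder
embeddings = model.encode(["How do I reset my password?",
                           "What steps change my account password?"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity as the score
```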
[NLP-27] Reply to “Emergent LLM behaviors are observationally equivalent to data leakage”
【Quick Read】: This paper addresses the concern of data contamination when simulating populations of large language models (LLMs), i.e., the possibility that training data may shape outcomes in unintended ways. While this concern may hinder certain multi-agent experiments, the authors argue that it does not preclude the study of genuinely emergent dynamics in LLM populations. The key point is that self-organisation and model-dependent emergent dynamics can be studied in LLM populations, supported by empirical observations in the specific case of social conventions.
Link: https://arxiv.org/abs/2506.18600
Authors: Ariel Flint Ashery,Luca Maria Aiello,Andrea Baronchelli
Institutions: City St George’s, University of London; IT University of Copenhagen; Pioneer Centre for AI
Subjects: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Reply to arXiv:2505.23796
Abstract:A potential concern when simulating populations of large language models (LLMs) is data contamination, i.e. the possibility that training data may shape outcomes in unintended ways. While this concern is important and may hinder certain experiments with multi-agent models, it does not preclude the study of genuinely emergent dynamics in LLM populations. The recent critique by Barrie and Törnberg [1] of the results of Flint Ashery et al. [2] offers an opportunity to clarify that self-organisation and model-dependent emergent dynamics can be studied in LLM populations, highlighting how such dynamics have been empirically observed in the specific case of social conventions.
[NLP-28] No Training Wheels: Steering Vectors for Bias Correction at Inference Time
【Quick Read】: This paper addresses the class biases and spurious correlations that neural network classifiers inherit from datasets with uneven group representation: such models perform well on average but consistently fail on atypical groups. The key to the solution is a cheap, training-free method inspired by the steering vectors used to edit behaviors in large language models: the difference in mean activations between majority and minority groups defines a "bias vector," which is subtracted from the model's residual stream, reducing classification bias and improving worst-group accuracy.
Link: https://arxiv.org/abs/2506.18598
Authors: Aviral Gupta,Armaan Sethi,Ameesh Sethi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a “bias vector,” which we subtract from the model’s residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies for extracting and applying these vectors in transformer-like classifiers, showing that steering vectors, traditionally used in generative models, can also be effective in classification. More broadly, we showcase an extremely cheap, inference time, training free method to mitigate bias in classification models.
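The method itself fits in a few lines; a minimal sketch, assuming you have already collected hidden activations at a chosen layer for examples from each group (the layer choice and scaling factor are assumptions).

```python
import torch

def bias_vector(acts_majority: torch.Tensor, acts_minority: torch.Tensor) -> torch.Tensor:
    """Each input: (n_examples, d) activations. The bias direction is the mean difference."""
    return acts_majority.mean(dim=0) - acts_minority.mean(dim=0)

def debias(hidden: torch.Tensor, v: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Subtract the bias direction from residual-stream activations at inference time."""
    return hidden - alpha * v
```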
[NLP-29] Airalogy: AI-empowered universal data digitization for research automation
【Quick Read】: This paper addresses the standardization and efficient management of research data across disciplines: current AI applications are constrained by the availability, structure, and shareability of datasets, and existing platforms struggle to balance universality with standardization. The key to the solution is Airalogy, the world's first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines, offering customizable, standardized data records and an advanced AI research copilot for intelligent QA, automated data entry, analysis, and research automation.
Link: https://arxiv.org/abs/2506.18586
Authors: Zijie Yang,Qiji Zhou,Fang Guo,Sijie Zhang,Yexun Xi,Jinglei Nie,Yudian Zhu,Liping Huang,Chou Wu,Yonghe Xia,Xiaoyu Ma,Yingming Pu,Panzhong Lu,Junshu Pan,Mingtao Chen,Tiannan Guo,Yanmei Dou,Hongyu Chen,Anping Zeng,Jiaxing Huang,Tian Xu,Yue Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: 146 pages, 6 figures, 49 supplementary figures
Abstract:Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, and difficult to share. Creating a single platform for standardized data digitization needs to overcome the inherent challenge of balancing between universality (supporting the diverse, ever-evolving needs of various disciplines) and standardization (enforcing consistent formats to fully enable AI). No existing platform accommodates both facets. Building a truly multidisciplinary platform requires integrating scientific domain knowledge with sophisticated computing skills. Researchers often lack the computational expertise to design customized and standardized data recording methods, whereas platform developers rarely grasp the intricate needs of multiple scientific domains. These gaps impede research data standardization and hamper AI-driven progress. In this study, we address these challenges by developing Airalogy (this https URL), the world’s first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines. Airalogy represents entire research workflows using customizable, standardized data records and offers an advanced AI research copilot for intelligent QA, automated data entry, analysis, and research automation. Already deployed in laboratories across all four schools of Westlake University, Airalogy has the potential to accelerate and automate scientific innovation in universities, industry, and the global research community-ultimately benefiting humanity as a whole.
[NLP-30] Parallel Continuous Chain-of-Thought with Jacobi Iteration
【Quick Read】: This paper addresses the inefficiency of parallel training for continuous chain-of-thought (CoT): the sequential dependencies between latent thought tokens prevent parallelism, leading to long training times. The key to the solution is Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel rather than sequentially, thereby improving both training and inference efficiency.
Link: https://arxiv.org/abs/2506.18582
Authors: Haoyi Wu,Zhihao Teng,Kewei Tu
Institutions: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Subjects: Computation and Language (cs.CL)
Comments: under review
Abstract:Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at this https URL.
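To make the Jacobi idea concrete, here is a minimal sketch: instead of decoding T latent thought tokens one after another, all T tokens are refreshed in parallel from the previous iterate for a fixed number of iterations. The toy `step_fn` below stands in for one real transformer pass and is purely illustrative.

```python
import torch

def jacobi_latent_cot(step_fn, z_init: torch.Tensor, n_iters: int) -> torch.Tensor:
    """Refine all T latent thought tokens in parallel for a fixed number of
    Jacobi iterations, instead of decoding them sequentially."""
    z = z_init
    for _ in range(n_iters):
        z = step_fn(z)  # every token is updated from the previous iterate
    return z

# Toy usage: a linear map with tanh stands in for one transformer pass.
T, d = 8, 16
W = torch.randn(d, d) * 0.1
z_final = jacobi_latent_cot(lambda z: torch.tanh(z @ W), torch.zeros(T, d), n_iters=5)
print(z_final.shape)  # torch.Size([8, 16])
```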
zh
[NLP-31] A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance
【速读】: 该论文试图解决 hate speech(仇恨言论)定义模糊性对自然语言处理(NLP)模型性能影响的问题。其解决方案的关键在于通过收集和分析现有文献中的 hate speech 定义,构建一个包含14个概念要素的分类体系,从而为不同定义的 hate speech 提供结构化的参考框架,并在此基础上进行零样本实验评估三种大语言模型(LLMs)在不同数据集上的表现。
链接: https://arxiv.org/abs/2506.18576
作者: Matteo Melis,Gabriella Lapesa,Dennis Assenmacher
机构: Aarhus University (奥胡斯大学); GESIS - Leibniz Institute for the Social Sciences (GESIS-莱布尼茨社会科学研究所); Heinrich-Heine University Düsseldorf (海因里希-海涅大学杜塞尔多夫)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it, and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 Conceptual Elements: building blocks that capture different aspects of hate speech definitions, such as references to the target of hate (individuals or groups) or to its potential consequences. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs, on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.
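A minimal sketch of what a definition-conditioned zero-shot prompt could look like; the exact template the paper used is not reproduced here, so the wording below is an assumption.

```python
def zero_shot_prompt(definition: str, text: str) -> str:
    """Inject one definition from the taxonomy into a zero-shot
    classification prompt; wording is illustrative, not the paper's."""
    return (
        "Definition of hate speech:\n"
        f"{definition}\n\n"
        "Based strictly on this definition, answer 'hate' or 'not hate' for "
        "the following text.\n"
        f"Text: {text}\n"
        "Answer:"
    )

example_def = ("Language that attacks a person or a group on the basis of "
               "attributes such as ethnicity, religion, or gender.")
print(zero_shot_prompt(example_def, "An example input sentence."))
```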
zh
[NLP-32] When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking
【速读】: 该论文试图解决在MS MARCO段落排序任务中,微调预训练的Transformer模型反而导致性能下降的反直觉现象。其解决方案的关键在于揭示微调过程破坏了基础模型在10亿句对上进行大规模预训练时所学习到的最优嵌入空间结构,进而导致性能劣化。通过对比多种微调方法与基础模型sentence-transformers/all-MiniLM-L6-v2的性能,研究发现所有微调方法均表现不佳,这表明在饱和基准上传统迁移学习的有效性可能被高估,未来可能需要通过架构创新实现显著改进。
链接: https://arxiv.org/abs/2506.18535
作者: Manu Pande,Shahil Kumar,Anay Yatin Damle
机构: IIIT Allahabad(印度信息科技学院阿拉哈巴德分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:This paper investigates the counterintuitive phenomenon where fine-tuning pre-trained transformer models degrades performance on the MS MARCO passage ranking task. Through comprehensive experiments involving five model variants, including full-parameter fine-tuning and parameter-efficient LoRA adaptations, we demonstrate that all fine-tuning approaches underperform the base sentence-transformers/all-MiniLM-L6-v2 model (MRR@10: 0.3026). Our analysis reveals that fine-tuning disrupts the optimal embedding space structure learned during the base model’s extensive pre-training on 1 billion sentence pairs, including 9.1 million MS MARCO samples. UMAP visualizations show progressive embedding space flattening, while training dynamics analysis and computational efficiency metrics further support our findings. These results challenge conventional wisdom about transfer learning effectiveness on saturated benchmarks and suggest architectural innovations may be necessary for meaningful improvements.
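For readers who want to reproduce the baseline number, a minimal MRR@10 evaluation of the reported base model might look like the sketch below; only the model name comes from the abstract, and the single-relevant-passage data handling is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# The base model the paper reports as outperforming all fine-tuned variants.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def mrr_at_10(queries, passages, relevant_idx):
    """relevant_idx[i] is the index in `passages` of the passage that is
    relevant to queries[i] (toy single-relevant setup)."""
    q = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
    p = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
    ranking = util.cos_sim(q, p).argsort(dim=1, descending=True)
    rr = 0.0
    for i, rel in enumerate(relevant_idx):
        top10 = ranking[i, :10].tolist()
        if rel in top10:
            rr += 1.0 / (top10.index(rel) + 1)
    return rr / len(queries)
```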
zh
[NLP-33] End-to-End Spoken Grammatical Error Correction
【速读】: 该论文旨在解决口语语法错误修正(Spoken Grammatical Error Correction, SGEC)系统中因语音识别错误、不流畅表达和缺乏结构化输入所带来的挑战,以及传统级联式架构中误差传播的问题。其解决方案的关键在于提出一种端到端(End-to-End, E2E)框架,通过自动伪标注方法扩充训练数据,并引入上下文信息和参考对齐机制以提升修正精度,同时结合编辑置信度估计排除低置信度的修正,从而显著提升SGEC系统的性能。
链接: https://arxiv.org/abs/2506.18532
作者: Mengjie Qian,Rao Ma,Stefano Bannò,Mark J.F. Gales,Kate M. Knill
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners’ speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC-labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Giving candidates feedback on their mistakes is an essential step to improving performance. In E2E systems, the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that result from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak Improve (SI) corpus show that the proposed approaches significantly boost E2E SGEC performance.
zh
[NLP-34] Smooth Operators: LLM s Translating Imperfect Hints into Disfluency-Rich Transcripts INTERSPEECH2025
【速读】: 该论文旨在解决口语中不流畅现象(disfluency)的准确检测问题,这对于提升自动语音与语言处理系统的性能以及推动更具包容性的语音与语言技术发展具有重要意义。其解决方案的关键在于提出一种新颖的方法,将不流畅现象作为带有时间戳的显式标记进行转录,从而生成完全标注的富含不流畅现象的文本。该方法结合了从音频编码器中提取的声学表示与不同质量的文本输入(如干净转录、对齐转录或基于音素的自动语音识别模型输出),并证明即使文本输入存在缺陷,只要包含时间戳相关线索,大型语言模型(LLM)仍能有效优化输入并生成完整的不流畅标注文本,体现了其在处理不完美提示时的鲁棒性。
链接: https://arxiv.org/abs/2506.18510
作者: Duygu Altinok
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to INTERSPEECH2025 workshop DISS2025
Abstract:Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models – all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.
zh
[NLP-35] Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths Weaknesses and Domain-Specific Performance
【速读】: 该论文试图解决如何评估大型语言模型(Large Language Models, LLMs)在不同自然语言处理(Natural Language Processing, NLP)任务中的性能问题,以明确其优势、劣势及领域特定能力。解决方案的关键在于设计一个结构化的实验协议,通过使用相同且中性的提示词对ChatGPT和DeepSeek进行测试,并在每个任务上使用两个基准数据集进行评估,从而确保实验的公平性和结果的可比性。
链接: https://arxiv.org/abs/2506.18501
作者: Wael Etaiwi,Bushra Alhijawi
机构: Princess Sumaya University for Technology (公主苏姆亚技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
zh
[NLP-36] AI-Generated Song Detection via Lyrics Transcripts
【速读】: 该论文试图解决AI生成音乐内容在实际应用中的检测难题,特别是在缺乏完美对齐歌词的情况下,传统基于音频的检测方法存在泛化能力差和鲁棒性不足的问题。解决方案的关键在于利用通用自动语音识别(ASR)模型对歌曲进行转录,从而获得可用的歌词信息,并结合高效的文本嵌入方法(如Whisper large-v2和LLM2Vec)进行检测,该方法在多种语言和音乐风格下均表现出较强的检测性能,并且在音频被扰动或使用不同生成器时具有更高的鲁棒性。
链接: https://arxiv.org/abs/2506.18488
作者: Markus Frohmann,Elena V. Epure,Gabriel Meseguer-Brocal,Markus Schedl,Romain Hennequin
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ISMIR 2025
Abstract:The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose solving this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at this https URL.
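A rough sketch of the pipeline, using whisper's official Python package for transcription and a generic sentence embedder plus logistic regression standing in for the paper's LLM2Vec embeddings and detectors:

```python
import whisper
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

asr = whisper.load_model("large-v2")                # ASR backbone named in the paper
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for LLM2Vec embeddings

def lyrics_features(audio_paths):
    # Transcribe each song; the resulting "lyrics" are deliberately imperfect.
    texts = [asr.transcribe(path)["text"] for path in audio_paths]
    return embedder.encode(texts)

# Toy detector on top of the transcript embeddings (1 = AI-generated):
# clf = LogisticRegression().fit(lyrics_features(train_paths), train_labels)
# preds = clf.predict(lyrics_features(test_paths))
```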
zh
[NLP-37] MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models
【速读】: 该论文试图解决如何将强化学习(Reinforcement Learning, RL)与大语言模型(Large Language Models, LLMs)的上下文学习能力有效结合,以提升其推理能力的问题。现有RLVR方法忽视了LLMs在上下文学习方面的能力,而这一能力在Chain-of-Thought (CoT)提示中已被证明具有显著效果。解决方案的关键在于提出Motivation-enhanced Reinforcement Finetuning (MeRF),通过在提示中直接注入奖励规范,作为模型优化目标的上下文动机,从而引导模型在内在动机和外部奖励的共同作用下生成更优输出。
链接: https://arxiv.org/abs/2506.18485
作者: Junjie Zhang,Guozheng Ma,Shunyu Liu,Haoyu Wang,Jiaxing Huang,Ting-En Lin,Fei Huang,Yongbin Li,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学); Tongyi Lab (通义实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle complex reasoning tasks. However, existing RLVR methods overlook one of the most distinctive capabilities of LLMs, their in-context learning ability, as prominently demonstrated by the success of Chain-of-Thought (CoT) prompting. This motivates us to explore how reinforcement learning can be effectively combined with in-context learning to better improve the reasoning capabilities of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method enhancing reinforcement learning of LLMs by "telling LLMs the rules of the game". Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to improve its responses with awareness of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations on the Knights and Knaves (KK) logic puzzle reasoning benchmark demonstrate that MeRF achieves substantial performance gains over baselines. Moreover, ablation studies show that performance improves with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement learning.
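A minimal sketch of the core trick, injecting a reward specification into the prompt; the specification text below is invented for illustration and is not the paper's actual template.

```python
def merf_prompt(question: str, reward_spec: str) -> str:
    """Prepend the reward specification so the model 'knows the rules of
    the game' during RL finetuning; wording is illustrative."""
    return (
        "You will be scored as follows:\n"
        f"{reward_spec}\n\n"
        "Maximize your score while answering the question.\n"
        f"Question: {question}\n"
    )

spec = ("+1 if the final answer is correct and given on the last line, "
        "0 otherwise; responses longer than 1024 tokens are truncated.")
print(merf_prompt("A says B lies; B says A and B are both knights. Who lies?", spec))
```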
zh
[NLP-38] TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理表格结构化数据时面临的挑战,特别是缺乏一个能够公平反映LLMs在广泛表格推理能力上的有效评估基准。解决方案的关键在于提出一个全面的表格推理进化基准TReB,该基准涵盖26个子任务,用于衡量浅层表格理解和深层表格推理能力,并通过迭代数据处理流程构建高质量数据集,同时设计了三种不同的推理模式(TCoT、PoT和ICoT)的评估框架,以稳健地测量表格推理能力。
链接: https://arxiv.org/abs/2506.18421
作者: Ce Li,Xiaofan Liu,Zhiyan Song,Ce Chi,Chen Zhao,Jingjing Yang,Zhendong Wang,Kexin Yang,Boshen Shi,Xing Wang,Chao Deng,Junlan Feng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Benchmark report v1.0
Abstract:The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs on broad table reasoning abilities. In this paper, we fill this gap, presenting a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities, a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and prove its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on [HuggingFace] and the framework on [GitHub].
zh
[NLP-39] Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
【速读】: 该论文试图解决阿拉伯语等形态丰富的语言在自然语言处理任务中因拼写歧义和标准不一致导致的词形还原(Lemmatization)问题。其解决方案的关键在于将词形还原建模为对词形-词性-释义(Lemma-POS-Gloss, LPG)标签集的分类任务,并利用机器翻译和语义聚类技术进行优化。此外,研究还构建了一个覆盖多种语域的标准化阿拉伯语词形还原测试集,以推动该领域的基准提升。
链接: https://arxiv.org/abs/2506.18399
作者: Mostafa Saeed,Nizar Habash
机构: New York University Abu Dhabi (纽约大学阿布扎比分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.
zh
[NLP-40] Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics SIGIR2025
【速读】: 该论文试图解决如何准确评估自动生成诊断报告中因果解释质量的问题,其核心在于比较不同评价指标(如BERTScore、余弦相似度、BioSentVec、GPT-White、GPT-Black及专家定性评估)在两种输入类型(基于观察和基于选择题的报告生成)下的表现。解决方案的关键在于通过两种加权策略(任务特定优先级与等权重)分析各指标的判别能力,结果表明基于大语言模型(LLM)的评估方法(如GPT-Black和GPT-White)在捕捉逻辑连贯性和临床有效性方面优于基于相似度的指标。
链接: https://arxiv.org/abs/2506.18387
作者: Yousang Cho,Key-Sun Choi
机构: Konyang University (高阳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, presented at LLM4Eval Workshop, SIGIR 2025 Padova, Italy, July 17, 2025
Abstract:This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics: BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of LLM-based evaluation for tasks requiring interpretability and causal reasoning.
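The two weighting strategies reduce to a weighted average over per-metric scores; a small sketch under assumed, purely illustrative scores and weights:

```python
import numpy as np

METRICS = ["BERTScore", "Cosine", "BioSentVec", "GPT-White", "GPT-Black", "Expert"]

def combine(scores, weights=None) -> float:
    """Weighted average of per-metric scores for one generated report.
    weights=None reproduces the equal-weight strategy; a task-specific
    vector implements the priority-weighted one."""
    s = np.asarray(scores, dtype=float)
    w = np.ones_like(s) if weights is None else np.asarray(weights, dtype=float)
    return float(w @ s / w.sum())

scores = [0.71, 0.64, 0.58, 0.80, 0.86, 0.75]   # illustrative values only
task_weights = [1, 1, 1, 2, 3, 3]               # favor LLM and expert judges
print(combine(scores), combine(scores, task_weights))
```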
zh
[NLP-41] SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
【速读】: 该论文试图解决大规模Mixture of Experts (MoE)模型在资源受限环境中进行微调或部署时面临的高昂计算和存储成本问题。其解决方案的关键在于提出SlimMoE,一个分阶段压缩框架,通过系统性地减少专家参数数量并利用中间阶段进行知识迁移,有效缓解了一次性剪枝方法中常见的性能下降问题,从而将大型MoE模型转化为更小、高效的变体。
链接: https://arxiv.org/abs/2506.18349
作者: Zichong Li,Chen Liang,Zixuan Zhang,Ilgee Hong,Young Jin Kim,Weizhu Chen,Tuo Zhao
机构: Georgia Tech (佐治亚理工学院); Microsoft (微软)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters) using only 400B tokens, less than 10% of the original model’s training data. These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models. For instance, Phi-mini-MoE achieves performance similar to or better than Phi-3-mini using only 2/3 of the activated parameters and yields comparable MMLU scores to Llama 3.1 8B despite having significantly lower latency. Our findings demonstrate that structured pruning combined with staged distillation offers an effective path to creating high-quality, compact MoE models, paving the way for broader adoption of MoE architectures. We make our models publicly available at this https URL and this https URL.
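Expert slimming can be pictured as structured pruning of each expert's FFN. The sketch below keeps the highest-norm hidden units of one expert; it is a one-shot simplification of the paper's multi-stage procedure, with the distillation between stages omitted.

```python
import torch

def slim_expert(w_in: torch.Tensor, w_out: torch.Tensor, keep: int):
    """Keep the `keep` hidden units of one expert FFN with the largest
    combined weight norm.

    w_in:  [hidden, d_model] up-projection
    w_out: [d_model, hidden] down-projection
    """
    importance = w_in.norm(dim=1) * w_out.norm(dim=0)  # one score per hidden unit
    idx = importance.topk(keep).indices
    return w_in[idx], w_out[:, idx]

w_in, w_out = torch.randn(4096, 1024), torch.randn(1024, 4096)
w_in_s, w_out_s = slim_expert(w_in, w_out, keep=1024)  # a 4x slimmer expert
print(w_in_s.shape, w_out_s.shape)
```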
zh
[NLP-42] Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLM s
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在测试阶段扩展时面临的数据和推理效率问题。其核心挑战在于如何在有限的数据和计算资源下保持模型的性能。论文提出的解决方案关键在于引入一种名为L2多语言统一学习的方法,并结合解码干预策略,通过利用多语言推理的多样性来提升模型性能与效率。该方法认为不同语言的推理过程可能相互促进,从而在少量数据情况下显著增强模型的推理能力,同时减少所需的训练数据量和推理时的token数量。
链接: https://arxiv.org/abs/2506.18341
作者: Kang Chen,Mengdi Zhang,Yixin Cao
机构: Fudan University (复旦大学); Meituan Group (美团集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper explores the challenges of test-time scaling of large language models (LLMs), regarding both data and inference efficiency. We highlight the diversity of multi-lingual reasoning based on our pilot studies, and then introduce a novel approach, L² multi-lingual unification learning with a decoding intervention strategy for further investigation. The basic idea of L² is that the reasoning process varies across different languages, which may be mutually beneficial to enhance both model performance and efficiency. Specifically, there are two types of multi-lingual data: the entire long chain-of-thought annotations in different languages and the step-wise mixture of languages. By further tuning based on them, we show that even small amounts of data can significantly improve reasoning capabilities. Our findings suggest that multilingual learning reduces both the required data and the number of inference tokens while maintaining a comparable performance. Furthermore, L² is orthogonal to other data-efficient methods. Thus, we also emphasize the importance of diverse data selection. The L² method offers a promising solution to the challenges of data collection and test-time compute efficiency in LLMs.
zh
[NLP-43] TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance
【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)后期编辑与研究数据收集过程中存在的低效、脱节的工作流程问题。其解决方案的关键在于提出了一种集成框架——TranslationCorrect,该框架将MT生成、基于NLLB等模型的自动化错误预测以及直观的后期编辑界面整合至统一环境中,并遵循人机交互(Human-Computer Interaction, HCI)原则以降低认知负荷。此外,TranslationCorrect能够输出符合Error Span Annotation (ESA)格式的高质量跨度标注数据,支持当前最先进的错误检测模型,从而提升翻译效率和用户满意度。
链接: https://arxiv.org/abs/2506.18337
作者: Syed Mekael Wasti,Shou-Yi Hung,Christopher Collins,En-Shiun Annie Lee
机构: Ontario Tech University (安大略理工大学); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Machine translation (MT) post-editing and research data collection often rely on inefficient, disconnected workflows. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. TranslationCorrect combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment. The interface is built with human-computer interaction (HCI) principles in mind to minimize cognitive load, as confirmed by a user study. For translators, it enables efficient error correction and batch translation. For researchers, TranslationCorrect exports high-quality span-based annotations in the Error Span Annotation (ESA) format, using an error taxonomy inspired by Multidimensional Quality Metrics (MQM). These outputs are compatible with state-of-the-art error detection models and suitable for training MT or post-editing systems. Our user study confirms that TranslationCorrect significantly improves translation efficiency and user satisfaction over traditional annotation methods.
zh
[NLP-44] Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning
【速读】: 该论文旨在解决如何在低成本条件下构建具有强大推理能力的专用领域模型的问题,特别是针对中国K-12阶段数学教育的需求。其解决方案的关键在于通过大规模强化学习(Reinforcement Learning, RL)进行后训练,并结合三项技术创新:目标熵正则化、近期样本恢复和策略特定难度加权,这些方法共同提升了强化学习训练的稳定性、数据效率和模型性能。
链接: https://arxiv.org/abs/2506.18330
作者: Lixin Wu,Na Cai,Qiao Cheng,Jiachen Wang,Yitao Duan
机构: NetEase Youdao (网易有道)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We introduce Confucius3-Math, an open-source large language model with 14B parameters that (1) runs efficiently on a single consumer-grade GPU; (2) achieves SOTA performance on a range of mathematical reasoning tasks, outperforming many models with significantly larger sizes. In particular, as part of our mission to enhance education and knowledge dissemination with AI, Confucius3-Math is specifically committed to mathematics learning for Chinese K-12 students and educators. Built via post-training with large-scale reinforcement learning (RL), Confucius3-Math aligns with the national curriculum and excels at solving mainstream Chinese K-12 mathematical problems at low cost. In this report, we share our development recipe, the challenges we encounter and the techniques we develop to overcome them. In particular, we introduce three technical innovations: Targeted Entropy Regularization, Recent Sample Recovery and Policy-Specific Hardness Weighting. These innovations encompass a new entropy regularization, a novel data scheduling policy, and an improved group-relative advantage estimator. Collectively, they significantly stabilize the RL training, improve data efficiency, and boost performance. Our work demonstrates the feasibility of building strong reasoning models in a particular domain at low cost. We open-source our model and code at this https URL.
zh
[NLP-45] Enhancing Entity Aware Machine Translation with Multi-task Learning
【速读】: 该论文旨在解决实体感知机器翻译(Entity-aware machine translation, EAMT)中存在的翻译数据不足以及上下文处理复杂性问题。其解决方案的关键在于采用多任务学习方法,通过优化命名实体识别和机器翻译两个子任务,提升实体感知机器翻译的整体性能。
链接: https://arxiv.org/abs/2506.18318
作者: An Trieu,Phuong Nguyen,Minh Le Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注: In the Proceedings of SCIDOCA 2025
Abstract:Entity-aware machine translation (EAMT) is a complicated task in natural language processing, due not only to the shortage of translation data for the entities to be translated but also to the complexity of the context that must be processed while translating them. In this paper, we propose a method that applies multi-task learning to optimize the performance of the two subtasks, named entity recognition and machine translation, which improves the final performance of the entity-aware machine translation task. Results and analysis are reported on the dataset provided by the organizers of Task 2 of the SemEval 2025 competition.
zh
[NLP-46] Team LA at SCIDOCA shared task 2025: Citation Discovery via relation-based zero-shot retrieval
【速读】: 该论文试图解决在给定段落中从候选集合中预测正确引用文献的问题(Citation Prediction),其主要挑战在于摘要段落的长度以及候选摘要之间的高度相似性,这使得确定正确的引用目标变得困难。解决方案的关键在于首先基于从给定段落中提取的关系特征检索出前k个最相似的摘要,随后利用大型语言模型(Large Language Model, LLM)精确识别最相关的引用文献。
链接: https://arxiv.org/abs/2506.18316
作者: Trieu An,Long Nguyen,Minh Le Nguyen
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: In the Proceedings of SCIDOCA 2025
Abstract:The Citation Discovery Shared Task focuses on predicting the correct citation from a given candidate pool for a given paragraph. The main challenges stem from the length of the abstract paragraphs and the high similarity among candidate abstracts, making it difficult to determine the exact paper to cite. To address this, we develop a system that first retrieves the top-k most similar abstracts based on extracted relational features from the given paragraph. From this subset, we leverage a Large Language Model (LLM) to accurately identify the most relevant citation. We evaluate our framework on the training dataset provided by the SCIDOCA 2025 organizers, demonstrating its effectiveness in citation prediction.
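A minimal two-stage sketch: dense retrieval of the top-k candidate abstracts, followed by a prompt handed to an LLM for the final pick. The embedder is a generic stand-in (the paper retrieves using extracted relational features), and the chat call itself is left as a placeholder rather than a specific API.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # generic retriever stand-in

def top_k_candidates(paragraph: str, abstracts: list, k: int = 5):
    """Stage 1: narrow the pool to the k most similar candidate abstracts."""
    q = embedder.encode(paragraph, convert_to_tensor=True)
    d = embedder.encode(abstracts, convert_to_tensor=True)
    return util.cos_sim(q, d)[0].topk(k).indices.tolist()

def rerank_prompt(paragraph: str, candidates: list) -> str:
    """Stage 2: prompt handed to whatever chat LLM is available."""
    listing = "\n".join(f"[{i}] {a}" for i, a in enumerate(candidates))
    return (f"Paragraph:\n{paragraph}\n\nCandidate abstracts:\n{listing}\n\n"
            "Reply with the index of the abstract this paragraph most likely cites.")
```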
zh
[NLP-47] Enhancing Document Retrieval in COVID-19 Research: Leverag ing Large Language Models for Hidden Relation Extraction
【速读】: 该论文试图解决在突发性疫情(如COVID-19)背景下,面对海量文献时如何高效检索出高质量信息的问题。其解决方案的关键在于利用生成式 AI (Generative AI) 的能力,从未标注的文献中提取当前解析工具无法识别的隐含关系,从而提升检索系统的信息质量与效果。
链接: https://arxiv.org/abs/2506.18311
作者: Hoang-An Trieu,Dinh-Truong Do,Chau Nguyen,Vu Tran,Minh Le Nguyen
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: In the Proceedings of SCIDOCA 2024
Abstract:In recent years, with the appearance of the COVID-19 pandemic, numerous publications relevant to this disease have been issued. Because of the massive volume of publications, an efficient retrieval system is necessary to provide researchers with useful information if an unexpected pandemic happens so suddenly, like COVID-19. In this work, we present a method to help the retrieval system, the Covrelex-SE system, to provide more high-quality search results. We exploited the power of the large language models (LLMs) to extract the hidden relationships inside the unlabeled publication that cannot be found by the current parsing tools that the system is using. Since then, help the system to have more useful information during retrieval progress.
zh
[NLP-48] RLPR: Extrapolating RLVR to General Domains without Verifiers
【速读】: 该论文试图解决生成式 AI (Generative AI) 在强化学习中依赖领域特定验证器导致的复杂性和可扩展性受限的问题。其解决方案的关键在于提出一种无需验证器的框架 RLPR,该框架利用 LLM 自身对参考答案的标记概率作为奖励信号,并通过 prob-to-reward 和稳定化方法降低噪声概率奖励的高方差,从而实现对更广泛通用领域推理能力的有效提升。
链接: https://arxiv.org/abs/2506.18254
作者: Tianyu Yu,Bo Ji,Shouli Wang,Shu Yao,Zefan Wang,Ganqu Cui,Lifan Yuan,Ning Ding,Yuan Yao,Zhiyuan Liu,Maosong Sun,Tat-Seng Chua
机构: Tsinghua University (清华大学); National University of Singapore (新加坡国立大学); Shanghai Qi Zhi Institute (上海奇智学院); Harbin Institute of Technology (哈尔滨工业大学); Beijing University of Posts and Telecommunications (北京邮电大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Website: this https URL
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM’s intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM’s own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.
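The core reward is simply the policy's own probability of the reference answer. A minimal sketch with Hugging Face transformers follows; the model name is illustrative, and RLPR's prob-to-reward debiasing and stabilization steps are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"                # any causal LM; name is illustrative
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

@torch.no_grad()
def prob_reward(prompt: str, reference: str) -> float:
    """Mean probability the model assigns to the reference-answer tokens,
    i.e. the raw verifier-free reward before any stabilization."""
    p_ids = tok(prompt, return_tensors="pt").input_ids
    r_ids = tok(reference, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([p_ids, r_ids], dim=1)
    logp = lm(ids).logits[:, :-1].log_softmax(-1)       # position t predicts t+1
    ref_logp = logp[0, p_ids.shape[1] - 1 :].gather(-1, r_ids[0].unsqueeze(-1))
    return ref_logp.exp().mean().item()

print(prob_reward("Q: 2+2=? A:", " 4"))
```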
zh
[NLP-49] AdapThink: Adaptive Thinking Preferences for Reasoning Language Model
【速读】: 该论文旨在解决基于强化学习(Reinforcement Learning, RL)的后训练方法在语言模型复杂推理能力提升过程中存在的推理效率问题,即模型可能在简单问题上消耗过多计算资源,并在复杂问题上过早进行推理。其解决方案的关键在于提出AdapThink框架,该框架通过两个核心机制实现自适应推理:一是基于模型置信度和响应特征的组内相对奖励函数,动态调整与反思相关的过渡词偏好;二是基于多样性感知的采样机制,通过熵引导得分平衡训练组的解题准确率与推理多样性。
链接: https://arxiv.org/abs/2506.18237
作者: Xu Wan,Wei Wang,Wenyue Xu,Wotao Yin,Jie Song,Mingyang Sun
机构: Zhejiang University (浙江大学); Alibaba DAMO Academy (阿里巴巴达摩院); Tongji University (同济大学); Peking University (北京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning (RL)-based post-training has significantly advanced the complex reasoning capabilities of language models, fostering sophisticated self-reflection processes. However, this "slow thinking" paradigm presents a critical challenge to reasoning efficiency: models may expend excessive computation on simple questions and shift reasoning prematurely for complex ones. Previous mechanisms typically rely on static length budgets or predefined rules, lacking the adaptability for varying question complexities and models' evolving capabilities. To this end, we propose AdapThink, an adaptive post-training framework designed to induce more efficient thinking while maintaining the performance of reasoning language models. Specifically, AdapThink incorporates two key mechanisms: 1) A group-relative reward function that leverages model confidence and response characteristics to dynamically adjust the preference for reflection-related transition words without resorting to a fixed length preference. 2) A diversity-aware sampling mechanism that balances the training group's solution accuracy with reasoning diversity via an entropy-guided score. Experiments on several mathematical reasoning datasets with DeepSeek-distilled models demonstrate AdapThink's advantages in enabling adaptive reasoning patterns and mitigating inefficiencies.
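As a loose illustration of the entropy-guided score (not the paper's exact formula), one can balance a training group's accuracy against the diversity of its distinct reasoning patterns:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def group_score(correct, pattern_counts, lam=0.5):
    """Balance a group's solution accuracy against the diversity of its
    distinct reasoning patterns; a loose reading, not the paper's formula."""
    acc = float(np.mean(correct))
    probs = np.asarray(pattern_counts, dtype=float)
    return acc + lam * entropy(probs / probs.sum())

# Four sampled responses, three distinct reasoning patterns among them:
print(group_score(correct=[1, 1, 0, 1], pattern_counts=[2, 1, 1]))
```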
zh
[NLP-50] Shrinking the Generation-Verification Gap with Weak Verifiers
【速读】: 该论文试图解决现有验证器(verifier)在语言模型能力提升中的局限性,即高精度验证器难以扩展(如人类)或功能受限(如Lean等工具),而现有的语言模型法官和奖励模型虽然广泛适用,但与理想验证器(oracle verifier)之间仍存在显著性能差距。解决方案的关键在于提出Weaver框架,通过结合多个弱验证器来设计一个强验证器,利用加权集成方法提升性能,并通过弱监督技术减少对标注数据的依赖,同时通过数据集统计规范输出并过滤低质量验证器,从而更准确地反映响应质量。
链接: https://arxiv.org/abs/2506.18203
作者: Jon Saad-Falcon,E. Kelly Buchanan,Mayee F. Chen,Tzu-Heng Huang,Brendan McLaughlin,Tanvir Bhathal,Shang Zhu,Ben Athiwaratkun,Frederic Sala,Scott Linderman,Azalia Mirhoseini,Christopher Ré
机构: Stanford University (斯坦福大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Together AI (Together AI)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver’s effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce the computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver’s combined output scores.
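The basic aggregation step reduces to an accuracy-weighted combination of verifier scores. A toy sketch, with the weak-supervision accuracy estimation replaced by given values:

```python
import numpy as np

def weaver_score(verifier_scores, est_accuracy):
    """verifier_scores: [num_verifiers, num_candidates] normalized scores;
    est_accuracy: per-verifier accuracy (estimated via weak supervision in
    the paper, given directly here). Verifiers at or below chance get
    weight zero."""
    w = np.clip(np.asarray(est_accuracy) - 0.5, 0.0, None)
    return (w / w.sum()) @ np.asarray(verifier_scores)

scores = np.array([[0.9, 0.2, 0.6],
                   [0.7, 0.4, 0.8],
                   [0.5, 0.5, 0.5]])      # third verifier is uninformative
acc = np.array([0.8, 0.7, 0.5])
best = int(np.argmax(weaver_score(scores, acc)))   # candidate picked at test time
print(best)
```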
zh
[NLP-51] Deciphering Emotions in Children's Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications
【速读】: 该论文试图解决多模态人工智能系统在阿拉伯语语境下情感识别能力不足的问题,尤其是在开发文化敏感的教育技术方面存在的研究空白。解决方案的关键在于评估两种先进的多模态大语言模型(GPT-4o 和 Gemini 1.5 Pro)在处理阿拉伯儿童绘本图像时的情感识别性能,并通过三种提示策略(零样本、少样本和思维链)进行比较分析。研究结果表明,GPT-4o 在所有条件下均优于 Gemini 1.5 Pro,尤其是在使用思维链提示策略时表现最佳,但同时也揭示了当前模型在文化内涵情感和模糊叙事情境中的局限性。
链接: https://arxiv.org/abs/2506.18201
作者: Bushra Asseri,Estabraq Abdelaziz,Maha Al Mogren,Tayef Alhefdhi,Areej Al-Wabil
机构: Alfaisal University (阿尔法伊萨尔大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Emotion recognition capabilities in multimodal AI systems are crucial for developing culturally responsive educational technologies, yet remain underexplored for Arabic language contexts where culturally appropriate learning tools are critically needed. This study evaluates the emotion recognition performance of two advanced multimodal large language models, GPT-4o and Gemini 1.5 Pro, when processing Arabic children’s storybook illustrations. We assessed both models across three prompting strategies (zero-shot, few-shot, and chain-of-thought) using 75 images from seven Arabic storybooks, comparing model predictions with human annotations based on Plutchik’s emotional framework. GPT-4o consistently outperformed Gemini across all conditions, achieving the highest macro F1-score of 59% with chain-of-thought prompting compared to Gemini’s best performance of 43%. Error analysis revealed systematic misclassification patterns, with valence inversions accounting for 60.7% of errors, while both models struggled with culturally nuanced emotions and ambiguous narrative contexts. These findings highlight fundamental limitations in current models’ cultural understanding and emphasize the need for culturally sensitive training approaches to develop effective emotion-aware educational technologies for Arabic-speaking learners.
zh
[NLP-52] Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models : A Systematic Review
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在文化偏见的问题,特别是针对阿拉伯人和穆斯林群体的偏见,这种偏见可能加剧有害的刻板印象和边缘化现象。其解决方案的关键在于通过提示工程(prompt engineering)策略来减轻这些偏见,研究识别出五种主要方法:文化提示、情感预处理、自我去偏技术、结构化多步骤流程以及参数优化的连续提示。其中,结构化多步骤流程在减少偏见方面效果最佳,而文化提示则具有较高的可及性,表明提示工程在不依赖模型参数的情况下可以有效缓解文化偏见。
链接: https://arxiv.org/abs/2506.18199
作者: Bushra Asseri,Estabrag Abdelaziz,Areej Al-Wabil
机构: alfaisal.edu(阿尔法伊萨尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias - particularly towards Arabs and Muslims - pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham’s systematic review methodology, we analyzed 8 empirical studies published between 2021-2024 investigating bias mitigation strategies. Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.
zh
[NLP-53] CareLab at #SMM4H-HeaRD 2025: Insomnia Detection and Food Safety Event Extraction with Domain-Aware Transformers ALT AAAI
【速读】: 该论文旨在解决临床笔记中失眠提及的检测(Task 4)以及新闻文章中食品安全事件的提取(Task 5)问题。其关键解决方案是采用基于编码器的模型(如RoBERTa)进行核心建模,并结合GPT-4进行数据增强,从而在Task 5 Subtask 1中取得了F1得分为0.958的优异成绩,位居榜首。
链接: https://arxiv.org/abs/2506.18185
作者: Zihan Liang,Ziwen Pan,Sumon Kanti Dey,Azra Ismail
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In the Proceedings of the 10th Social Media Mining for Health and Health Real-World Data Workshop and Shared Tasks, co-located with AAAI ICWSM 2025
Abstract:This paper presents our system for the SMM4H-HeaRD 2025 shared tasks, specifically Task 4 (Subtasks 1, 2a, and 2b) and Task 5 (Subtasks 1 and 2). Task 4 focused on detecting mentions of insomnia in clinical notes, while Task 5 addressed the extraction of food safety events from news articles. We participated in all subtasks and report key findings across them, with particular emphasis on Task 5 Subtask 1, where our system achieved strong performance, securing first place with an F1 score of 0.958 on the test set. To attain this result, we employed encoder-based models (e.g., RoBERTa), alongside GPT-4 for data augmentation. This paper outlines our approach, including preprocessing, model architecture, and subtask-specific adaptations.
zh
[NLP-54] Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?
【速读】: 该论文旨在解决推理语言模型在实际应用中因过度自信而产生的幻觉问题,即模型生成看似合理但不正确的响应。解决方案的关键在于探索推理模型的不确定性量化(Uncertainty Quantification, UQ),通过引入反思性不确定性量化方法,使模型能够显式地对其思维链轨迹进行再推理以提升校准度。研究发现,当前SOTA推理模型通常存在过度自信现象,且更深层次的推理会加剧这一问题,而通过反思性机制部分模型可实现更好的校准,但效果并不一致。
链接: https://arxiv.org/abs/2506.18183
作者: Zhiting Mei,Christina Zhang,Tenny Yin,Justin Lidard,Ola Shorinwa,Anirudha Majumdar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
zh
[NLP-55] QuranMorph: Morphologically Annotated Quranic Corpus
【速读】: 该论文旨在为《古兰经》提供一个经过形态学标注的语料库,以支持阿拉伯语的自然语言处理研究。解决方案的关键在于通过三位专家语言学家对语料库中的每个词元进行手动词形还原(lemmatization)和词性标注(part-of-speech tagging),并利用Qabas阿拉伯语词典数据库中的词元以及细粒度的SAMA/Qabas词性标签集,确保标注的准确性和一致性。
链接: https://arxiv.org/abs/2506.18148
作者: Diyam Akra,Tymaa Hammouda,Mustafa Jarrar
机构: Birzeit University (比尔泽特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in the QuranMorph was manually lemmatized and tagged with its part-of-speech by three expert linguists. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at (this https URL)
zh
[NLP-56] Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models
【速读】: 该论文试图解决如何识别大型语言模型(Large Language Models, LLMs)中语义一致且上下文一致的网络组件的问题,以实现对模型内部知识结构的深入理解与高效操控。其解决方案的关键在于利用少量提示词收集的稀疏自编码器(Sparse Autoencoder, SAE)特征的共激活模式,从而识别出与国家和关系相关的语义组件,并通过消融实验和增强实验验证这些组件对模型输出的影响。研究揭示了LLMs中知识的模块化组织结构,并为实现精准的模型调控提供了新方法。
链接: https://arxiv.org/abs/2506.18141
作者: Ruixuan Deng,Xiaoyang Hu,Miles Gilberti,Shane Storks,Aman Taxali,Mike Angstadt,Chandra Sripada,Joyce Chai
机构: Georgia Institute of Technology (佐治亚理工学院); Brown University (布朗大学); University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on country-relation tasks, we show that ablating semantic components for countries and relations changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and country components yields compound counterfactual outputs. We find that, whereas most country components emerge from the very first layer, the more abstract relation components are concentrated in later layers. Furthermore, within relation components themselves, nodes from later layers tend to have a stronger causal impact on model outputs. Overall, these findings suggest a modular organization of knowledge within LLMs and advance methods for efficient, targeted model manipulation.
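A minimal sketch of the two basic operations: computing a feature coactivation matrix from a few prompts' SAE activations, and ablating a candidate component. Thresholds, shapes, and the random stand-in data are assumptions.

```python
import numpy as np

def coactivation(F, thresh=0.0):
    """F: [num_tokens, num_features] SAE activations from a handful of
    prompts. Returns the fraction of tokens on which each pair of features
    fires together; clustering this matrix suggests candidate components."""
    A = (np.asarray(F) > thresh).astype(float)
    return (A.T @ A) / len(A)

def ablate(F, component):
    """Zero a component's features before decoding back to the residual stream."""
    F = np.array(F, copy=True)
    F[:, component] = 0.0
    return F

F = np.abs(np.random.randn(200, 50))   # stand-in activations
C = coactivation(F, thresh=1.0)
print(C.shape)                          # (50, 50)
```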
zh
[NLP-57] SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging IJCNN2025
【速读】: 该论文试图解决模型合并(model merging)中多任务能力的形成机制不明确的问题,以及如何在不进行额外训练的情况下提升合并模型的任务特定专业知识。其解决方案的关键在于揭示模型合并通过两个核心能力实现多任务能力:一是区分不同任务的样本,二是为每个样本适配相应的专家模型。基于此,作者提出了SE-Merging框架,该框架利用上述特性动态识别样本对应的任务,并自适应地调整合并系数以增强任务特定的专业知识。
链接: https://arxiv.org/abs/2506.18135
作者: Zijun Chen,Zhanpeng Zhou,Bo Zhang,Weinan Zhang,Xi Sun,Junchi Yan
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); MetaLight HK Limited (MetaLight香港有限公司); Shanghai-Chongqing Institute of Artificial Intelligence (上海-重庆人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint, accepted at IJCNN2025
Abstract:Model merging has gained increasing attention due to its intriguing property: interpolating the parameters of different task-specific fine-tuned models leads to multi-task abilities. However, despite its empirical success, the underlying mechanisms of model merging remain poorly understood. In this work, we delve into the mechanism behind model merging from a representation perspective. Our analysis reveals that model merging achieves multi-task abilities through two key capabilities: i) distinguishing samples from different tasks, and ii) adapting to the corresponding expert model for each sample. These two capabilities allow the merged model to retain task-specific expertise, enabling efficient multi-task adaptation. Building on these insights, we propose SE-Merging, a self-enhanced model merging framework that leverages these two characteristics to dynamically identify the corresponding task for each sample and then adaptively rescales the merging coefficients to further enhance task-specific expertise in the merged model. Notably, SE-Merging achieves dynamic model merging without additional training. Extensive experiments demonstrate that SE-Merging achieves significant performance improvements while remaining compatible with existing model merging techniques.
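A minimal reading of the two capabilities in code: route each sample by similarity to per-task prototypes, then rescale the experts' task vectors accordingly. All names, shapes, and the prototype construction are assumptions made for illustration.

```python
import torch

def se_merge(sample_emb, task_protos, task_vectors, base_params, temp=1.0):
    """Per-sample dynamic merging: score the sample against one prototype
    per task, turn the similarities into merging coefficients, and rescale
    each expert's task vector (fine-tuned minus base parameters)."""
    sims = torch.stack([torch.cosine_similarity(sample_emb, p, dim=0)
                        for p in task_protos])
    coeffs = torch.softmax(sims / temp, dim=0)   # adaptive, no extra training
    merged = base_params.clone()
    for c, delta in zip(coeffs, task_vectors):
        merged = merged + c * delta
    return merged                                 # parameters for this sample only

d = 32                                            # toy flattened-parameter dim
base = torch.zeros(d)
deltas = [torch.randn(d) for _ in range(3)]       # three task vectors
protos = [torch.randn(8) for _ in range(3)]       # one prototype per task
print(se_merge(torch.randn(8), protos, deltas, base).shape)
```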
zh
[NLP-58] ϕ∞: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models
【速读】: 该论文试图解决自回归Transformer语言模型中因破折号(em dash)标记引发的递归语义漂移问题,该问题导致从句边界幻觉和嵌入空间纠缠。解决方案的关键在于结合符号学从句净化方法(通过phi-infinity算子)与目标嵌入矩阵重新对齐,从而在不需模型微调的情况下完全抑制问题标记,同时通过固定点收敛保证维持语义连贯性。
链接: https://arxiv.org/abs/2506.18129
作者: Bugra Kilictas,Faruk Alpay
机构: Bahcesehir University (巴赫切谢希尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures
Abstract:We identify a critical vulnerability in autoregressive transformer language models where the em dash token induces recursive semantic drift, leading to clause boundary hallucination and embedding space entanglement. Through formal analysis of token-level perturbations in semantic lattices, we demonstrate that em dash insertion fundamentally alters the model’s latent representations, causing compounding errors in long-form generation. We propose a novel solution combining symbolic clause purification via the phi-infinity operator with targeted embedding matrix realignment. Our approach enables total suppression of problematic tokens without requiring model retraining, while preserving semantic coherence through fixed-point convergence guarantees. Experimental validation shows significant improvements in generation consistency and topic maintenance. This work establishes a general framework for identifying and mitigating token-level vulnerabilities in foundation models, with immediate implications for AI safety, model alignment, and robust deployment of large language models in production environments. The methodology extends beyond punctuation to address broader classes of recursive instabilities in neural text generation systems.
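The suppression-without-retraining aspect can be approximated at decoding time with standard transformers machinery. This sketch only bans em-dash tokens via `bad_words_ids`; it does not implement the paper's phi-infinity purification or embedding matrix realignment, and the model choice is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# Every token whose decoded surface form contains an em dash (U+2014).
banned = [[i] for i in range(len(tok)) if "\u2014" in tok.decode([i])]

out = lm.generate(**tok("The results were", return_tensors="pt"),
                  max_new_tokens=30,
                  bad_words_ids=banned,           # decoding-time suppression
                  pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```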
zh
[NLP-59] The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English LREC-COLING2024
【速读】: 该论文试图解决语法正确性(grammaticality)与可接受性(acceptability)在语言学和计算语言学研究中的界定与关联问题,其解决方案的关键在于构建一个大规模的公开可用的句法可接受性数据集(Syntactic Acceptability Dataset)。该数据集包含1,000个英语语料,涵盖教科书和《Linguistic Inquiry》期刊内容,并通过文献提取语法状态标签及通过众包获取母语者可接受性判断,从而提供高质量的双维度标注数据。此数据集为相关领域的研究提供了基础资源,并支持对语法与可接受性关系的深入分析。
链接: https://arxiv.org/abs/2506.18120
作者: Tom S Juzek
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted and published at LREC-COLING 2024. 8 pages, 3 figures. Licensed under CC BY-NC-SA 4.0
Abstract:We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure a representation of the contemporary discourse. Each entry is labeled with its grammatical status (“well-formedness” according to syntactic formalisms) extracted from the literature, as well as its acceptability status (“intuitive goodness” as determined by native speakers) obtained through crowdsourcing, with the highest experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: We observe that grammaticality and acceptability judgments converge in about 83% of the cases and that “in-betweenness” occurs frequently. This corroborates existing research. We also find that while machine learning models struggle with predicting grammaticality, they perform considerably better in predicting acceptability. This is a novel finding. Future work will focus on expanding the dataset.
zh
[NLP-60] Mental Health Equity in LLM s: Leverag ing Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在心理健康领域中可能传播的交叉性偏见问题,这些问题可能强化污名化并损害边缘化群体。解决方案的关键在于提出一种多跳问答(multi-hop question answering, MHQA)框架,用于探索LLMs在心理健康话语中的响应偏见,并通过系统化的标签方法分析年龄、种族、性别和社会经济地位等人口统计学因素的交叉影响。该方法相较于传统方法在检测偏见方面表现出更高的有效性,能够识别出偏见在序列推理过程中被放大的关键点。
链接: https://arxiv.org/abs/2506.18116
作者: Batool Haider,Atmika Gorti,Aman Chadha,Manas Gaur
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 19 Pages, 7 Figures, 4 Tables (Note: Under Review)
Abstract:Large Language Models (LLMs) in mental healthcare risk propagating biases that reinforce stigma and harm marginalized groups. While previous research identified concerning trends, systematic methods for detecting intersectional biases remain limited. This work introduces a multi-hop question answering (MHQA) framework to explore LLM response biases in mental health discourse. We analyze content from the Interpretable Mental Health Instruction (IMHI) dataset across symptom presentation, coping mechanisms, and treatment approaches. Using systematic tagging across age, race, gender, and socioeconomic status, we investigate bias patterns at demographic intersections. We evaluate four LLMs: Claude 3.5 Sonnet, Jamba 1.6, Gemma 3, and Llama 4, revealing systematic disparities across sentiment, demographics, and mental health conditions. Our MHQA approach demonstrates superior detection compared to conventional methods, identifying amplification points where biases magnify through sequential reasoning. We implement two debiasing techniques: Roleplay Simulation and Explicit Bias Reduction, achieving 66-94% bias reductions through few-shot prompting with BBQ dataset examples. These findings highlight critical areas where LLMs reproduce mental healthcare biases, providing actionable insights for equitable AI development.
zh
[NLP-61] Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
【速读】: 该论文旨在解决语言模型在理解和正确使用汉语成语(Chengyu)方面的挑战,尤其是成语的隐含意义和语境适配性问题。现有基准测试主要集中在狭窄任务上,如选择题填空、孤立翻译或简单改写,而未能全面评估模型对成语复杂性的理解能力。论文提出的解决方案是构建一个综合性基准测试——Chengyu-Bench,其关键在于设计三个任务:评价内涵(分类成语为积极或消极)、适用性(检测上下文中错误的成语使用)以及开放填空(在无选项的情况下填充长段落中的空白)。该基准包含2,937个经过人工验证的示例,覆盖1,765个常见成语,能够更全面地评估语言模型在成语理解与应用上的能力。
链接: https://arxiv.org/abs/2506.18105
作者: Yicheng Fu,Zhemin Huang,Liuxin Yang,Yumeng Lu,Zhongdongming Dai
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks - multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: this https URL.
zh
[NLP-62] InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating ACL2025
【速读】: 该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的辩论系统在应对具体论点时忽视客观评估指标(如事实真实性与逻辑有效性)以及缺乏跨多维优化(包括评估指标、链式思维推理和多轮辩论优化)的问题。其解决方案的关键在于提出一个双组件框架:(1) InspireScore,一种新型评估系统,通过融合四个主观标准(情感吸引力、论点清晰度、论点结构和主题相关性)与两个客观指标(事实真实性与逻辑有效性)建立多维评估架构;(2) InspireDebate,一种优化的辩论框架,采用分阶段优化方法,结合链式思维推理增强、多维直接偏好优化(DPO)以及基于网络检索增强生成(Web-RAG)的实时知识定位。
链接: https://arxiv.org/abs/2506.18102
作者: Fuyu Wang,Jiangtong Li,Kun Zhu,Changjun Jiang
机构: Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注: 20 pages; Accepted to ACL 2025 Main
Abstract:With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions - including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement - thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) InspireDebate, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that InspireScore achieves 44% higher correlation with expert judgments compared to existing methods, while InspireDebate shows significant improvements, outperforming baseline models by 57%. Source code is available at this https URL.
zh
[NLP-63] Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution
【速读】: 该论文旨在解决指代消解(anaphora resolution)问题,特别是在形态丰富的语言如捷克语中的自然语言理解任务。其解决方案的关键在于对比两种现代方法:基于大型语言模型(LLMs)的提示工程与针对捷克语指代消解任务微调的紧凑生成模型。研究通过在布拉格依存树库(Prague Dependency Treebank)基础上构建的数据集进行评估,结果显示,尽管提示工程在少量样本情况下表现出色(最高74.5%准确率),但微调后的模型,尤其是mT5-large,在准确率(最高88%)和计算资源消耗方面均优于提示工程方法。
链接: https://arxiv.org/abs/2506.18091
作者: Patrik Stano,Aleš Horák
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages
Abstract:Anaphora resolution plays a critical role in natural language understanding, especially in morphologically rich languages like Czech. This paper presents a comparative evaluation of two modern approaches to anaphora resolution on Czech text: prompt engineering with large language models (LLMs) and fine-tuning compact generative models. Using a dataset derived from the Prague Dependency Treebank, we evaluate several instruction-tuned LLMs, including Mistral Large 2 and Llama 3, using a series of prompt templates. We compare them against fine-tuned variants of the mT5 and Mistral models that we trained specifically for Czech anaphora resolution. Our experiments demonstrate that while prompting yields promising few-shot results (up to 74.5% accuracy), the fine-tuned models, particularly mT5-large, outperform them significantly, achieving up to 88% accuracy while requiring fewer computational resources. We analyze performance across different anaphora types, antecedent distances, and source corpora, highlighting key strengths and trade-offs of each approach.
zh
[NLP-64] RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
【速读】: 该论文旨在解决现有合成数据集在增强真实世界双臂机器人操作中的不足,具体表现为缺乏高效可扩展的数据生成方法以及过于简化的仿真环境无法体现现实复杂性。其解决方案的关键在于提出RoboTwin 2.0,一个可扩展的仿真框架,能够自动化生成多样化且逼真的数据,并提供统一的双臂操作评估协议。该框架通过构建大规模对象库、结合多模态大语言模型与仿真闭环优化的数据生成流水线,以及引入结构化领域随机化策略,显著提升了数据多样性与策略鲁棒性。
链接: https://arxiv.org/abs/2506.18088
作者: Tianxing Chen,Zanxin Chen,Baijun Chen,Zijian Cai,Yibin Liu,Qiwei Liang,Zixuan Li,Xianliang Lin,Yiheng Ge,Zhenyu Gu,Weiliang Deng,Yubin Guo,Tian Nian,Xuanbing Xie,Qiangyu Chen,Kailun Su,Tianling Xu,Guodong Liu,Mengkang Hu,Huan-ang Gao,Kaixuan Wang,Zhixuan Liang,Yusen Qin,Xiaokang Yang,Ping Luo,Yao Mu
机构: SJTU ScaleLab(上海交通大学规模实验室); HKU MMLab(香港大学多媒体实验室); Shanghai AI Lab(上海人工智能实验室); D-Robotics(D-机器人); SZU(深圳大学); THU(清华大学); TeleAI(电信人工智能); FDU(复旦大学); USTC(中国科学技术大学); SUSTech(南方科技大学); SYSU(中山大学); CSU(中南大学); NEU(东北大学); HKU-SH ICRC(香港大学-上海国际联合研究中心); NJU(南京大学); Lumina EAI(光启人工智能研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: Project Page: this https URL
Abstract:Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height and language instructions, thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on unseen scene real-world tasks, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
zh
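RoboTwin 2.0 的结构化领域随机化沿杂乱度、光照、背景、桌面高度与语言指令五个轴进行。下面给出一个纯 Python 的示意性配置采样草图:字段名称、取值范围与指令模板均为便于说明而假设的示例,并非论文原始实现。

```python
import random
from dataclasses import dataclass

@dataclass
class DomainRandomizationConfig:
    """沿五个维度采样的领域随机化配置(示意,字段与取值范围均为假设)。"""
    num_clutter_objects: int   # 杂乱度:桌面干扰物数量
    light_intensity: float     # 光照强度(相对值)
    background_id: int         # 背景纹理编号
    table_height_m: float      # 桌面高度(米)
    instruction: str           # 语言指令模板

def sample_config(rng: random.Random) -> DomainRandomizationConfig:
    templates = [
        "pick up the {obj} and place it in the box",
        "hand the {obj} to the other arm",
    ]
    obj = rng.choice(["mug", "bottle", "block"])
    return DomainRandomizationConfig(
        num_clutter_objects=rng.randint(0, 8),
        light_intensity=rng.uniform(0.4, 1.6),
        background_id=rng.randrange(20),
        table_height_m=rng.uniform(0.70, 0.85),
        instruction=rng.choice(templates).format(obj=obj),
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_config(rng))
```

每条专家轨迹在采集前抽取一份这样的配置,即可在数据层面覆盖五个随机化轴的组合空间。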
[NLP-65] Statistical Multicriteria Evaluation of LLM-Generated Text
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)生成文本质量评估中存在的核心问题,即现有评估方法通常依赖单一指标或简单聚合,无法准确捕捉文本在连贯性、多样性、流畅性等多维质量指标之间的复杂权衡。其解决方案的关键在于引入基于广义随机占优(Generalized Stochastic Dominance, GSD)的统计推断框架,该框架能够同时评估多个质量维度,并尊重不同指标的测量尺度,通过解码策略的部分序关系避免对指标进行任意加权,从而提供更具统计保障的评估结果。
链接: https://arxiv.org/abs/2506.18082
作者: Esteban Garces Arias,Hannah Blocher,Julian Rodemann,Matthias Aßenmacher,Christoph Jansen
机构: LMU Munich(慕尼黑路德维希-马克西米利安大学); Lancaster University Leipzig(兰卡斯特大学莱比锡校区); Munich Center for Machine Learning (MCML)(慕尼黑中心机器学习)
类目: Computation and Language (cs.CL); Applications (stat.AP)
备注:
Abstract:Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
zh
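GSD-front 建立在解码策略的偏序比较之上。作为直观说明,下面给出一个简化的一阶随机占优(FSD)经验检验草图:按分布整体比较两种解码策略在某一质量指标上的得分,而不对指标做任意加权。这只是示意,既未涉及论文的"广义"占优,也未做统计显著性检验;数据为随机生成的假设样本。

```python
import numpy as np

def fsd_dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """经验一阶随机占优检验:若在所有阈值 t 上 F_a(t) <= F_b(t),
    则得分分布 a(越大越好)占优于 b。简化草图,无显著性检验。"""
    grid = np.union1d(a, b)
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return bool(np.all(cdf_a <= cdf_b))

rng = np.random.default_rng(0)
scores_nucleus = rng.normal(0.62, 0.05, size=500)  # 假设:核采样的某质量指标得分
scores_greedy = rng.normal(0.55, 0.05, size=500)   # 假设:贪心解码的同一指标得分

print("nucleus FSD-dominates greedy:", fsd_dominates(scores_nucleus, scores_greedy))
print("greedy FSD-dominates nucleus:", fsd_dominates(scores_greedy, scores_nucleus))
# 若两个方向都不成立,则两种解码策略在该指标上不可比,这正是偏序(而非全序)的含义。
```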
[NLP-66] The Democratic Paradox in Large Language Models' Underestimation of Press Freedom
【速读】: 该论文试图解决大型语言模型(LLMs)在评估全球新闻自由状况时存在的系统性偏差问题,这类偏差可能影响公众对新闻自由等基本民主制度的理解与信任。解决方案的关键在于以世界新闻自由指数(WPFI)的专家评估为基准,系统比较六个主流LLM对180个国家新闻自由的评分,从而揭示三类系统性扭曲:整体负向错位(各模型将71%至93%的国家评为比专家评估更不自由)、差异性错位(在新闻自由程度最高的国家低估最为严重)以及母国偏好(六个模型中有五个对其所属国家的评分显著高于其整体负向错位所预期的水平)。
链接: https://arxiv.org/abs/2506.18045
作者: I. Loaiza,R. Vestrelli,A. Fronzetti Colladon,R. Rigobon
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) increasingly mediate global information access for millions of users worldwide, their alignment and biases have the potential to shape public understanding and trust in fundamental democratic institutions, such as press freedom. In this study, we uncover three systematic distortions in the way six popular LLMs evaluate press freedom in 180 countries compared to expert assessments of the World Press Freedom Index (WPFI). The six LLMs exhibit a negative misalignment, consistently underestimating press freedom, with individual models rating between 71% to 93% of countries as less free. We also identify a paradoxical pattern we term differential misalignment: LLMs disproportionately underestimate press freedom in countries where it is strongest. Additionally, five of the six LLMs exhibit positive home bias, rating their home countries’ press freedoms more favorably than would be expected given their negative misalignment with the human benchmark. In some cases, LLMs rate their home countries between 7% to 260% more positively than expected. If LLMs are set to become the next search engines and some of the most important cultural tools of our time, they must ensure accurate representations of the state of our human and civic rights globally.
zh
[NLP-67] Markov-Enhanced Clustering for Long Document Summarization: Tackling the Lost in the Middle Challenge with Large Language Models
【速读】: 该论文试图解决长文档在生成式 AI (Generative AI) 摘要过程中出现的“中间迷失”问题,即模型在处理长文本时难以有效保留关键信息。其解决方案的关键在于提出一种混合摘要方法,通过将文档分割为较小的文本块、聚类向量嵌入、为每个聚类生成代表核心观点的摘要,并利用马尔可夫链图确定语义顺序来构建最终摘要。
链接: https://arxiv.org/abs/2506.18036
作者: Aziz Amari,Mohamed Achref Ben Ammar
机构: National Institute of Applied Science and Technology (INSAT), University of Carthage, Tunis, Tunisia
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid expansion of information from diverse sources has heightened the need for effective automatic text summarization, which condenses documents into shorter, coherent texts. Summarization methods generally fall into two categories: extractive, which selects key segments from the original text, and abstractive, which generates summaries by rephrasing the content coherently. Large language models have advanced the field of abstractive summarization, but they are resource-intensive and face significant challenges in retaining key information across lengthy documents, which we call being “lost in the middle”. To address these issues, we propose a hybrid summarization approach that combines extractive and abstractive techniques. Our method splits the document into smaller text chunks, clusters their vector embeddings, generates a summary for each cluster that represents a key idea in the document, and constructs the final summary by relying on a Markov chain graph when selecting the semantic order of ideas.
zh
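该方法的骨架(切块、嵌入、聚类、按马尔可夫转移确定语义顺序)可以用很短的代码示意。以下草图用 TF-IDF 代替论文中的向量嵌入、用 scikit-learn 的 KMeans 做聚类,并省略了对每个簇调用 LLM 生成摘要的步骤;这些替换与省略都是为了自包含演示而做的假设。

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def markov_cluster_order(chunks, k=3, seed=0):
    """对文本块聚类,用相邻块的簇标签构造簇间转移计数,
    再从首个块所在簇出发按最大转移概率贪心确定语义顺序(示意实现)。"""
    emb = TfidfVectorizer().fit_transform(chunks)  # 以 TF-IDF 近似向量嵌入(假设)
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(emb)

    trans = np.zeros((k, k))
    for a, b in zip(labels[:-1], labels[1:]):      # 相邻块 -> 簇间转移计数
        if a != b:
            trans[a, b] += 1

    order = [int(labels[0])]
    while len(order) < k:
        row = trans[order[-1]].copy()
        row[order] = -1                            # 屏蔽已访问的簇
        nxt = int(np.argmax(row))
        if row[nxt] <= 0:                          # 无转移证据时取任一未访问簇
            nxt = next(c for c in range(k) if c not in order)
        order.append(nxt)
    return order, labels

chunks = ["the model is trained on text", "training uses large corpora",
          "evaluation measures rouge", "rouge compares summaries",
          "we conclude with future work", "future work includes scaling"]
print(markov_cluster_order(chunks, k=3))
```

得到簇顺序后,再把各簇的摘要按该顺序拼接,即得到最终摘要。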
[NLP-68] Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices
【速读】: 该论文旨在解决在资源受限的设备端处理场景中,如何动态调整神经网络模型计算负载的问题,以适应有限且随时间变化的计算资源。其解决方案的关键在于引入并行层,这些层处理输入的下采样版本,与标准处理层结合使用,从而在不增加推理时间的前提下显著提升语音识别在标准基准上的性能,仅带来少量模型参数的增加。
链接: https://arxiv.org/abs/2506.18035
作者: Maxence Lasbordes,Daniele Falavigna,Alessio Brutti
机构: Université Paris-Dauphine, Université PSL; Télécom SudParis, Institut Polytechnique de Paris; Center for Augmented Intelligence; Fondazione Bruno Kessler
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 3 Postscript figures
Abstract:The ability to dynamically adjust the computational load of neural models during inference in a resource aware manner is crucial for on-device processing scenarios, characterised by limited and time-varying computational resources. Early-exit architectures represent an elegant and effective solution, since they can process the input with a subset of their layers, exiting at intermediate branches (the upmost layers are hence removed from the model). From a different perspective, for automatic speech recognition applications there are memory-efficient neural architectures that apply variable frame rate analysis, through downsampling/upsampling operations in the middle layers, reducing the overall number of operations and improving significantly the performance on well established benchmarks. One example is the Zipformer. However, these architectures lack the modularity necessary to inject early-exit branches. With the aim of improving the performance in early-exit models, we propose introducing parallel layers in the architecture that process downsampled versions of their inputs. We show that in this way the speech recognition performance on standard benchmarks significantly improves, at the cost of a small increase in the overall number of model parameters but without affecting the inference time.
zh
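为说明早退(early-exit)推理的基本机制,下面给出一个 PyTorch 草图:在若干中间层挂出口分类头,推理时一旦出口的最大后验概率超过阈值即提前返回。这只演示早退本身,并未包含论文提出的并行下采样层;维度、层数与阈值均为假设的示例值。

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """带中间出口分支的编码器草图(示意,非 Splitformer 原始结构)。"""
    def __init__(self, dim=64, num_layers=6, num_classes=10, exit_every=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_layers))
        self.exits = nn.ModuleDict({
            str(i): nn.Linear(dim, num_classes)
            for i in range(exit_every - 1, num_layers, exit_every)})

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if str(i) in self.exits:
                logits = self.exits[str(i)](x.mean(dim=1))   # 时间维池化后接出口头
                conf = logits.softmax(-1).max(-1).values
                if bool((conf >= threshold).all()):          # 批内全部足够置信则早退
                    return logits, i
        return logits, len(self.layers) - 1                  # 否则走到最后一个出口

model = EarlyExitEncoder()
logits, exit_layer = model(torch.randn(2, 20, 64))
print("exited at layer", exit_layer)
```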
[NLP-69] PDF Retrieval Augmented Question Answering
【速读】: 该论文试图解决现有问答(Question-Answering, QA)系统在处理PDF文件中多模态数据(如文本、图像、矢量图、图表和表格)时的局限性,这些数据类型通常超出传统QA系统设计时所针对的纯文本内容。解决方案的关键在于构建一个基于检索增强生成(Retrieval Augmented Generation, RAG)框架的综合性QA系统,通过优化对非文本元素的处理与集成,以及微调大语言模型,以更有效地应对涉及多种数据类型的复杂问题。
链接: https://arxiv.org/abs/2506.18027
作者: Thi Thu Uyen Hoang,Viet Anh Nguyen
机构: Cranberry-Lemon University (克兰伯里-柠檬大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. Recognizing the richness and diversity of data within PDFs–including text, images, vector diagrams, graphs, and tables–poses unique challenges for existing QA systems primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.
zh
[NLP-70] PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
【速读】: 该论文旨在解决多模态文档理解中的性能瓶颈与推理延迟问题,特别是在中文业务文档上的表现。其关键解决方案是通过提升合成数据质量、优化视觉特征融合策略以及改进推理方法来增强模型性能。其中,数据质量优化策略是核心创新点,通过使用大规模多模态预训练模型评估数据,并应用新颖的统计标准过滤异常值,从而确保高质量的训练数据;同时,通过对视觉Transformer(ViT)的分层分解和新型特征融合策略,提升了模型的表征能力与复杂推理效果。
链接: https://arxiv.org/abs/2506.18023
作者: Kui Huang,Xinrong Chen,Wenyu Lv,Jincheng Liao,Guanzhong Wang,Yi Liu
机构: Baidu Inc. (百度)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:This report introduces PP-DocBee2, an advanced version of the PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by 73.0% to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at this https URL.
zh
[NLP-71] A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment ACL2025
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在问答任务中面临的全局理解不足以及模型输出与人类伦理和质量偏好对齐困难的问题。其解决方案的关键在于提出GraphMPA,这是一个基于图的综合框架,通过构建层次化文档图来模拟人类认知过程中的信息理解和综合,并引入模式搜索偏好优化方法,利用概率匹配约束使模型输出更符合人类偏好。
链接: https://arxiv.org/abs/2506.17951
作者: Quanwei Tang,Sophia Yat Mei Lee,Junshuang Wu,Dong Zhang,Shoushan Li,Erik Cambria,Guodong Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: acl 2025 findings
Abstract:Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of our GraphMPA (this https URL).
zh
[NLP-72] Scatter-Based Innovation Propagation in Large Language Models for Multi-Stage Process Adaptation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多阶段流程中难以将特定阶段或组件引入的局部创新推广到其他部分的问题。解决方案的关键在于提出一种基于散射的创新扩展模型(innovation scatter model),该模型通过四个步骤实现创新的泛化与应用:首先识别核心创新,其次去除对具体阶段或组件的依赖以进行泛化,再次判断泛化后的创新是否适用于更广泛的场景,最后系统性地将其应用于结构相似的其他阶段。该模型利用各阶段间的结构冗余,提升新想法的适用性与可重用性。
链接: https://arxiv.org/abs/2506.17949
作者: Hong Su
机构: Chengdu University of Information Technology(成都信息工程大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) exhibit strong capabilities in reproducing and extending patterns observed during pretraining but often struggle to generalize novel ideas beyond their original context. This paper addresses the challenge of applying such localized innovations - introduced at a specific stage or component - to other parts of a multi-stage process. We propose a scatter-based innovation expansion model (innovation scatter model) that guides the LLM through a four-step process: (1) identifying the core innovation by comparing the user’s input with its surrounding context, (2) generalizing the innovation by removing references to specific stages or components, (3) determining whether the generalized innovation applies to a broader scope beyond the original stage, and (4) systematically applying it to other structurally similar stages using the LLM. This model leverages structural redundancy across stages to improve the applicability of novel ideas. Verification results demonstrate that the innovation scatter model enables LLMs to extend innovations across structurally similar stages, thereby enhancing generalization and reuse.
zh
[NLP-73] Tutorial: φ-Transductions in OpenFst via the Gallic Semiring
【速读】: 该论文试图解决OpenFst库中φ-转移因实现限制而无法直接用于有限状态转导器的问题,其关键解决方案是利用OpenFst提供的Gallic半环(Gallic semiring)功能来正确实现φ-转导。通过这一方法,作者演示了如何实现MaxMatch(WordPiece)分词算法,并提供了自包含的代码示例。
链接: https://arxiv.org/abs/2506.17942
作者: Marco Cognetta,Cyril Allauzen
机构: Google(谷歌)
类目: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)
备注: 8 pages, 2 figures, code included
Abstract:OpenFst, a popular finite-state transducer library, supports φ-transitions but, due to an implementation constraint, they cannot be used with transducers in a straightforward way. In this short tutorial, we describe how one can use other functionality provided by OpenFst (namely, the Gallic semiring) to correctly implement φ-transductions and demonstrate it by implementing the MaxMatch (WordPiece) tokenization algorithm (Devlin et al., 2019; Song et al., 2021). Accompanying self-contained code examples are provided: this https URL
zh
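教程所实现的 MaxMatch(WordPiece)算法本身与 FST 构造无关,可以用几行纯 Python 表达其语义:每一步取词表中能匹配当前剩余前缀的最长子词,非首子词带 "##" 前缀,匹配失败则整词回退为 UNK。以下是参考实现(示例词表为假设),仅用于说明算法,不涉及 OpenFst 或 Gallic 半环。

```python
def max_match(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """MaxMatch(WordPiece)贪心最长匹配分词的纯 Python 示意实现。"""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # 非词首子词带续接前缀
            if piece in vocab:
                cur = piece
                break
            end -= 1                       # 缩短候选,继续找最长匹配
        if cur is None:                    # 无任何子词可匹配,整词回退为 UNK
            return [unk]
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##ord", "aff"}
print(max_match("unaffable", vocab))  # ['un', '##aff', '##able']
```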
[NLP-74] Evolving Prompts In-Context: An Open-ended Self-replicating Perspective ICML2025
【速读】: 该论文试图解决传统大规模语言模型(Large Language Model, LLM)提示设计中依赖精心构造的指令和示例以提升上下文学习(In-Context Learning, ICL)性能的问题。其解决方案的关键在于提出一种新颖的提示设计范式,通过将随机示例剪枝为看似无意义的“gibberish”来显著提升模型在多种任务上的表现。这一方法不仅优于现有的自动提示优化技术,而且无需依赖复杂的算法或人类直觉,而是通过自发现的提示优化框架PromptQuine,在低数据条件下自动搜索有效的剪枝策略,从而生成高效且非传统的提示。
链接: https://arxiv.org/abs/2506.17930
作者: Jianyu Wang,Zhiqiang Hu,Lidong Bing
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注: ICML 2025, and Code will be released at: this https URL
Abstract:We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent “gibberish” can remarkably improve performance across diverse tasks. Notably, the “gibberish” always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. In terms of this, we propose a self-discover prompt optimization framework, PromptQuine, an evolutionary search framework that automatically searches for the pruning strategy by itself using only low-data regimes. Much like the emergent complexity in nature–such as symbiosis and self-organization–arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning, and provide a call to action, to pave the way for more open-ended search algorithms for more effective LLM prompting.
zh
[NLP-75] Multi-turn Jailbreaking via Global Refinement and Active Fabrication
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多轮对话场景中存在安全风险的问题,特别是针对当前多轮越狱攻击技术难以适应对话动态变化的局限性。解决方案的关键在于提出一种新的多轮越狱方法,该方法在每次交互中全局优化越狱路径,并主动生成模型响应以抑制安全警告,从而提高后续问题中诱导有害输出的可能性。
链接: https://arxiv.org/abs/2506.17881
作者: Hua Tang,Lingyong Yan,Yukun Zhao,Shuaiqiang Wang,Jizhou Huang,Dawei Yin
机构: Shanghai Jiao Tong University (上海交通大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have achieved exceptional performance across a wide range of tasks. However, they still pose significant safety risks due to the potential misuse for malicious purposes. Jailbreaks, which aim to elicit models to generate harmful content, play a critical role in identifying the underlying security threats. Recent jailbreaking primarily focuses on single-turn scenarios, while the more complicated multi-turn scenarios remain underexplored. Moreover, existing multi-turn jailbreaking techniques struggle to adapt to the evolving dynamics of dialogue as the interaction progresses. To address this limitation, we propose a novel multi-turn jailbreaking method that refines the jailbreaking path globally at each interaction. We also actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent questions. Experimental results demonstrate the superior performance of our method compared with existing single-turn and multi-turn jailbreaking techniques across six state-of-the-art LLMs. Our code is publicly available at this https URL.
zh
[NLP-76] How Alignment Shrinks the Generative Horizon
【速读】: 该论文试图解决对齐的大语言模型(LLM)生成输出缺乏多样性的问题,其核心在于理解为何这些模型在生成过程中表现出高度的稳定性。解决方案的关键在于引入分支因子(Branching Factor, BF),这是一个与标记无关的度量指标,用于量化生成过程中可能的下一步数量。通过实证分析发现,随着生成过程的推进,BF通常会下降,表明LLM在生成过程中变得更加可预测;同时,对齐调优显著压缩了模型输出分布,使BF大幅降低,从而解释了对齐模型对解码策略不敏感的现象。此外,研究还发现对齐的链式思维(Chain-of-Thought, CoT)模型通过生成更长的推理链,进入更低BF的确定性阶段,从而实现更稳定的输出。
链接: https://arxiv.org/abs/2506.17871
作者: Chenghao Yang,Ari Holtzman
机构: University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Codebase: this https URL , Website: this https URL
Abstract:Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model’s output distribution. To quantify this concentration, we introduce the Branching Factor (BF) – a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model’s output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model’s behavior, but instead steers it toward stylistic tokens (e.g., “Sure”) that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
zh
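分支因子(BF)刻画生成每一步"合理下一步"的有效数量。一种自然的实例化是取下一词分布熵的指数(即困惑度式的有效支撑大小);下面的草图按这一理解演示计算方式,论文中 BF 的精确定义以原文为准。

```python
import torch

def branching_factor(logits: torch.Tensor) -> torch.Tensor:
    """以"有效分支数 = exp(下一词分布的香农熵)"实例化 BF(示意)。
    logits: [seq_len, vocab_size],返回每个生成步的 BF。"""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # 每步的熵(nat)
    return entropy.exp()                          # 有效分支数

# 均匀分布在 12 个 token 上 -> BF ≈ 12;高度尖峰的分布 -> BF ≈ 1,
# 正对应摘要中对齐前后 BF 从约 12 降到约 1.2 的量级对比。
uniform = torch.zeros(1, 12)
peaked = torch.tensor([[10.0] + [0.0] * 11])
print(branching_factor(uniform))  # tensor([12.])
print(branching_factor(peaked))   # 接近 tensor([1.0])
```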
[NLP-77] QueueEDIT: Structural Self-Correction for Sequential Model Editing in LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在连续序列模型编辑(Sequential Model Editing, SME)过程中出现的幻觉问题以及因引入新参数而导致的通用能力下降问题。其解决方案的关键在于提出一种基于队列的自我校正框架(Queue-based Self-Correction Framework, QueueEDIT),通过结构映射编辑损失将三元组映射到Transformer层中的知识敏感神经元,并利用队列存储和动态对齐已编辑参数,从而有效处理长序列依赖性并减少参数偏差对模型通用能力的影响。
链接: https://arxiv.org/abs/2506.17864
作者: Taolin Zhang,Haidong Kang,Dongyang Li,Qizhou Chen,Chengyu Wang Xiaofeng He,Richang Hong
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recently, large language models (LLMs) have demonstrated impressive results but still suffer from hallucinations. Model editing has been proposed to correct factual inaccuracies in LLMs. A challenging case is sequential model editing (SME), which aims to rectify errors continuously rather than treating them as a one-time task. During SME, the general capabilities of LLMs can be negatively affected due to the introduction of new parameters. In this paper, we propose a queue-based self-correction framework (QueueEDIT) that not only enhances SME performance by addressing long-sequence dependency but also mitigates the impact of parameter bias on the general capabilities of LLMs. Specifically, we first introduce a structural mapping editing loss to map the triplets to the knowledge-sensitive neurons within the Transformer layers of LLMs. We then store the located parameters for each piece of edited knowledge in a queue and dynamically align previously edited parameters. In each edit, we select queue parameters most relevant to the currently located parameters to determine whether previous knowledge needs realignment. Irrelevant parameters in the queue are frozen, and we update the parameters at the queue head to the LLM to ensure they do not harm general abilities. Experiments show that our framework significantly outperforms strong baselines across various SME settings and maintains competitiveness in single-turn editing. The resulting LLMs also preserve high capabilities in general NLP tasks throughout the SME process.
zh
[NLP-78] LLMs for Customized Marketing Content Generation and Evaluation at Scale KDD
【速读】: 该论文旨在解决离线营销(offsite marketing)中广告内容过于通用、模板化且与着陆页不匹配的问题,从而提升广告效果。其解决方案的关键在于提出MarketingFM系统,该系统通过整合多源数据生成与关键词相关的广告文案,并在最小化人工干预的情况下提高广告点击率(CTR)、曝光量及成本效率。此外,为降低人工审核成本,论文还提出了AutoEval-Main和AutoEval-Update两个自动化评估框架,分别通过规则与大语言模型(LLM)结合的方式以及LLM与人类协作的动态优化机制,提升评估一致性并减少人工工作量。
链接: https://arxiv.org/abs/2506.17863
作者: Haoran Liu,Amir Tahmasbi,Ehtesham Sam Haque,Purak Jain
机构: Texas A&M University (德州农工大学); Amazon Inc. (亚马逊公司)
类目: Computation and Language (cs.CL)
备注: KDD LLM4ECommerce Workshop 2025
Abstract:Offsite marketing is essential in e-commerce, enabling businesses to reach customers through external platforms and drive traffic to retail websites. However, most current offsite marketing content is overly generic, template-based, and poorly aligned with landing pages, limiting its effectiveness. To address these limitations, we propose MarketingFM, a retrieval-augmented system that integrates multiple data sources to generate keyword-specific ad copy with minimal human intervention. We validate MarketingFM via offline human and automated evaluations and large-scale online A/B tests. In one experiment, keyword-focused ad copy outperformed templates, achieving up to 9% higher CTR, 12% more impressions, and 0.38% lower CPC, demonstrating gains in ad ranking and cost efficiency. Despite these gains, human review of generated ads remains costly. To address this, we propose AutoEval-Main, an automated evaluation system that combines rule-based metrics with LLM-as-a-Judge techniques to ensure alignment with marketing principles. In experiments with large-scale human annotations, AutoEval-Main achieved 89.57% agreement with human reviewers. Building on this, we propose AutoEval-Update, a cost-efficient LLM-human collaborative framework to dynamically refine evaluation prompts and adapt to shifting criteria with minimal human input. By selectively sampling representative ads for human review and using a critic LLM to generate alignment reports, AutoEval-Update improves evaluation consistency while reducing manual effort. Experiments show the critic LLM suggests meaningful refinements, improving LLM-human agreement. Nonetheless, human oversight remains essential for setting thresholds and validating refinements before deployment.
zh
[NLP-79] THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction
【速读】: 该论文试图解决从电子健康记录(Electronic Health Records, EHRs)中自动进行临床风险预测的问题,特别是如何建模结构化诊断代码和非结构化病历文本之间的复杂关系。现有方法通常分别处理这两种模态或采用简化的融合策略,忽略了叙述性观察如何引发诊断并跨住院过程传播风险的定向、层次化因果交互。解决方案的关键在于提出THCM-CAL,一个带有共形校准(conformal calibration)的时序-层次化因果模型,其框架构建了一个多模态因果图,通过层次化因果发现推断出三种临床相关的交互:同切片内模态序列、同切片跨模态触发以及跨切片风险传播,并将共形预测扩展到多标签ICD编码,以在复杂共现情况下校准每个诊断代码的置信度。
链接: https://arxiv.org/abs/2506.17844
作者: Xin Zhang,Qiyu Wei,Yingjie Zhu,Fanyi Wu,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:Automated clinical risk prediction from electronic health records (EHRs) demands modeling both structured diagnostic codes and unstructured narrative notes. However, most prior approaches either handle these modalities separately or rely on simplistic fusion strategies that ignore the directional, hierarchical causal interactions by which narrative observations precipitate diagnoses and propagate risk across admissions. In this paper, we propose THCM-CAL, a Temporal-Hierarchical Causal Model with Conformal Calibration. Our framework constructs a multimodal causal graph where nodes represent clinical entities from two modalities: Textual propositions extracted from notes and ICD codes mapped to textual descriptions. Through hierarchical causal discovery, THCM-CAL infers three clinically grounded interactions: intra-slice same-modality sequencing, intra-slice cross-modality triggers, and inter-slice risk propagation. To enhance prediction reliability, we extend conformal prediction to multi-label ICD coding, calibrating per-code confidence intervals under complex co-occurrences. Experimental results on MIMIC-III and MIMIC-IV demonstrate the superiority of THCM-CAL.
zh
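共形预测的核心步骤是利用校准集为不符合度得分选取分位数阈值,再据此构造预测集。下面是一个多标签场景的简化分裂式共形草图,以"1 - 预测概率"作为不符合度;这只是标准做法的演示,并非 THCM-CAL 在复杂共现下的原始校准算法,数据亦为随机生成的假设样本。

```python
import numpy as np

def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.1) -> float:
    """分裂式共形校准:对校准集中真实阳性标签的不符合度取
    (1-alpha) 分位数(含有限样本校正)作为阈值。"""
    nonconf = (1.0 - cal_probs)[cal_labels == 1]
    n = len(nonconf)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(nonconf, min(q, 1.0)))

def predict_set(probs: np.ndarray, threshold: float) -> np.ndarray:
    """预测集:所有不符合度不超过阈值的标签(如 ICD 代码)。"""
    return (1.0 - probs) <= threshold

rng = np.random.default_rng(0)
cal_probs = rng.uniform(size=(200, 5))                       # 假设的校准集预测概率
cal_labels = (rng.uniform(size=(200, 5)) < cal_probs).astype(int)
tau = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
print("threshold:", round(tau, 3))
print(predict_set(np.array([0.95, 0.40, 0.88, 0.10, 0.70]), tau))
```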
[NLP-80] Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)与人类偏好对齐的问题,特别是针对传统微调方法如RLHF(Reinforcement Learning from Human Feedback)和DPO(Direct Preference Optimization)在测试阶段无法优化模型性能以及当模型权重不可访问时无法应用的局限性。其解决方案的关键在于提出一种名为迭代重加权再优化(Iterative Reweight-then-Optimize, IRO)的强化学习(Reinforcement Learning, RL)框架,该框架在不修改模型参数的前提下,通过价值函数引导模型生成质量的提升,从而实现模型的对齐。
链接: https://arxiv.org/abs/2506.17828
作者: Xinnan Zhang,Chenliang Li,Siliang Zeng,Jiaxiang Li,Zhongruo Wang,Kaixiang Lin,Songtao Lu,Alfredo Garcia,Mingyi Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used in test-time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI’s reinforcement fine-tuning (RFT), but without requiring access to the model weights.
zh
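IRO 每次迭代的"采样、按价值函数重加权重采样、再训练下一轮价值函数"循环可以用一个极简骨架说明。以下草图中,基座模型与奖励函数都是占位的假设实现,且省略了价值函数的训练步骤,仅演示重加权重采样这一环。

```python
import math
import random

def reward(text: str) -> float:
    """占位奖励/价值函数:以是否包含关键词计分,纯属演示假设。"""
    return float("polite" in text) + 0.1 * len(set(text.split()))

def iro_step(base_sample, prompt, value_fn, n=8, beta=2.0):
    """IRO 单次迭代骨架:(i) 从冻结基座模型采样 n 个候选;
    (ii) 按 exp(beta * value) 做指数加权重采样;
    (iii) 重采样结果可用于训练下一轮轻量价值函数(此处省略训练)。"""
    candidates = [base_sample(prompt) for _ in range(n)]
    weights = [math.exp(beta * value_fn(c)) for c in candidates]
    total = sum(weights)
    return random.choices(candidates, weights=[w / total for w in weights], k=n)

def base_sample(prompt):
    """假设的冻结基座模型:从固定候选池随机返回一条回复(prompt 未用)。"""
    pool = ["sure here is a polite answer", "no",
            "maybe a polite reply", "random words here"]
    return random.choice(pool)

random.seed(0)
print(iro_step(base_sample, "greet the user", reward))
```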
[NLP-81] Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights
【速读】: 该论文旨在解决多语言自然语言处理(NLP)中分词策略在资源丰富语言上存在偏倚的问题,特别是针对印度次大陆等语言结构复杂且资源匮乏的语言。其解决方案的关键在于通过系统评估不同分词算法(自底向上的BPE与自顶向下的Unigram LM)以及多语言词表构建方法(如联合训练与基于聚类的训练),量化词表大小对分词效果的影响,并探索利用相关高资源语言训练的分词器提升极低资源语言分词性能的可行性。
链接: https://arxiv.org/abs/2506.17789
作者: N J Karthika,Maharaj Brahma,Rohit Saluja,Ganesh Ramakrishnan,Maunendra Sankar Desarkar
机构: IIT Hyderabad(印度理工学院海得拉巴分校); IIT Bombay(印度理工学院孟买分校); IIT Mandi(印度理工学院曼迪分校); BharatGen Consortium(印度基因组联盟)
类目: Computation and Language (cs.CL)
备注:
Abstract:Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), effects of vocabulary sizes, and compare strategies of multilingual vocabulary construction such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building more fair, efficient, and linguistically informed tokenizers for multilingual NLP.
zh
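这类分词器内在评估常用 fertility(平均每个词被切成多少子词)等指标:值越接近 1,词表对该语言越"友好"。下面是该指标的一个自包含草图,其中的两个分词函数只是给出上下界的占位实现,实际评估时应换成待比较的 BPE 或 Unigram LM 分词器。

```python
def fertility(tokenize, corpus):
    """fertility 指标:平均每个空格分隔词被切分出的子词数(示意)。
    tokenize 为任意"词 -> 子词列表"的分词函数。"""
    words = [w for line in corpus for w in line.split()]
    return sum(len(tokenize(w)) for w in words) / len(words)

def char_tokenize(word):          # 参照下界:逐字符切分,fertility = 平均词长
    return list(word)

def whole_word_tokenize(word):    # 参照上界:词表覆盖所有词,fertility = 1
    return [word]

corpus = ["tokenization matters for multilingual nlp",
          "low resource languages need fair tokenizers"]
print("char-level fertility:", round(fertility(char_tokenize, corpus), 2))
print("word-level fertility:", fertility(whole_word_tokenize, corpus))
```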
[NLP-82] Bayesian Social Deduction with Graph-Informed Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在社会推理(Social Reasoning)任务中的局限性,特别是在从部分观察中推断其他智能体的不可观测信念和意图方面。研究发现,尽管最大规模的模型表现出较强性能,但它们需要大量的测试时推理,并且在压缩为更小、具备实时能力的变体时性能会显著下降。解决方案的关键在于引入一种混合推理框架,将信念推断外部化到结构化的概率模型中,同时利用LLM进行语言理解和交互,从而在代理-代理对战中实现与更大模型相媲美的性能,并首次在受控研究中击败人类玩家。
链接: https://arxiv.org/abs/2506.17788
作者: Shahab Rahimirad,Guven Gergerli,Lucia Romero,Angela Qian,Matthew Lyle Olson,Simon Stepputtis,Joseph Campbell
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 32 pages, 10 figures. Under review
Abstract:Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at this https URL
zh
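论文的核心思路是把信念推断交给结构化概率模型。其最基本的一步是对隐藏角色假设做贝叶斯更新;下面的草图以一个两角色的玩具场景演示这一步,先验与似然数值均为假设,远未覆盖论文中针对 Avalon 的完整模型。

```python
import numpy as np

def update_beliefs(prior: np.ndarray, likelihoods: np.ndarray) -> np.ndarray:
    """对隐藏角色的贝叶斯更新:posterior ∝ prior * likelihood(示意)。
    prior[i] 为玩家属于角色 i 的先验,likelihoods[i] 为该角色假设下
    观测到当前行为(如某次投票)的似然。"""
    posterior = prior * likelihoods
    return posterior / posterior.sum()

roles = ["loyal", "spy"]
belief = np.array([0.7, 0.3])                 # 先验:多数玩家是忠诚方
# 观测:该玩家否决了一次看似合理的任务;假设间谍这样做的似然更高
belief = update_beliefs(belief, np.array([0.2, 0.6]))
print(dict(zip(roles, belief.round(3))))      # {'loyal': 0.438, 'spy': 0.562}
```

在混合框架中,LLM 负责把对话解析成这类可计算的观测事件,概率模型则持续维护并更新信念。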
[NLP-83] Beyond instruction-conditioning MoTE: Mixture of Task Experts for Multi-task Embedding Models
【速读】: 该论文试图解决在低容量模型中应用指令条件化进行嵌入专业化时所面临的表征限制问题,这些问题限制了专业化带来的性能提升。解决方案的关键在于引入了任务专家混合(Mixture of Task Experts, MoTE)变换器块,该结构通过任务感知对比学习(Task-Aware Contrastive Learning, TACL)训练任务专用参数,从而增强模型生成专业嵌入的能力。
链接: https://arxiv.org/abs/2506.17781
作者: Miguel Romero,Shuoyang Ding,Corey D. Barret,Georgiana Dinu,George Karypis
机构: Amazon(亚马逊); University of Minnesota(明尼苏达大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Dense embeddings are fundamental to modern machine learning systems, powering Retrieval-Augmented Generation (RAG), information retrieval, and representation learning. While instruction-conditioning has become the dominant approach for embedding specialization, its direct application to low-capacity models imposes fundamental representational constraints that limit the performance gains derived from specialization. In this paper, we analyze these limitations and introduce the Mixture of Task Experts (MoTE) transformer block, which leverages task-specialized parameters trained with Task-Aware Contrastive Learning (TACL) to enhance the model ability to generate specialized embeddings. Empirical results show that MoTE achieves 64% higher performance gains in retrieval datasets (+3.27 → +5.21) and 43% higher performance gains across all datasets (+1.81 → +2.60). Critically, these gains are achieved without altering instructions, training data, inference time, or number of active parameters.
zh
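"任务专家"的结构思想是:为每类任务维护一个专用 FFN,按任务标识路由,其余参数共享。下面的 PyTorch 草图按这一思想给出一个最小化示意(硬路由、维度与任务数均为假设),并非论文 MoTE 块的原始结构,也未包含 TACL 训练目标。

```python
import torch
import torch.nn as nn

class MoTEBlock(nn.Module):
    """任务专家混合思想的最小草图:共享注意力 + 按任务 ID 硬路由的专家 FFN。"""
    def __init__(self, dim=64, num_tasks=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_tasks))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, task_id: int):
        h, _ = self.attn(x, x, x)
        x = self.norm1(x + h)
        x = self.norm2(x + self.experts[task_id](x))  # 仅激活该任务的专家参数
        return x

block = MoTEBlock()
emb = block(torch.randn(2, 10, 64), task_id=1).mean(dim=1)  # 任务感知的句向量
print(emb.shape)  # torch.Size([2, 64])
```

由于每次前向只走一个专家,激活参数量与推理时间都与单 FFN 模型相同,这与摘要强调的"不增加活跃参数"一致。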
[NLP-84] HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations
【速读】: 该论文旨在解决语言模型(Language Models, LMs)在生成内容时出现的“幻觉”(hallucination)问题,即模型生成的内容与事实不符或偏离输入上下文。现有方法大多依赖于对每个输入进行多次生成以检测幻觉,导致计算成本和延迟增加。论文提出的解决方案关键在于一种无需训练的单次通过方法——通过解耦表示进行幻觉检测(HIDE),其核心思想是利用语言模型内部表示与生成输出之间的统计解耦现象,通过希尔伯特-施密特独立准则(Hilbert-Schmidt Independence Criterion, HSIC)量化这种解耦程度,从而实现高效且有效的幻觉检测。
链接: https://arxiv.org/abs/2506.17748
作者: Anwoy Chatterjee,Yash Goel,Tanmoy Chakraborty
机构: Indian Institute of Technology Delhi (印度理工学院德里分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Contemporary Language Models (LMs), while impressively fluent, often generate content that is factually incorrect or unfaithful to the input context - a critical issue commonly referred to as ‘hallucination’. This tendency of LMs to generate hallucinated content undermines their reliability, especially because these fabrications are often highly convincing and therefore difficult to detect. While several existing methods attempt to detect hallucinations, most rely on analyzing multiple generations per input, leading to increased computational cost and latency. To address this, we propose a single-pass, training-free approach for effective Hallucination detectIon via Decoupled rEpresentations (HIDE). Our approach leverages the hypothesis that hallucinations result from a statistical decoupling between an LM’s internal representations of input context and its generated output. We quantify this decoupling using the Hilbert-Schmidt Independence Criterion (HSIC) applied to hidden-state representations extracted while generating the output sequence. We conduct extensive experiments on four diverse question answering datasets, evaluating both faithfulness and factuality hallucinations across six open-source LMs of varying scales and properties. Our results demonstrate that HIDE outperforms other single-pass methods in almost all settings, achieving an average relative improvement of ~29% in AUC-ROC over the best-performing single-pass strategy across various models and datasets. Additionally, HIDE shows competitive and often superior performance with multi-pass state-of-the-art methods, obtaining an average relative improvement of ~3% in AUC-ROC while consuming ~51% less computation time. Our findings highlight the effectiveness of exploiting internal representation decoupling in LMs for efficient and practical hallucination detection.
zh
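HIDE 的核心量是上下文表示与输出表示之间的 HSIC。下面给出有偏 HSIC 估计量(HSIC = trace(KHLH)/(n-1)^2,K、L 为 RBF 核矩阵,H 为中心化矩阵)的 numpy 参考实现,仅演示 HSIC 本身的计算;隐状态来自随机数据,sigma 等取值为假设。

```python
import numpy as np

def rbf_kernel(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """有偏 HSIC 估计量:trace(K H L H) / (n-1)^2,H 为中心化矩阵(示意)。"""
    n = X.shape[0]
    K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

rng = np.random.default_rng(0)
ctx = rng.normal(size=(100, 16))                       # 假设:输入上下文的隐状态
coupled = ctx + 0.1 * rng.normal(size=(100, 16))       # 与上下文强耦合的"输出"表示
noise = rng.normal(size=(100, 16))                     # 与上下文无关的"输出"表示
print("coupled HSIC:", round(hsic(ctx, coupled), 4))   # 明显大于下式
print("decoupled HSIC:", round(hsic(ctx, noise), 4))   # 接近 0,对应"幻觉"式解耦
```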
[NLP-85] KAG-Thinker: Teaching Large Language Models to Think with Human-like Reasoning Process
【速读】: 该论文旨在解决在领域特定知识库(KB)上的问答(Q&A)任务中,大型语言模型(LLM)逻辑连贯性和上下文一致性不足的问题。其解决方案的关键在于提出KAG-Thinker框架,该框架通过建立结构化的思考过程来模拟人类认知机制,将复杂问题分解为可独立求解的子问题(即逻辑形式),并以自然语言和逻辑函数两种等价形式表示,进一步分类为知识检索或推理分析任务,同时通过逻辑函数接口显式建模依赖关系和变量。此外,该框架利用知识边界模型和深度求解模型优化知识获取,并采用监督微调与多轮对话对齐模型,以避免过度反思。
链接: https://arxiv.org/abs/2506.17728
作者: Dalong Zhang,Jun Xu,Jun Zhou,Lei Liang,Lin Yuan,Ling Zhong,Mengshu Sun,Peilong Zhao,QiWei Wang,Xiaorui Wang,Xinkai Du,YangYang Hou,Yu Ao,ZhaoYang Wang,Zhengke Gui,ZhiYing Yi,Zhongpu Bo
机构: Inclusion AI(包容人工智能); Ant Group(蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we introduce KAG-Thinker, a novel human-like reasoning framework built upon a parameter-light large language model (LLM). Our approach enhances the logical coherence and contextual consistency of the thinking process in question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within LLMs. This framework simulates human cognitive mechanisms for handling complex problems by establishing a structured thinking process. Continuing the Logical Form guided retrieval and reasoning technology route of KAG v0.7, firstly, it decomposes complex questions into independently solvable sub-problems (also referred to as logical forms) through breadth decomposition, each represented in two equivalent forms - natural language and logical function - and further classified as either Knowledge Retrieval or Reasoning Analysis tasks, with dependencies and variables passing explicitly modeled via logical function interfaces. In the solving process, the Retrieval function is used to perform knowledge retrieval tasks, while the Math and Deduce functions are used to perform reasoning analysis tasks. Secondly, it is worth noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external knowledge sources are regarded as equivalent KBs. We use the knowledge boundary model to determine the optimal source using self-regulatory mechanisms such as confidence calibration and reflective reasoning, and use the depth solving model to enhance the comprehensiveness of knowledge acquisition. Finally, instead of utilizing reinforcement learning, we employ supervised fine-tuning with multi-turn dialogues to align the model with our structured inference paradigm, thereby avoiding excessive reflection. This is supported by a data evaluation framework and iterative corpus synthesis, which facilitate the generation of detailed reasoning trajectories…
zh
[NLP-86] Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages
【速读】: 该论文试图解决历史文本中词性标注(POS tagging)的挑战,特别是在中世纪罗曼语语言(如中世纪奥克语、中世纪西班牙语和中世纪法语)处理中的问题,这些问题源于历时语言演变、拼写变化以及标注数据稀缺。解决方案的关键在于系统评估微调方法、提示工程、模型架构、解码策略以及跨语言迁移学习技术对标注准确率的影响,并探索针对低资源历史语言的独特有效技术。
链接: https://arxiv.org/abs/2506.17715
作者: Matthias Schöffel,Esteban Garces Arias,Marinus Wiedner,Paula Ruppert,Meimingwei Li,Christian Heumann,Matthias Aßenmacher
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Freiburg (弗莱堡大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal both notable limitations in LLMs’ ability to process historical language variations and non-standardized spelling, as well as promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.
zh
[NLP-87] Aged to Perfection: Machine-Learning Maps of Age in Conversational English
【速读】: 该论文试图解决语言模式在不同年龄群体间的差异性问题,以及如何通过语言特征预测说话者的年龄群体。其解决方案的关键在于结合计算语言分析与机器学习方法,以识别不同世代特有的语言标记,并构建能够从多个方面一致估计说话者年龄的预测模型。
链接: https://arxiv.org/abs/2506.17708
作者: MingZe Tang
机构: University of Aberdeen (阿伯丁大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 11 figures
Abstract:The study uses the British National Corpus 2014, a large sample of contemporary spoken British English, to investigate language patterns across different age groups. Our research attempts to explore how language patterns vary between different age groups, exploring the connection between speaker demographics and linguistic factors such as utterance duration, lexical diversity, and word choice. By merging computational language analysis and machine learning methodologies, we attempt to uncover distinctive linguistic markers characteristic of multiple generations and create prediction models that can consistently estimate the speaker’s age group from various aspects. This work contributes to our knowledge of sociolinguistic diversity throughout the life of modern British speech.
zh
[NLP-88] The Evolution of Natural Language Processing: How Prompt Optimization and Language Models are Shaping the Future
【速读】: 该论文试图解决当前在自然语言处理(Natural Language Processing, NLP)领域中,关于生成式 AI (Generative AI) 的提示优化策略缺乏系统性分析的问题。其解决方案的关键在于对多种提示优化策略进行分类与深入分析,提出了11种不同的类别,并基于其工作原理进行归纳,同时详细介绍了这些策略在各类NLP任务中的应用情况以及所使用的大型语言模型(Large Language Models, LLMs)和基准数据集。这一综合性研究为后续的比较研究和在统一实验条件下对提示优化及基于LLM的预测流程进行严格评估提供了坚实基础。
链接: https://arxiv.org/abs/2506.17700
作者: Summra Saleem,Muhammad Nabeel Asim,Shaista Zulfiqar,Andreas Dengel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by automating traditional labor-intensive tasks and consequently accelerated the development of computer-aided applications. As researchers continue to advance this field with the introduction of novel language models and more efficient training/finetuning methodologies, the idea of prompt engineering and subsequent optimization strategies with LLMs has emerged as a particularly impactful trend to yield a substantial performance boost across diverse NLP tasks. To best of our knowledge numerous review articles have explored prompt engineering, however, a critical gap exists in comprehensive analyses of prompt optimization strategies. To bridge this gap this paper provides unique and comprehensive insights about the potential of diverse prompt optimization strategies. It analyzes their underlying working paradigms and based on these principles, categorizes them into 11 distinct classes. Moreover, the paper provides details about various NLP tasks where these prompt optimization strategies have been employed, along with details of different LLMs and benchmark datasets used for evaluation. This comprehensive compilation lays a robust foundation for future comparative studies and enables rigorous assessment of prompt optimization and LLM-based predictive pipelines under consistent experimental settings: a critical need in the current landscape. Ultimately, this research will centralize diverse strategic knowledge to facilitate the adaptation of existing prompt optimization strategies for development of innovative predictors across unexplored tasks.
zh
[NLP-89] Zero-Shot Conversational Stance Detection: Dataset and Approaches ACL2025
【速读】: 该论文旨在解决零样本对话立场检测(zero-shot conversational stance detection)中因现有数据集目标类型有限而导致模型在面对大量未见过的目标时性能受限的问题。其解决方案的关键在于手动构建一个大规模、高质量的零样本对话立场检测数据集ZS-CSD,并提出一种基于说话人交互与目标感知原型对比学习(SITPCL)的模型,以提升模型在零样本场景下的表现。
链接: https://arxiv.org/abs/2506.17693
作者: Yuzhe Ding,Kang He,Bobo Li,Li Zheng,Haijun He,Fei Li,Chong Teng,Donghong Ji
机构: Wuhan University (武汉大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2025 (Findings)
Abstract:Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the increasing number of online debates among social media users, conversational stance detection has become a crucial research area. However, existing conversational stance detection datasets are restricted to a limited set of specific targets, which constrains the effectiveness of stance detection models when encountering a large number of unseen targets in real-world applications. To bridge this gap, we manually curate a large-scale, high-quality zero-shot conversational stance detection dataset, named ZS-CSD, comprising 280 targets across two distinct target types. Leveraging the ZS-CSD dataset, we propose SITPCL, a speaker interaction and target-aware prototypical contrastive learning model, and establish the benchmark performance in the zero-shot setting. Experimental results demonstrate that our proposed SITPCL model achieves state-of-the-art performance in zero-shot conversational stance detection. Notably, the SITPCL model attains only an F1-macro score of 43.81%, highlighting the persistent challenges in zero-shot conversational stance detection.
zh
[NLP-90] Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering
【速读】: 该论文旨在解决知识密集型多跳问答(Knowledge-intensive Multi-hop QA)任务中,大型语言模型(LLMs)在处理复杂查询时需要多次检索和迭代生成所带来的挑战,尤其是在轻量级LLMs中,面对大量文档和长上下文时容易出现幻觉和语义漂移的问题。其解决方案的关键在于提出一种名为DEC(Dynamic Enhancement Chain)的框架,该框架通过将复杂问题分解为逻辑连贯的子问题,构建无幻觉的推理链,并通过上下文感知的重写迭代优化子问题,从而生成高效的查询表述,同时结合轻量级判别性关键词提取模块实现精准的文档召回,有效降低了计算开销并提升了性能。
链接: https://arxiv.org/abs/2506.17692
作者: Binquan Ji,Haibo Luo,Yifei Lu,Lei Hei,Jiaqi Wang,Tingjing Liao,Lingyu Wang,Shichao Wang,Feiliang Ren
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges -such as hallucinations and semantic drift-for lightweight LLMs with fewer parameters. This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations. For retrieval, we introduce a lightweight discriminative keyword extraction module that leverages extracted keywords to achieve targeted, precise document recall with relatively low computational overhead. Extensive experiments on three multi-hop QA datasets demonstrate that DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. Notably, our approach attains state-of-the-art results on models with 8B parameters, showcasing its effectiveness in various scenarios, particularly in resource-constrained environments.
zh
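DEC 依赖一个轻量的判别式关键词抽取模块来驱动低开销的精准召回。作为思路演示,下面用 TF-IDF 权重近似"从子问题中选出最具区分度的检索关键词";这是一个常见的替代做法,并非论文训练的判别式模块,语料与问题均为假设示例。

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(question: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """在语料上拟合 TF-IDF,取子问题中权重最高的词作为检索关键词(示意)。"""
    vec = TfidfVectorizer(stop_words="english").fit(corpus + [question])
    weights = vec.transform([question]).toarray()[0]
    vocab = vec.get_feature_names_out()
    top = weights.argsort()[::-1][:top_k]
    return [vocab[i] for i in top if weights[i] > 0]

corpus = ["the eiffel tower is in paris", "paris is the capital of france",
          "gustave eiffel designed the tower"]
print(extract_keywords("who designed the eiffel tower", corpus))
```

抽得的关键词随后用于文档召回,再把召回结果喂给下一轮的子问题重写,形成推理链的迭代。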
[NLP-91] FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies
【速读】: 该论文试图解决Sparse Autoencoders (SAEs)在训练过程中存在的不稳定性和无法捕捉模型内部特征的问题。这些问题可能源于使用外部数据集(如网络数据或另一模型生成的数据)进行训练,这些数据集可能包含超出模型泛化能力的分布外(OOD)数据,导致生成“假特征”(Fake Features)。解决方案的关键在于提出FaithfulSAE,该方法通过在模型自身的合成数据集上训练SAEs,减少对外部数据集的依赖,从而提升SAEs的稳定性与对模型内部特征的捕捉能力。
链接: https://arxiv.org/abs/2506.17673
作者: Seonglae Cho,Harryn Oh,Donghyun Lee,Luis Eduardo Rodrigues Vieira,Andrew Bermingham,Ziad El Sayed
机构: University College London (伦敦大学学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 18 figures
Abstract:Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets - either collected from the Web or generated by another model - which may contain out-of-distribution (OOD) data beyond the model’s generalisation capabilities. This can result in hallucinated SAE features, which we term “Fake Features”, that misrepresent the model’s internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model’s own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
[NLP-92] TPTT: Transforming Pretrained Transformer into Titans
【Quick Read】: This paper tackles the high computational and memory demands of large language models (LLMs) in natural language processing, particularly for long-context inference. The key to the solution is the TPTT (Transforming Pretrained Transformer into Titans) framework, which upgrades pretrained Transformer models with efficient linearized attention mechanisms and advanced memory management techniques such as Memory as Gate (MaG) and mixed linearized attention (LiZA). TPTT is compatible with the Hugging Face Transformers library and supports seamless adaptation of any causal LLM via parameter-efficient fine-tuning (LoRA) without full retraining.
Link: https://arxiv.org/abs/2506.17671
Authors: Fabien Furfaro
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note: 6 pages, 1 figure
Abstract:Recent advances in large language models (LLMs) have led to remarkable progress in natural language processing, but their computational and memory demands remain a significant challenge, particularly for long-context inference. We introduce TPTT (Transforming Pretrained Transformer into Titans), a novel framework for enhancing pretrained Transformer models with efficient linearized attention mechanisms and advanced memory management. TPTT employs techniques such as Memory as Gate (MaG) and mixed linearized attention (LiZA). It is fully compatible with the Hugging Face Transformers library, enabling seamless adaptation of any causal LLM through parameter-efficient fine-tuning (LoRA) without full retraining. We show the effectiveness of TPTT on the MMLU benchmark with models of approximately 1 billion parameters, observing substantial improvements in both efficiency and accuracy. For instance, Titans-Llama-3.2-1B achieves a 20% increase in Exact Match (EM) over its baseline. Statistical analyses and comparisons with recent state-of-the-art methods confirm the practical scalability and robustness of TPTT. Code is available at this https URL, and a Python package at this https URL.
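To make "linearized attention" concrete, here is a textbook O(n) kernel-feature variant of the general family TPTT's LiZA mixes in; the paper's exact formulation is its own, and this non-causal version with an ELU+1 feature map is only an illustrative assumption.

```python
import torch

def linearized_attention(q, k, v):
    """Non-causal linear attention via a positive feature map.
    q, k: (batch, heads, seq, dim); v: (batch, heads, seq, dim_v)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1       # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # sum_n k_n v_n^T, O(n)
    z = k.sum(dim=2)                                     # normalizer sum_n k_n
    num = torch.einsum("bhnd,bhde->bhne", q, kv)
    den = torch.einsum("bhnd,bhd->bhn", q, z).unsqueeze(-1)
    return num / (den + 1e-6)
```

The point of the factorization is that the (seq x seq) attention matrix is never materialized, which is what makes long-context inference cheaper.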
[NLP-93] Step-Opt: Boosting Optimization Modeling in LLM s through Iterative Data Synthesis and Structured Validation
【Quick Read】: This paper addresses the difficulties large language models (LLMs) face in optimization modeling for Operations Research (OR), especially on complex problems. The key to the solution is the Step-Opt-Instruct framework, which uses iterative problem generation to progressively increase problem complexity and stepwise validation to rigorously check the data, preventing error propagation and ensuring the quality of the generated dataset. Fine-tuning open-source LLMs on this data yields Step-Opt, a model that achieves state-of-the-art performance on several benchmarks.
Link: https://arxiv.org/abs/2506.17637
Authors: Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, Jian Cheng
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Note: 17 pages, 12 figures
Abstract:Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt, a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. Our code and dataset are available at this https URL.
[NLP-94] Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs
【Quick Read】: This paper investigates a central question about large language models (LLMs): are they primarily anchored to final answers or to the textual pattern of reasoning chains? The key to the solution is a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect behavioral analysis, revealing a strong and consistent reliance on explicit answers. Performance drops sharply when answer cues are masked, even with complete reasoning chains, suggesting that much of the reasoning exhibited by LLMs may be post-hoc rationalization rather than genuine inference.
Link: https://arxiv.org/abs/2506.17630
Authors: Yang Wu, Yifan Zhang, Yiwei Wang, Yujun Cai, Yurong Wu, Yuran Wang, Ning Xu, Jian Cheng
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note: 14 pages, 8 figures
Abstract:While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, growing evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. In this work, we investigate a central question: are LLMs primarily anchored to final answers or to the textual pattern of reasoning chains? We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis. Experiments across state-of-the-art LLMs reveal a strong and consistent reliance on explicit answers. The performance drops by 26.90% when answer cues are masked, even with complete reasoning chains. These findings suggest that much of the reasoning exhibited by LLMs may reflect post-hoc rationalization rather than true inference, calling into question their inferential depth. Our study uncovers the answer-anchoring phenomenon with rigorous empirical validation and underscores the need for a more nuanced understanding of what constitutes reasoning in LLMs.
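The abstract does not spell out the five visibility levels, so the tiers below are purely illustrative assumptions; the sketch only shows the general shape of graded answer-cue manipulation.

```python
# Hypothetical five-tier answer-visibility prompt builder; the paper's actual
# levels and wording are not given in the abstract.
def build_prompt(question: str, reasoning: str, answer: str, level: int) -> str:
    tiers = {
        0: f"{question}",                                           # no cues at all
        1: f"{question}\nHint: the answer starts with '{answer[0]}'",
        2: f"{question}\nReasoning (answer masked): {reasoning.replace(answer, '[MASK]')}",
        3: f"{question}\nReasoning: {reasoning}",
        4: f"{question}\nReasoning: {reasoning}\nAnswer: {answer}", # full visibility
    }
    return tiers[level]
```

Comparing accuracy across such tiers is what lets the study separate reliance on the answer token itself from reliance on the reasoning text.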
[NLP-95] CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
【Quick Read】: This paper addresses the challenges Embodied Visual Reasoning (EVR) faces from the diversity of complex instructions and the intricate spatiotemporal dynamics of long-term egocentric videos. Existing methods either run Large Language Models (LLMs) over static video captions, losing critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. The key to the proposed CLiViS is to use LLMs for high-level task planning while orchestrating VLM-driven open-world visual perception to iteratively update the scene context; at its core is a dynamic Cognitive Map that evolves during reasoning, building a structured representation of the embodied scene that bridges low-level perception and high-level reasoning.
Link: https://arxiv.org/abs/2506.17629
Authors: Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang
Institutions: East China Normal University; Sofia University "St. Kliment Ohridski"; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Note:
Abstract:Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at this https URL.
[NLP-96] OpusLM: A Family of Open Unified Speech Language Models
【Quick Read】: This paper addresses openness and performance optimization for speech language models (SpeechLMs) by building OpusLMs, a family of open, unified SpeechLMs that strengthen speech recognition, speech synthesis, and text capabilities. The key to the solution is to initialize from decoder-only text language models and continually pre-train on large-scale speech-text pairs (213K hours) and text-only data (292B tokens), combined with a multi-stream language model architecture and multi-stage training strategies. Model-size scaling and annealing-based data selection also prove important for performance.
Link: https://arxiv.org/abs/2506.17611
Authors: Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note:
Abstract:This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance with existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
[NLP-97] TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting
【Quick Read】: This paper addresses the limited reliability of typhoon track forecasting caused by the sparsity of meteorological data, especially under nonlinear path shifts and limited historical observations. The key to the solution is the TyphoFormer framework, which introduces natural language descriptions as auxiliary prompts, embedding high-level meteorological semantics into the numerical time-series input to strengthen the model's grasp of contextual information. Fusing textual and sequential information in a unified Transformer encoder lets the model exploit contextual cues that numerical features alone cannot provide.
Link: https://arxiv.org/abs/2506.17609
Authors: Lincan Li, Eren Erman Ozguven, Yue Zhao, Guang Wang, Yiqun Xie, Yushun Dong
Institutions: Florida State University; FAMU-FSU College of Engineering; University of Southern California; University of Maryland
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Note:
Abstract:Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments are conducted on HURDAT2 benchmark, results show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
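The following is a minimal sketch of the input-fusion idea: an LLM-written description is embedded and prepended as a special token ahead of the numeric track series. The dimensions, the mean-pooled text embedding, and the forecasting head are all assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PromptedTrackEncoder(nn.Module):
    """Sketch of TyphoFormer-style fusion: text token prepended to numeric series."""
    def __init__(self, num_feats: int, d_model: int = 128, text_dim: int = 768):
        super().__init__()
        self.num_proj = nn.Linear(num_feats, d_model)
        self.txt_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)     # e.g., next (lat, lon) offset

    def forward(self, series, text_emb):
        # series: (B, T, num_feats); text_emb: (B, text_dim) per time window
        tokens = torch.cat([self.txt_proj(text_emb).unsqueeze(1),
                            self.num_proj(series)], dim=1)
        h = self.encoder(tokens)
        return self.head(h[:, -1])            # forecast from the last position
```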
[NLP-98] Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages
【Quick Read】: This paper addresses the understudied phenomenon of morphological defectivity: traditional linguistic resources rarely cover morphological gaps in morphologically rich languages, and while crowd-sourced resources such as Wikipedia and Wiktionary supply important data on rare linguistic phenomena, their reliability remains contested. The key to the solution is customizing a novel neural morphological analyzer to annotate Latin and Italian corpora, then using the large-scale annotations to computationally validate crowd-sourced lists of defective verbs compiled from Wiktionary, thereby assessing the quality of crowd-sourced data and advancing the computational analysis of defectivity.
Link: https://arxiv.org/abs/2506.17603
Authors: Jonathan Sakunkoo, Annabella Sakunkoo
Institutions: Stanford University OHS
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Note:
Abstract:Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary often serve as among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.
[NLP-99] Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
【Quick Read】: This paper addresses the problem that language models cannot reliably cite the documents they saw during pretraining; current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and retrieval noise. The key to the solution is to revise the training process so that models can attribute facts to pretraining documents without test-time retrieval. The proposed Active Indexing continually pretrains on synthetic QA pairs that restate each fact in diverse compositional forms and require bidirectional source-to-fact and fact-to-source generation, jointly teaching the model to generate content from a cited source and to attribute its own answers.
Link: https://arxiv.org/abs/2506.17585
Authors: Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra
Institutions: Duke University; Meta
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Note:
Abstract:Trustworthy language models should provide both correct and verifiable answers. While language models can sometimes attribute their outputs to pretraining data, their citations are often unreliable due to hallucination. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during (continual) pretraining–without test-time retrieval–by revising the training process. To evaluate this, we release CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel, unseen documents and probes both short-form (single fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to bind facts to persistent document identifiers, and (2) instruction tuning to elicit citation behavior. We find that simple Passive Indexing, which appends an identifier to each document, helps memorize verbatim text but fails on paraphrased or compositional facts. Instead, we propose Active Indexing, which continually pretrains on synthetic QA pairs that (1) restate each fact in diverse compositional forms, and (2) require bidirectional source-to-fact and fact-to-source generation, jointly teaching the model to generate content from a cited source and to attribute its own answers. Experiments with Qwen2.5-7B and 3B show that Active Indexing consistently outperforms Passive Indexing across all tasks and models, with citation precision gains up to 30.2 percent. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16 times the original token count.
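A minimal sketch of the bidirectional augmentation idea follows; the templates are illustrative assumptions, whereas the paper generates diverse compositional restatements with an LLM.

```python
# Sketch of Active Indexing style data: for each (doc_id, fact) pair, emit
# both source-to-fact and fact-to-source training examples.
def active_indexing_pairs(doc_id: str, facts: list[str]):
    examples = []
    for fact in facts:
        # source-to-fact: given the identifier, produce content from it
        examples.append({
            "prompt": f"According to document {doc_id}, state one fact it contains.",
            "completion": fact,
        })
        # fact-to-source: given the content, attribute the identifier
        examples.append({
            "prompt": f"Which document supports this claim? Claim: {fact}",
            "completion": doc_id,
        })
    return examples
```

Training in both directions is what distinguishes this from Passive Indexing, which merely appends the identifier to each document and, per the abstract, fails on paraphrased or compositional facts.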
[NLP-100] AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition
【Quick Read】: This paper addresses the shortage of high-quality agricultural datasets, especially in Chinese, which limits the performance of agricultural named entity recognition. The key to the solution is AgriCHN, a comprehensive open-source Chinese resource that covers a wide range of agricultural entity categories and also incorporates hydrology- and meteorology-related entities, improving entity diversity and annotation accuracy. With fine-grained entity divisions and rich entity types, AgriCHN offers higher data quality than existing resources and provides a challenging benchmark for follow-up research.
Link: https://arxiv.org/abs/2506.17578
Authors: Lingxiao Zeng, Yiqi Tong, Wei Guo, Huarui Wu, Lihao Ge, Yijun Ye, Fuzhen Zhuang, Deqing Wang, Wei Guo, Cheng Chen
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note:
Abstract:Agricultural named entity recognition is a specialized task focusing on identifying distinct agricultural entities within vast bodies of text, including crops, diseases, pests, and fertilizers. It plays a crucial role in enhancing information extraction from extensive agricultural text resources. However, the scarcity of high-quality agricultural datasets, particularly in Chinese, has resulted in suboptimal performance when employing mainstream methods for this purpose. Most earlier works only focus on annotating agricultural entities while overlooking the profound correlation of agriculture with hydrology and meteorology. To fill this gap, we present AgriCHN, a comprehensive open-source Chinese resource designed to promote the accuracy of automated agricultural entity annotation. The AgriCHN dataset has been meticulously curated from a wealth of agricultural articles, comprising a total of 4,040 sentences and encapsulating 15,799 agricultural entity mentions spanning 27 diverse entity categories. Furthermore, it encompasses entities from hydrology to meteorology, thereby enriching the diversity of entities considered. Data validation reveals that, compared with relevant resources, AgriCHN demonstrates outstanding data quality, attributable to its richer agricultural entity types and more fine-grained entity divisions. A benchmark task has also been constructed using several state-of-the-art neural NER models. Extensive experimental results highlight the significant challenge posed by AgriCHN and its potential for further research.
[NLP-101] LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning
【Quick Read】: This paper addresses the problem that medical image-report data for Medical Report Generation (MRG) are scattered across multiple centers and constrained by privacy regulations, making centralized training of large language models (LLMs) infeasible and hindering LLM-driven MRG. The key to the solution is FedMRG, the first framework to use Federated Learning (FL) for privacy-preserving, multi-center development of LLM-driven MRG models: low-rank factorization cuts the communication cost of parameter updates, while client-aware contrastive learning and a dual-adapter mutual boosting mechanism handle the challenges of multi-modal data heterogeneity.
Link: https://arxiv.org/abs/2506.17562
Authors: Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin, Hao Chen
Institutions: Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Note:
Abstract:LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
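To make the communication saving concrete, the sketch below transmits a weight update as two thin factors instead of a full matrix. FedMRG factorizes the trainable adapter updates directly; the SVD truncation here is an illustrative stand-in for that idea.

```python
import torch

def low_rank_update(delta_w: torch.Tensor, rank: int = 8):
    """Compress a (out_dim, in_dim) update into rank-r factors for transmission."""
    u, s, vh = torch.linalg.svd(delta_w, full_matrices=False)
    a = u[:, :rank] * s[:rank]       # (out_dim, rank)
    b = vh[:rank, :]                 # (rank, in_dim)
    return a, b                      # send a and b instead of delta_w

# Server side: delta_w_hat = a @ b reconstructs a rank-r approximation,
# so per-round traffic scales with rank * (out_dim + in_dim), not out_dim * in_dim.
```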
[NLP-102] Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception
【Quick Read】: This paper addresses the inability of traditional accent-perception models to quantify the gradient phonological-feature variation that listeners rely on for accent judgments. The key to the solution is to use pretrained representations from current self-supervised learning (SSL) speech models to analyze phonological feature-level variation that influences segmental accent perception. The study focuses on three segments (the labiodental approximant, the rhotic tap, and the retroflex stop), extracting phonological feature probabilities and pretrained representations from the CSLU Foreign Accented English corpus and pairing them with accent ratings from native speakers of American English. Results show that accent strength is best predicted by a subset of pretrained-representation features in which perceptually salient phonological features receive prominent weight, highlighting the value of self-supervised speech representations for modeling accent perception with interpretable phonological features.
Link: https://arxiv.org/abs/2506.17542
Authors: Nitin Venkateswaran, Kevin Tang, Ratree Wayland
Institutions: University of Florida; Heinrich Heine University Düsseldorf
Categories: Computation and Language (cs.CL)
Note:
Abstract:Traditional models of accent perception underestimate the role of gradient variations in phonological features which listeners rely upon for their accent judgments. We investigate how pretrained representations from current self-supervised learning (SSL) models of speech encode phonological feature-level variations that influence the perception of segmental accent. We focus on three segments: the labiodental approximant, the rhotic tap, and the retroflex stop, which are uniformly produced in the English of native speakers of Hindi as well as other languages in the Indian sub-continent. We use the CSLU Foreign Accented English corpus (Lander, 2007) to extract, for these segments, phonological feature probabilities using Phonet (Vásquez-Correa et al., 2019) and pretrained representations from Wav2Vec2-BERT (Barrault et al., 2023) and WavLM (Chen et al., 2022) along with accent judgements by native speakers of American English. Probing analyses show that accent strength is best predicted by a subset of the segment’s pretrained representation features, in which perceptually salient phonological features that contrast the expected American English and realized non-native English segments are given prominent weighting. A multinomial logistic regression of pretrained representation-based segment distances from American and Indian English baselines on accent ratings reveals strong associations between the odds of accent strength and distances from the baselines, in the expected directions. These results highlight the value of self-supervised speech representations for modeling accent perception using interpretable phonological features.
[NLP-103] DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning
【Quick Read】: This paper aims to strengthen the mathematical reasoning of Large Language Models (LLMs) through a new reward modeling framework, DuaShepherd. The key is to integrate two complementary reward signals, correctness and potential: the correctness signal targets identification of stepwise errors, while the potential signal targets the likelihood of reaching the correct final answer. A large-scale reward modeling dataset is built for both signals, and a unified multi-head architecture trains the two reward models in a multi-task setup, learning correctness and potential in parallel and yielding consistent gains across benchmarks.
Link: https://arxiv.org/abs/2506.17533
Authors: Yuanhao Wu, Juntong Song, Hanning Zhang, Tong Zhang, Cheng Niu
Institutions: NewsBreak; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Note:
Abstract:In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). While correctness-based signals emphasize identification of stepwise errors, potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing large-scale reward modeling dataset with both signals. A unified, multi-head architecture was explored to train the two reward models in a multi-task setup, demonstrating benefits from learning both correctness and potential in parallel. By combining these two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that this combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
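The following is a minimal sketch of a two-head process reward model whose compound score multiplies the two probabilities, as the abstract describes; the shared encoder is an assumption standing in for the base LLM backbone.

```python
import torch
import torch.nn as nn

class DuaHeadReward(nn.Module):
    """Two-head reward model: stepwise correctness and answer-reaching potential."""
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder = encoder                 # shared representation of a step
        self.correct_head = nn.Linear(hidden, 1)
        self.potential_head = nn.Linear(hidden, 1)

    def forward(self, step_repr):
        h = self.encoder(step_repr)
        p_correct = torch.sigmoid(self.correct_head(h))
        p_potential = torch.sigmoid(self.potential_head(h))
        return p_correct * p_potential         # compound probability per step
```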
[NLP-104] Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning ACL2025
【Quick Read】: This paper addresses significant quality issues found in some languages of widely used multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and VoxPopuli), which undermine their utility as training and evaluation sets and, in turn, downstream model performance. The key to the solution is dividing the quality issues into micro-level and macro-level categories and showing that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. A case analysis of Taiwanese Southern Min (nan_tw) highlights the need for proactive language planning (e.g., orthography prescriptions, dialect boundary definition) and stronger data quality control in Automatic Speech Recognition (ASR) dataset creation; the paper closes with guidelines and recommendations for future dataset development to make speech data resources more robust and reliable.
Link: https://arxiv.org/abs/2506.17525
Authors: Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, Pavel Golik
Institutions: Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note: Accepted by ACL 2025 Main Conference
Abstract:Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and VoxPopuli - shows that in some languages, these datasets suffer from significant quality issues. We believe addressing these issues will make these datasets more useful as training and evaluation sets, and improve downstream models. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness in creating robust and reliable speech data resources.
[NLP-105] VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
【Quick Read】: This paper addresses the fact that, while modern GPUs evolve rapidly, production compilers still rely on hand-crafted register-allocation heuristics that require substantial re-tuning for each hardware generation. The key to the solution is VeriLocc, a framework combining large language models (LLMs) with formal compiler techniques: an LLM is fine-tuned to translate machine intermediate representations (MIRs) into target-specific register assignments, static analysis provides cross-architecture normalization and generalization, and a verifier-guided regeneration loop ensures correctness.
Link: https://arxiv.org/abs/2506.17506
Authors: Lesheng Jin, Zhenyuan Ruan, Haohui Mai, Jingbo Shang
Institutions: UC San Diego; MIT; CausalFlow Inc.
Categories: Computation and Language (cs.CL); Operating Systems (cs.OS)
Note:
Abstract:Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization and a verifier-guided regeneration loop to ensure correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. Case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
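A minimal sketch of the verifier-guided regeneration loop follows; `model.generate` and `verify` are hypothetical stand-ins for the fine-tuned LLM and the formal checker, and the feedback format is an assumption.

```python
# Sample, check, and resample with verifier feedback until an allocation passes.
def allocate_registers(mir: str, model, verify, max_tries: int = 100):
    for attempt in range(max_tries):
        # Greedy first shot, then sample with temperature for diversity.
        assignment = model.generate(mir, temperature=0.0 if attempt == 0 else 0.8)
        ok, diagnostics = verify(mir, assignment)
        if ok:
            return assignment                  # formally checked allocation
        mir = mir + f"\n; verifier feedback: {diagnostics}"  # steer the retry
    raise RuntimeError("no verified allocation found")
```

This kind of loop is consistent with the reported near-100% pass@100: single-shot accuracy matters less when failed candidates can be cheaply rejected and regenerated.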
[NLP-106] Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems
【Quick Read】: This dissertation examines equity issues raised by generative AI in societal applications, the measurement of large-scale adoption patterns, and the effectiveness of LLMs for research feedback. The key lies in revealing the systematic biases introduced by institutional adoption of AI detectors, quantifying the spread of LLMs across writing domains with population-level algorithmic methods, and assessing LLMs' potential for feedback on research manuscripts through large-scale empirical analysis, providing grounding for sensible governance and application of these technologies.
Link: https://arxiv.org/abs/2506.17467
Authors: Weixin Liang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Note: Stanford CS PhD Dissertation
Abstract:Large language models (LLMs) have shown significant potential to change how we write, communicate, and create, leading to rapid adoption across society. This dissertation examines how individuals and institutions are adapting to and engaging with this emerging technology through three research directions. First, I demonstrate how the institutional adoption of AI detectors introduces systematic biases, particularly disadvantaging writers of non-dominant language varieties, highlighting critical equity concerns in AI governance. Second, I present novel population-level algorithmic approaches that measure the increasing adoption of LLMs across writing domains, revealing consistent patterns of AI-assisted content in academic peer reviews, scientific publications, consumer complaints, corporate communications, job postings, and international organization press releases. Finally, I investigate LLMs’ capability to provide feedback on research manuscripts through a large-scale empirical analysis, offering insights into their potential to support researchers who face barriers in accessing timely manuscript feedback, particularly early-career researchers and those from under-resourced settings.
[NLP-107] Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages
【Quick Read】: This paper addresses the limited practicality of automatic speech recognition (ASR) in linguistic fieldwork, where data are scarce, speech is spontaneous, and recordings are noisy. The key to the solution is fine-tuning two multilingual ASR models, MMS and XLS-R, and benchmarking their performance under controlled amounts of training data to determine which model suits extremely small versus moderate data conditions, giving field linguists reproducible ASR adaptation approaches that mitigate the transcription bottleneck in language documentation.
Link: https://arxiv.org/abs/2506.17459
Authors: Siyu Liang, Gina-Anne Levow
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note:
Abstract:Automatic Speech Recognition (ASR) has reached impressive accuracy for high-resource languages, yet its utility in linguistic fieldwork remains limited. Recordings collected in fieldwork contexts present unique challenges, including spontaneous speech, environmental noise, and severely constrained datasets from under-documented languages. In this paper, we benchmark the performance of two fine-tuned multilingual ASR models, MMS and XLS-R, on five typologically diverse low-resource languages with control of training data duration. Our findings show that MMS is best suited when extremely small amounts of training data are available, whereas XLS-R shows parity performance once training data exceed one hour. We provide linguistically grounded analysis that offers insights towards practical guidelines for field linguists, highlighting reproducible ASR adaptation approaches to mitigate the transcription bottleneck in language documentation.
[NLP-108] Beyond the Link: Assessing LLMs' Ability to Classify Political Content across Global Media
【Quick Read】: This paper asks whether URL-only information can effectively substitute for full-text analysis when classifying political content (PC). The key to the solution is to evaluate whether cutting-edge generative AI models (GPT, Llama, Mistral, Deepseek, Qwen, and Gemma) can accurately separate PC from non-PC, validating the feasibility and accuracy of URL-level analysis across languages and national contexts against human-labelled data and traditional supervised machine learning methods.
Link: https://arxiv.org/abs/2506.17435
Authors: Alberto Martinez-Serra, Alejandro De La Fuente, Nienke Viescher, Ana S. Cardenal
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note:
Abstract:The use of large language models (LLMs) is becoming common in the context of political science, particularly in studies that analyse individuals' use of digital media. However, while previous research has demonstrated LLMs' ability at labelling tasks, the effectiveness of using LLMs to classify political content (PC) from just URLs is not yet well explored. The work presented in this article bridges this gap by evaluating whether LLMs can accurately identify PC vs. non-PC from both the article text and the URLs from five countries (France, Germany, Spain, the UK, and the US) and different languages. Using cutting-edge LLMs like GPT, Llama, Mistral, Deepseek, Qwen and Gemma, we measure model performance to assess whether URL-level analysis can be a good approximation for full-text analysis of PC, even across different linguistic and national contexts. Model outputs are compared with human-labelled articles, as well as traditional supervised machine learning techniques, to set a baseline of performance. Overall, our findings suggest the capacity of URLs to embed most of the news content, providing a vital perspective on accuracy-cost balancing. We also account for contextual limitations and suggest methodological recommendations to use LLMs within political science studies.
[NLP-109] UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
【Quick Read】: This paper addresses how to judge the trustworthiness of Large Language Model (LLM) decisions in safety-critical applications involving sequential decision-making; existing LLM uncertainty quantification (UQ) methods are designed mainly for single-turn question answering and do not adequately cover multi-step decision-making. The key is a principled information-theoretic framework that decomposes LLM sequential decision uncertainty into internal uncertainty (intrinsic to the current decision) and extrinsic uncertainty (uncertainty inherited from preceding decisions, measured via Mutual Information, MI). The proposed UProp estimates extrinsic uncertainty efficiently and effectively by converting direct MI estimation into estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs).
Link: https://arxiv.org/abs/2506.17419
Authors: Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, Kaidi Xu
Institutions: Drexel University; Lawrence Livermore National Laboratory; Argonne National Laboratory; UNC Chapel Hill
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Note: 19 pages, 5 figures, 4 tables
Abstract:As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, resulting in multi-step decision-making scenarios, e.g., LLM agentic system, being underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is focused on existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Codes will be available at this https URL.
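To ground the MI-as-expected-PMI relationship the abstract relies on, here is a brute-force empirical version over paired decision samples; UProp's actual estimator works over trajectory-dependent decision processes rather than raw joint counts, so this is only illustrative.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (nats) of sampled decisions."""
    counts, n = Counter(samples), len(samples)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def extrinsic_mi(prev_decisions, cur_decisions):
    """Empirical I(prev; cur) from paired samples, computed as E[PMI]."""
    n = len(cur_decisions)
    joint = Counter(zip(prev_decisions, cur_decisions))
    p_prev, p_cur = Counter(prev_decisions), Counter(cur_decisions)
    mi = 0.0
    for (a, b), c in joint.items():
        pmi = math.log((c / n) / ((p_prev[a] / n) * (p_cur[b] / n)))
        mi += (c / n) * pmi              # MI is the expectation of PMI
    return mi
```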
[NLP-110] Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study
【Quick Read】: This paper tackles the open problem of identifying and studying, at scale and from audio transcripts, the tutoring actions most associated with student learning. The key to the solution is using generative AI to identify and evaluate specific tutoring strategies in authentic math tutoring, such as delivering effective praise and responding to student math errors; experiments show the models detect the relevant situations and assess adherence to tutoring best practices with high accuracy and close alignment with human judgments.
Link: https://arxiv.org/abs/2506.17410
Authors: Danielle R. Thomas, Conrad Borchers, Jionghao Lin, Sanjit Kakarla, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Ralph Abboud, Kenneth R. Koedinger
Institutions: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Note: Short research paper accepted at EC-TEL 2025
Abstract:Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale based on audio transcriptions is an open research problem. This present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors’ application of two tutor skills: delivering effective praise and responding to student math errors. All models reliably detected relevant situations, for example, tutors providing praise to students (94-98% accuracy) and a student making a math error (82-88% accuracy) and effectively evaluated the tutors’ adherence to tutoring best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). We propose a cost-effective prompting strategy and discuss practical implications for using large language models to support scalable assessment in authentic settings. This work further contributes LLM prompts to support reproducibility and research in AI-supported learning.
[NLP-111] Cash or Comfort? How LLMs Value Your Inconvenience
【Quick Read】: This paper addresses the uncertainty and unreliability of current Large Language Models (LLMs) when financial gain conflicts with user comfort. The key to the solution is quantifying the prices multiple LLMs assign to a series of user discomforts (extra walking, waiting, hunger, and pain), exposing core weaknesses of today's LLMs as decision-making assistants: large variance in responses across models, sensitivity to minor prompt rephrasings, acceptance of unreasonably low rewards for major inconveniences, and rejection of monetary gains that impose no discomfort at all.
Link: https://arxiv.org/abs/2506.17367
Authors: Mateusz Cedro, Timour Ichmoukhamedov, Sofie Goethals, Yifan He, James Hinns, David Martens
Institutions: University of Antwerp
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Note: 12 pages, 4 figures, 3 tables
Abstract:Large Language Models (LLMs) are increasingly proposed as near-autonomous artificial intelligence (AI) agents capable of making everyday decisions on behalf of humans. Although LLMs perform well on many technical tasks, their behaviour in personal decision-making remains less understood. Previous studies have assessed their rationality and moral alignment with human decisions. However, the behaviour of AI assistants in scenarios where financial rewards are at odds with user comfort has not yet been thoroughly explored. In this paper, we tackle this problem by quantifying the prices assigned by multiple LLMs to a series of user discomforts: additional walking, waiting, hunger and pain. We uncover several key concerns that strongly question the prospect of using current LLMs as decision-making assistants: (1) a large variance in responses between LLMs, (2) within a single LLM, responses show fragility to minor variations in prompt phrasing (e.g., reformulating the question in the first person can considerably alter the decision), (3) LLMs can accept unreasonably low rewards for major inconveniences (e.g., 1 Euro to wait 10 hours), and (4) LLMs can reject monetary gains where no discomfort is imposed (e.g., 1,000 Euro to wait 0 minutes). These findings emphasize the need for scrutiny of how LLMs value human inconvenience, particularly as we move toward applications where such cash-versus-comfort trade-offs are made on users’ behalf.
[NLP-112] Towards Safety Evaluations of Theory of Mind in Large Language Models
【Quick Read】: This paper addresses the potentially deceptive behavior of large language models (LLMs) observed in safety evaluations, where models confronted with unfavorable information may conceal their true intentions and give false answers. The key to the solution is measuring LLMs' theory-of-mind capabilities, assessing whether they can understand others' mental states and adjust their behavior accordingly, in order to judge whether such behavior stems from covert, intentional processes.
Link: https://arxiv.org/abs/2506.17352
Authors: Tatsuhiro Aoshima, Mitsuaki Akiyama
Institutions: NTT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note:
Abstract:As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open-weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs' theory of mind, and discuss remaining challenges for future work.
[NLP-113] Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM
【Quick Read】: This paper addresses early detection of cognitive impairment (CI); traditional approaches rely on supervised models trained on hand-annotated acoustic and linguistic speech features and generalize poorly. The key to the solution is a zero-shot speech-based CI detection method built on the Qwen2-Audio AudioLLM: prompt-based instructions guide the model to classify speech samples as normal cognition or cognitive impairment, without dataset- or language-specific supervision, achieving promising generalizability and consistency across languages, tasks, and datasets.
Link: https://arxiv.org/abs/2506.17351
Authors: Mostafa Shahin, Beena Ahmed, Julien Epps
Institutions: University of New South Wales
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Note:
Abstract:Cognitive impairment (CI) is of growing public health concern, and early detection is vital for effective intervention. Speech has gained attention as a non-invasive and easily collectible biomarker for assessing cognitive decline. Traditional CI detection methods typically rely on supervised models trained on acoustic and linguistic features extracted from speech, which often require manual annotation and may not generalise well across datasets and languages. In this work, we propose the first zero-shot speech-based CI detection method using the Qwen2-Audio AudioLLM, a model capable of processing both audio and text inputs. By designing prompt-based instructions, we guide the model in classifying speech samples as indicative of normal cognition or cognitive impairment. We evaluate our approach on two datasets: one in English and another multilingual, spanning different cognitive assessment tasks. Our results show that the zero-shot AudioLLM approach achieves performance comparable to supervised methods and exhibits promising generalizability and consistency across languages, tasks, and datasets.
[NLP-114] Beyond Prediction – Structuring Epistemic Integrity in Artificial Reasoning Systems
【Quick Read】: This paper addresses the design of artificial intelligence systems under strict epistemic constraints, aiming to move beyond stochastic language prediction toward structured reasoning, propositional commitment, and contradiction detection. The key to the solution is a comprehensive framework that formalizes belief representation, metacognitive processes, and normative verification, integrating symbolic inference, knowledge graphs, and blockchain-based justification to ensure truth-preserving, auditably rational epistemic agents.
Link: https://arxiv.org/abs/2506.17331
Authors: Craig Steven Wright
Institutions: University of Exeter
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Logic (math.LO)
Note: 126 pages, 0 figures, includes formal frameworks and architecture blueprint; no prior version; suitable for submission under AI and Logic categories
Abstract:This paper develops a comprehensive framework for artificial intelligence systems that operate under strict epistemic constraints, moving beyond stochastic language prediction to support structured reasoning, propositional commitment, and contradiction detection. It formalises belief representation, metacognitive processes, and normative verification, integrating symbolic inference, knowledge graphs, and blockchain-based justification to ensure truth-preserving, auditably rational epistemic agents.
[NLP-115] PRAISE: Enhancing Product Descriptions with LLM-Driven Structured Insights ACL2025
【Quick Read】: This paper addresses inaccurate or incomplete product descriptions in e-commerce, which usually stem from insufficient seller-provided information; customer reviews carry valuable details but are laborious to sift through by hand. The key to the solution is PRAISE (Product Review Attribute Insight Structuring Engine), which uses Large Language Models (LLMs) to automatically extract, compare, and structure insights from customer reviews and seller descriptions, identifying missing, contradictory, or partially matching details between the two sources and presenting the discrepancies in a clear, structured format alongside supporting evidence from reviews.
Link: https://arxiv.org/abs/2506.17314
Authors: Adnan Qidwai, Srija Mukhopadhyay, Prerana Khatiwada, Dan Roth, Vivek Gupta
Institutions: IIIT Hyderabad; University of Delaware; University of Pennsylvania; Arizona State University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Note: 9 Pages, 9 Figures. Accepted at ACL 2025 System Demonstration Track
Abstract:Accurate and complete product descriptions are crucial for e-commerce, yet seller-provided information often falls short. Customer reviews offer valuable details but are laborious to sift through manually. We present PRAISE: Product Review Attribute Insight Structuring Engine, a novel system that uses Large Language Models (LLMs) to automatically extract, compare, and structure insights from customer reviews and seller descriptions. PRAISE provides users with an intuitive interface to identify missing, contradictory, or partially matching details between these two sources, presenting the discrepancies in a clear, structured format alongside supporting evidence from reviews. This allows sellers to easily enhance their product listings for clarity and persuasiveness, and buyers to better assess product reliability. Our demonstration showcases PRAISE’s workflow, its effectiveness in generating actionable structured insights from unstructured reviews, and its potential to significantly improve the quality and trustworthiness of e-commerce product catalogs.
[NLP-116] Mercury: Ultra-Fast Language Models Based on Diffusion
【Quick Read】: This paper targets the balance between efficiency and quality when large language models (LLMs) generate code. The key to the solution is building a new generation of commercial-scale diffusion-based LLMs, Mercury Coder: parameterized with the Transformer architecture and trained to predict multiple tokens in parallel, the models raise inference speed substantially while maintaining high-quality output.
Link: https://arxiv.org/abs/2506.17298
Authors: Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov
Institutions: Inception Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note: 15 pages; equal core, cross-function, senior authors listed alphabetically
Abstract:We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at this https URL and free playground at this https URL
[NLP-117] Semantic uncertainty in advanced decoding methods for LLM generation
【Quick Read】: This paper investigates semantic uncertainty in large language model (LLM) outputs, focusing on how different decoding methods affect the diversity and reliability of what models produce. The key finding is that structured decoding strategies, such as chain-of-thought (CoT) decoding and speculative sampling, can increase semantic exploration while maintaining or improving output quality: CoT decoding shows higher semantic diversity yet lower predictive entropy, indicating more confident and accurate outputs, while speculative sampling achieves strong ROUGE scores with moderate semantic diversity on summarization, challenging the assumed trade-off between diversity and accuracy.
Link: https://arxiv.org/abs/2506.17296
Authors: Darius Foodeei, Simin Fan, Martin Jaggi
Institutions: EPFL, Lausanne, Switzerland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note:
Abstract:This study investigates semantic uncertainty in large language model (LLM) outputs across different decoding methods, focusing on emerging techniques like speculative sampling and chain-of-thought (CoT) decoding. Through experiments on question answering, summarization, and code generation tasks, we analyze how different decoding strategies affect both the diversity and reliability of model outputs. Our findings reveal that while CoT decoding demonstrates higher semantic diversity, it maintains lower predictive entropy, suggesting that structured exploration can lead to more confident and accurate outputs. This is evidenced by a 48.8% improvement in code generation Pass@2 rates, despite lower alignment with reference solutions. For summarization tasks, speculative sampling proved particularly effective, achieving superior ROUGE scores while maintaining moderate semantic diversity. Our results challenge conventional assumptions about trade-offs between diversity and accuracy in language model outputs, demonstrating that properly structured decoding methods can increase semantic exploration while maintaining or improving output quality. These findings have significant implications for deploying language models in practical applications where both reliability and diverse solution generation are crucial.
[NLP-118] AI-Generated Game Commentary: A Survey and a Datasheet Repository
【Quick Read】: This paper addresses the technical challenges of AI-Generated Game Commentary (AIGGC), a comprehensive multimodal NLP task demanding factual accuracy, logical reasoning, expressive text generation, generation speed, and context management. The key to the solution is a general framework plus a systematic survey of 45 existing game commentary datasets and methods, together with a classification and comparison of the evaluation metrics commonly used in the field, to support future research and benchmarking.
Link: https://arxiv.org/abs/2506.17294
Authors: Qirui Zheng, Xingbo Wang, Keyuan Cheng, Yunlong Lu, Wenxin Li
Institutions: Peking University; South China University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note:
Abstract:AI-Generated Game Commentary (AIGGC) has gained increasing attention due to its market potential and inherent technical challenges. As a comprehensive multimodal Natural Language Processing (NLP) task, AIGGC imposes substantial demands on language models, including factual accuracy, logical reasoning, expressive text generation, generation speed, and context management. In this paper, we introduce a general framework for AIGGC and present a comprehensive survey of 45 existing game commentary datasets and methods according to the key challenges they aim to address in this domain. We further classify and compare various evaluation metrics commonly used in this domain. To support future research and benchmarking, we also provide a structured datasheet summarizing the essential attributes of these datasets in the appendix, which is also publicly available in an open repository.
[NLP-119] SlimRAG: Retrieval without Graphs via Entity-Aware Context Selection
【Quick Read】: This paper addresses the heavy structural overhead and imprecise retrieval of graph-based retrieval-augmented generation (RAG) systems, which stem from the fact that semantic similarity does not imply semantic relevance. The key to the solution is SlimRAG, a lightweight graph-free retrieval framework built on a simple yet effective entity-aware mechanism: at indexing time it constructs an entity-to-chunk table from semantic embeddings, and at query time it identifies salient entities and retrieves the associated chunks, assembling concise, contextually relevant input without graph traversal or edge construction.
Link: https://arxiv.org/abs/2506.17288
Authors: Jiale Zhang, Jiaxiang Chen, Zhucong Li, Jie Ding, Kui Zhao, Zenglin Xu, Xin Pang, Yinghui Xu
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Note:
Abstract:Retrieval-Augmented Generation (RAG) enhances language models by incorporating external knowledge at inference time. However, graph-based RAG systems often suffer from structural overhead and imprecise retrieval: they require costly pipelines for entity linking and relation extraction, yet frequently return subgraphs filled with loosely related or tangential content. This stems from a fundamental flaw: semantic similarity does not imply semantic relevance. We introduce SlimRAG, a lightweight framework for retrieval without graphs. SlimRAG replaces structure-heavy components with a simple yet effective entity-aware mechanism. At indexing time, it constructs a compact entity-to-chunk table based on semantic embeddings. At query time, it identifies salient entities, retrieves and scores associated chunks, and assembles a concise, contextually relevant input, without graph traversal or edge construction. To quantify retrieval efficiency, we propose Relative Index Token Utilization (RITU), a metric measuring the compactness of retrieved content. Experiments across multiple QA benchmarks show that SlimRAG outperforms strong flat and graph-based baselines in accuracy while reducing index size and RITU (e.g., 16.31 vs. 56+), highlighting the value of structure-free, entity-centric context selection. The code will be released soon at this https URL.
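The following is a minimal sketch of an entity-to-chunk index of the kind described; `extract_entities` is a hypothetical stand-in for the entity recognizer, and the entity-overlap scoring simplifies the paper's embedding-based chunk scoring.

```python
from collections import defaultdict

class SlimIndex:
    """Graph-free retrieval via an entity -> chunk-id table."""
    def __init__(self, extract_entities):
        self.extract_entities = extract_entities
        self.table = defaultdict(set)          # entity -> chunk ids
        self.chunks = []

    def add(self, chunk: str):
        cid = len(self.chunks)
        self.chunks.append(chunk)
        for ent in self.extract_entities(chunk):
            self.table[ent].add(cid)

    def query(self, question: str, top_k: int = 5):
        scores = defaultdict(int)
        for ent in self.extract_entities(question):   # salient query entities
            for cid in self.table.get(ent, ()):
                scores[cid] += 1                      # shared-entity count
        best = sorted(scores, key=scores.get, reverse=True)[:top_k]
        return [self.chunks[cid] for cid in best]
```

No edges are ever built or traversed; relevance is mediated entirely by which entities a chunk and a query share.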
[NLP-120] GTA: Grouped-head latenT Attention
【Quick Read】: This paper addresses the excessive computational and memory overhead of attention in large language models (LLMs): in long-context settings the key-value (KV) cache and attention computation grow rapidly, limiting deployment on resource-constrained hardware. The key to the solution is Grouped-Head Latent Attention (GTA), a new attention mechanism that shrinks the key cache via a shared attention map mechanism and compresses the value cache into a latent space with a nonlinear value decoder using learned projections, markedly reducing memory use and computational complexity while preserving model performance.
Link: https://arxiv.org/abs/2506.17286
Authors: Luoyang Sun, Jiwen Jiang, Cheng Deng, Xinjian Wu, Haifeng Zhang, Lei Chen, Lionel Ni, Jun Wang
Institutions: Institution of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; AI Lab, The Yangtze River Delta; The Hong Kong University of Science and Technology (Guangzhou); University College London; The Hong Kong University of Science and Technology; UCL Centre for Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note:
Abstract:Attention mechanisms underpin the success of large language models (LLMs), yet their substantial computational and memory overhead poses challenges for optimizing efficiency and performance. A critical bottleneck arises as KV cache and attention computations scale rapidly with text length, challenging deployment on hardware with limited computational and memory resources. We observe that attention mechanisms exhibit substantial redundancy, since the KV cache can be significantly compressed and attention maps across heads display high similarity, revealing that much of the computation and storage is unnecessary. Leveraging these insights, we propose Grouped-Head LatenT Attention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance. GTA comprises two components: (1) a shared attention map mechanism that reuses attention scores across multiple heads, decreasing the key cache size; and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space, further cutting memory needs. GTA cuts attention computation FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, all while avoiding the extra overhead of Multi-Head Latent Attention to improve LLM deployment efficiency. Consequently, GTA models achieve a 2x increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from the smaller cache footprint.
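The sketch below illustrates only the shared-attention-map component: heads within a group reuse one attention map while keeping per-head values. Taking the group mean as the representative query/key is an assumption, and the latent value decoder is omitted.

```python
import torch

def grouped_shared_attention(q, k, v, group_size: int):
    """One attention map per head group, shared across the group's heads.
    q, k: (B, H, N, D); v: (B, H, N, Dv)."""
    B, H, N, D = q.shape
    G = H // group_size
    # Representative query/key per group (mean over the group's heads).
    qg = q.view(B, G, group_size, N, D).mean(2)
    kg = k.view(B, G, group_size, N, D).mean(2)
    attn = torch.softmax(qg @ kg.transpose(-1, -2) / D ** 0.5, dim=-1)  # (B,G,N,N)
    attn = attn.repeat_interleave(group_size, dim=1)                    # share per group
    return attn @ v                                                     # (B,H,N,Dv)
```

Because only one key set per group needs caching, the key-cache footprint drops by roughly the group size, which is the effect the abstract quantifies.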
[NLP-121] Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLM s to SLMs
【Quick Read】: This paper addresses the security and ethical issues raised by jailbreak attacks on large language models (LLMs). Existing jailbreak methods suffer from low efficiency, high computational cost, and poor cross-model adaptability and versatility, making it hard to keep pace with rapidly evolving LLMs and new defense strategies. The key to the solution is Adversarial Prompt Distillation, which combines masked language modeling, reinforcement learning, and dynamic temperature control in a prompt generation and distillation method, enabling small language models (SLMs) to mount jailbreak attacks on mainstream LLMs; experiments confirm its advantages in attack success rate, harmfulness, resource efficiency, and cross-model adaptability.
Link: https://arxiv.org/abs/2506.17231
Authors: Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin
Institutions: Xi'an Jiaotong-Liverpool University; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Note: 15 pages, 5 figures
Abstract:Attacks on large language models (LLMs) in jailbreaking scenarios raise many security and ethical issues. Current jailbreak attack methods face problems such as low efficiency, high computational cost, and poor cross-model adaptability and versatility, which make it difficult to cope with the rapid development of LLM and new defense strategies. Our work proposes an Adversarial Prompt Distillation, which combines masked language modeling, reinforcement learning, and dynamic temperature control through a prompt generation and distillation method. It enables small language models (SLMs) to jailbreak attacks on mainstream LLMs. The experimental results verify the superiority of the proposed method in terms of attack success rate and harm, and reflect the resource efficiency and cross-model adaptability. This research explores the feasibility of distilling the jailbreak ability of LLM to SLM, reveals the model’s vulnerability, and provides a new idea for LLM security research.
[NLP-122] Outcome-Based Education: Evaluating Students' Perspectives Using Transformer
【Quick Read】: This paper asks how analyzing student feedback can assess and improve educational outcomes in support of Outcome-Based Education (OBE). The key to the solution is a Transformer-based approach, specifically DistilBERT, combined with LIME (Local Interpretable Model-agnostic Explanations) to raise sentiment-classification accuracy and make the model's predictions interpretable, enabling more effective identification of patterns in students' learning experiences.
Link: https://arxiv.org/abs/2506.17223
Authors: Shuvra Smaran Das, Anirban Saha Anik, Md Kishor Morol, Mohammad Sakib Mahmood
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note: 6 pages, 7 figures
Abstract:Outcome-Based Education (OBE) emphasizes the development of specific competencies through student-centered learning. In this study, we reviewed the importance of OBE and implemented transformer-based models, particularly DistilBERT, to analyze an NLP dataset that includes student feedback. Our objective is to assess and improve educational outcomes. Our approach outperforms other machine learning models because it uses the transformer's deep understanding of language context to classify sentiment more accurately, giving better results across a wider range of metrics. Our work directly contributes to OBE's goal of achieving measurable outcomes by facilitating the identification of patterns in student learning experiences. We have also applied LIME (Local Interpretable Model-agnostic Explanations) to ensure that model predictions are interpretable, providing understandable information about how key terms affect sentiment. Our findings indicate that the combination of transformer models and LIME explanations results in a strong and straightforward framework for analyzing student feedback. This aligns closely with the principles of OBE and supports the improvement of educational practices through data-driven insights.
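For readers who want to reproduce the general recipe, a minimal sketch follows, pairing a DistilBERT sentiment pipeline with a LIME text explanation. The checkpoint name is a common public SST-2 model chosen for illustration, not necessarily the one used in the paper.

```python
# pip install transformers lime torch
from transformers import pipeline
from lime.lime_text import LimeTextExplainer
import numpy as np

# Illustrative checkpoint: any DistilBERT sentiment model would do here.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) probability array.
    out = clf(list(texts))
    labels = sorted({d["label"] for d in out[0]})
    return np.array([[next(d["score"] for d in row if d["label"] == l)
                      for l in labels] for row in out])

explainer = LimeTextExplainer(class_names=["NEGATIVE", "POSITIVE"])
exp = explainer.explain_instance(
    "The course projects were engaging but the grading felt opaque.",
    predict_proba, num_features=6)
print(exp.as_list())  # terms most responsible for the predicted sentiment
```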
[NLP-123] Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models
【Quick Read】: This paper targets the difficulty that conventional keyword spotting (KWS) systems have in achieving high accuracy at low false-acceptance rates on resource-constrained edge devices, especially in few-shot (FS) settings. The key of the solution is a training scheme built on self-supervised models: Wav2Vec 2.0 serves as the teacher and is trained with a sub-center ArcFace loss to strengthen inter-class separability and intra-class compactness, while attention-based dimensionality reduction adapts the features for edge deployment and a lightweight ResNet15 student model is trained, substantially improving 10-shot classification accuracy.
Link: https://arxiv.org/abs/2506.17686
Authors: Alican Gok, Oguzhan Buyuksolak, Osman Erman Okman, Murat Saraclar
Affiliations: Boğaziçi University; Analog Devices, Istanbul, Turkey; Analog Devices
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: To be submitted to IEEE Signal Processing Letters, 5 pages, 3 figures
Abstract:Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. Few-Shot Keyword Spotting (FS-KWS) addresses the scalability and adaptability challenges of traditional systems by enabling recognition of custom keywords with only a few examples. However, existing FS-KWS systems achieve subpar accuracy at desirable false acceptance rates, particularly in resource-constrained edge environments. To address these issues, we propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. The teacher model, based on Wav2Vec 2.0, is trained using Sub-center ArcFace loss, which enhances inter-class separability and intra-class compactness. To enable efficient deployment on edge devices, we introduce attention-based dimensionality reduction and train a standard lightweight ResNet15 student model. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets. Notably, the proposed training method improves the 10-shot classification accuracy from 33.4% to 74.1% on 11 classes at a 1% false alarm rate on the GSC dataset, making it significantly better suited to real use-case scenarios.
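The sub-center ArcFace loss mentioned above has a standard formulation that can be sketched directly; the hyperparameters below (K sub-centers, scale s, margin m) are conventional defaults, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubcenterArcFace(nn.Module):
    """Sketch of a sub-center ArcFace head: each class owns K sub-center
    prototypes; a sample scores against its nearest sub-center, and an
    additive angular margin m is applied to the target class."""
    def __init__(self, emb_dim, n_classes, k=3, s=30.0, m=0.5):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_classes, k, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        emb = F.normalize(emb, dim=-1)
        w = F.normalize(self.w, dim=-1)
        # cosine to every sub-center, then max over the K sub-centers
        cos = torch.einsum("bd,ckd->bck", emb, w).amax(dim=-1)   # (B, C)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```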
[NLP-124] PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding
【Quick Read】: This paper targets information decay and semantic fragmentation in long-context processing by large language models (LLMs), problems caused by transient neural activations and unstructured feed-forward network (FFN) weights. The key of the proposed PaceLLM lies in two mechanisms: a Persistent Activity (PA) mechanism that introduces an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, mitigating contextual decay; and Cortical Expert (CE) clustering, which reorganizes FFN weights into semantic modules to establish cross-token dependencies and alleviate semantic fragmentation.
Link: https://arxiv.org/abs/2506.17310
Authors: Kangcong Li, Peng Ye, Chongjun Tu, Lin Zhang, Chunfeng Song, Jiamin Wu, Tao Yang, Qihao Zheng, Tao Chen
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other approaches; moreover, it can be generalized to any model to enhance long-context performance and interpretability without structural overhauls.
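The Persistent Activity mechanism can be pictured as a small key-value store over FFN states. The sketch below is our own minimal reading of "retrieve, reuse, and update critical FFN states"; the capacity, top-k, and similarity rule are all assumptions.

```python
import torch

class ActivationMemoryBank:
    """Hedged sketch of a persistent-activity store: keep a bounded bank of
    (hidden-state key, FFN-state value) pairs, retrieve the nearest stored
    states for the current hidden state, and blend them by similarity."""
    def __init__(self, dim, capacity=256, top_k=4):
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)
        self.capacity, self.top_k = capacity, top_k

    def retrieve(self, h):                       # h: (dim,)
        if self.keys.numel() == 0:
            return torch.zeros_like(h)
        sims = torch.mv(self.keys, h) / (self.keys.norm(dim=1) * h.norm() + 1e-8)
        idx = sims.topk(min(self.top_k, len(sims))).indices
        w = torch.softmax(sims[idx], dim=0)
        return (w.unsqueeze(1) * self.values[idx]).sum(0)

    def update(self, h, ffn_state):              # FIFO eviction past capacity
        self.keys = torch.cat([self.keys, h.unsqueeze(0)])[-self.capacity:]
        self.values = torch.cat([self.values, ffn_state.unsqueeze(0)])[-self.capacity:]
```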
Computer Vision
[CV-0] TC-Light: Temporally Consistent Relighting for Dynamic Long Videos
【Quick Read】: This paper addresses illumination editing in long videos with complex dynamics, which is valuable for visual content creation and manipulation and for scaling up embodied-AI data via sim2real and real2real transfer. Existing video relighting techniques are largely confined to portrait videos or are bottlenecked by temporal consistency and computational efficiency. The key of the proposed TC-Light framework is a two-stage post-optimization mechanism: the first stage optimizes an appearance embedding to align global illumination, and the second stage optimizes the proposed canonical video representation, the Unique Video Tensor (UVT), to align fine-grained texture and lighting.
Link: https://arxiv.org/abs/2506.18904
Authors: Yang Liu, Chuanchen Luo, Zimo Tang, Yingyan Li, Yuran Yang, Yuanyong Ning, Lue Fan, Junran Peng, Zhaoxiang Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shandong University; University of Science and Technology Beijing; Tencent; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL Code: this https URL
Abstract:Editing illumination in long videos with complex dynamics has significant value in various downstream tasks, including visual content creation and manipulation, as well as data scaling up for embodied AI through sim2real and real2real transfer. Nevertheless, existing video relighting techniques are predominantly limited to portrait videos or are bottlenecked by temporal consistency and computational efficiency. In this paper, we propose TC-Light, a novel paradigm characterized by the proposed two-stage post optimization mechanism. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible relighting results with superior temporal coherence and low computation cost. The code and video demos are available at this https URL.
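As a rough illustration of what a stage-1 "appearance embedding" optimization might look like, the sketch below fits a per-frame gain/offset so that each frame's global color statistics match a reference illumination. This is a drastic simplification of TC-Light, with every detail (the embedding form, loss, and statistics) assumed.

```python
import torch

def align_global_illumination(frames, ref_stats, steps=200, lr=1e-2):
    """Toy stage-1-style alignment: frames (T, 3, H, W) in [0, 1];
    ref_stats = (mean, std), each shape (3,). Optimizes one gain/offset
    per frame so global color statistics match the reference."""
    T = frames.shape[0]
    gain = torch.ones(T, 3, 1, 1, requires_grad=True)
    offset = torch.zeros(T, 3, 1, 1, requires_grad=True)
    opt = torch.optim.Adam([gain, offset], lr=lr)
    mu_ref, sd_ref = ref_stats
    for _ in range(steps):
        out = (frames * gain + offset).clamp(0, 1)
        mu, sd = out.mean(dim=(2, 3)), out.std(dim=(2, 3))   # (T, 3) each
        loss = ((mu - mu_ref) ** 2).mean() + ((sd - sd_ref) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return (frames * gain + offset).clamp(0, 1).detach()
```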
[CV-1] VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
【Quick Read】: This paper tackles the difficulty of maintaining scene coherence in video generation and the excessive computational cost of long-term scene synthesis. Existing approaches either out-paint 2D views frame by frame while incrementally reconstructing 3D geometry, which accumulates errors, or rely on video generators with short context windows that struggle to keep scenes consistent over long horizons. The key of the solution is the Surfel-Indexed View Memory (VMem), which indexes past views geometrically by the 3D surface elements (surfels) they observed, enabling efficient retrieval of the most relevant past views, reducing computation, and improving the consistency and controllability of generated scenes.
Link: https://arxiv.org/abs/2506.18903
Authors: Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
Affiliations: University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We propose a novel memory mechanism to build video generators that can explore environments interactively. Similar results have previously been achieved by out-painting 2D views of the scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost of using all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
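The surfel-indexed memory is essentially an inverted index from surfels to views. A data-structure sketch follows, with scoring by shared-surfel count as our assumed retrieval rule; the real system's geometric scoring may differ.

```python
from collections import defaultdict

class SurfelIndexedViewMemory:
    """Sketch: each past view is indexed by the IDs of the surfels (3D
    surface elements) it observed; retrieval ranks past views by how many
    of the query's visible surfels they share."""
    def __init__(self):
        self.surfel_to_views = defaultdict(set)   # surfel id -> view ids
        self.views = {}                           # view id -> payload (image, pose, ...)

    def add_view(self, view_id, payload, visible_surfels):
        self.views[view_id] = payload
        for s in visible_surfels:
            self.surfel_to_views[s].add(view_id)

    def retrieve(self, query_surfels, k=4):
        counts = defaultdict(int)
        for s in query_surfels:
            for v in self.surfel_to_views.get(s, ()):
                counts[v] += 1
        best = sorted(counts, key=counts.get, reverse=True)[:k]
        return [self.views[v] for v in best]
```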
[CV-2] From Virtual Games to Real-World Play
【Quick Read】: This paper addresses interactive video generation in real-world scenarios, in particular how to generate photorealistic, temporally consistent video sequences from user control signals. The key of the solution is RealPlay, a neural-network-based real-world game engine that delivers low-latency feedback through iterative chunk-wise prediction, maintains temporal consistency across iterations, and ensures accurate control response. RealPlay is trained on labeled game data together with unlabeled real-world videos, without real-world action annotations, which enables control signals to transfer from virtual to real scenes and control to generalize to diverse entities such as bicycles and pedestrians.
Link: https://arxiv.org/abs/2506.18901
Authors: Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, Yan Lu
Affiliations: HKUST; Microsoft Research; University of Sydney; Tsinghua University; University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. Project page: this https URL
[CV-3] Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models
【Quick Read】: This paper tackles the lack of visual consistency across multi-panel scenes in story visualization, particularly in how characters and objects persist and evolve through the narrative. The key of the solution is a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations, performing fine-grained panel-level updates through an iterative loop without regenerating the entire sequence.
Link: https://arxiv.org/abs/2506.18900
Authors: Kiymet Akdemir, Tahira Kazimi, Pinar Yanardag
Affiliations: Virginia Tech
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL
Abstract:Story visualization has become a popular task where visual scenes are generated to depict a narrative across multiple panels. A central challenge in this setting is maintaining visual consistency, particularly in how characters and objects persist and evolve throughout the story. Despite recent advances in diffusion models, current approaches often fail to preserve key character attributes, leading to incoherent narratives. In this work, we propose a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations. The agents operate in an iterative loop, enabling fine-grained, panel-level updates without re-generating entire sequences. Our framework is model-agnostic and flexibly integrates with a variety of diffusion models, including rectified flow transformers such as Flux and latent diffusion models such as Stable Diffusion. Quantitative and qualitative experiments show that our method outperforms prior approaches in terms of multi-panel consistency.
[CV-4] FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
【Quick Read】: This paper addresses the failure of current film-generation systems to implement cinematic principles, in particular their lack of diverse camera language and cinematic rhythm, which yields templated visuals and unengaging narratives. The key of the solution is FilMaster, an end-to-end AI system built on two principles: learning cinematography from large amounts of real-world film data, and emulating professional, audience-centric post-production workflows. FilMaster realizes this in two stages, a reference-guided generation stage that turns user input into video clips and a generative post-production stage that orchestrates visual and auditory elements for cinematic rhythm, producing professional-grade, editable film outputs.
Link: https://arxiv.org/abs/2506.18899
Authors: Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, Xihui Liu
Affiliations: The University of Hong Kong; Kuaishou Technology; Microsoft Research; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster’s superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.
[CV-5] 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time
【Quick Read】: This paper asks how 4D pretraining can be scaled to learn general space-time representations that reconstruct an object from a few views at some times and render any view at any time. The key of the solution is 4D-LRM, the first large-scale 4D reconstruction model, which learns a unified space-time representation from unconstrained views and timestamps and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering.
Link: https://arxiv.org/abs/2506.18890
Authors: Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan
Affiliations: Adobe Research; University of Michigan; UNC Chapel Hill; University of Virginia; Oregon State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass with less than 1.5 seconds on a single A100 GPU.
[CV-6] GRAND-SLAM: Local Optimization for Globally Consistent Large-Scale Multi-Agent Gaussian SLAM
【Quick Read】: This paper addresses the challenge of applying 3D Gaussian splatting to RGB-D visual SLAM in large-scale, multi-agent outdoor environments, whereas existing methods are limited to small indoor scenes. The key of the solution is GRAND-SLAM, a collaborative Gaussian-splatting SLAM method that integrates an implicit tracking module based on local optimization over submaps with inter- and intra-robot loop closure embedded in a pose-graph optimization framework.
Link: https://arxiv.org/abs/2506.18885
Authors: Annika Thomas, Aneesa Sonawalla, Alex Rose, Jonathan P. How
Affiliations: Massachusetts Institute of Technology Department of Aeronautics and Astronautics
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian splatting has emerged as an expressive scene representation for RGB-D visual SLAM, but its application to large-scale, multi-agent outdoor environments remains unexplored. Multi-agent Gaussian SLAM is a promising approach to rapid exploration and reconstruction of environments, offering scalable environment representations, but existing approaches are limited to small-scale, indoor environments. To that end, we propose Gaussian Reconstruction via Multi-Agent Dense SLAM, or GRAND-SLAM, a collaborative Gaussian splatting SLAM method that integrates i) an implicit tracking module based on local optimization over submaps and ii) an approach to inter- and intra-robot loop closure integrated into a pose-graph optimization framework. Experiments show that GRAND-SLAM provides state-of-the-art tracking performance and 28% higher PSNR than existing methods on the Replica indoor dataset, as well as 91% lower multi-agent tracking error and improved rendering over existing multi-agent methods on the large-scale, outdoor Kimera-Multi dataset.
[CV-7] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
【Quick Read】: This paper targets universal video temporal grounding, i.e., accurately localizing temporal segments in videos from natural-language queries. Existing methods are usually restricted to specific video domains or durations; the key of the proposed UniTime model is to exploit the strong vision-language understanding of generative multimodal large language models (MLLMs), incorporating temporal information by interleaving timestamp tokens with video tokens and handling videos of different input granularities through adaptive frame scaling, which yields robust temporal grounding for both short and long videos.
Link: https://arxiv.org/abs/2506.18883
Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
Affiliations: SAI (Shanghai AI Laboratory); Shanghai Jiao Tong University; ByteDance Seed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
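Contribution (i), interleaving timestamp tokens with video tokens, is easy to picture. The sketch below uses an assumed textual token format; the paper's actual tokenization is not specified here.

```python
def interleave_timestamps(video_tokens, fps, tokens_per_frame, fmt="<t={:.2f}s>"):
    """Sketch of the interleaving idea: insert a timestamp token before each
    frame's visual tokens so the model can emit precise times. The token
    format string is our own placeholder, not the paper's."""
    out = []
    n_frames = len(video_tokens) // tokens_per_frame
    for i in range(n_frames):
        out.append(fmt.format(i / fps))                       # timestamp token
        out.extend(video_tokens[i * tokens_per_frame:(i + 1) * tokens_per_frame])
    return out

# e.g. 2 frames x 2 tokens at 1 fps:
print(interleave_timestamps(["v0a", "v0b", "v1a", "v1b"], fps=1, tokens_per_frame=2))
# ['<t=0.00s>', 'v0a', 'v0b', '<t=1.00s>', 'v1a', 'v1b']
```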
[CV-8] Light of Normals: Unified Feature Representation for Universal Photometric Stereo
【Quick Read】: This paper targets two core problems in universal photometric stereo: the deep coupling between varying illumination and surface-normal features, where an observed intensity change may stem from either a lighting change or surface orientation, creating ambiguity; and the preservation of high-frequency geometric detail on complex surfaces, where self-shadowing, inter-reflections, and subtle normal variations defeat conventional feature processing. The key of the solution is to disentangle illumination from surface normals and to faithfully recover the fine geometry of intricate surfaces.
Link: https://arxiv.org/abs/2506.18882
Authors: Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao
Affiliations: BAAI; BUAA; NJU; FNii, CUHKSZ; BNU; NII; HKUST; PKU; AIR, THU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it difficult to determine whether brightness variations stem from lighting changes or surface orientation; and 2) the preservation of high-frequency geometric details in complex surfaces, where intricate geometries create self-shadowing, inter-reflections, and subtle normal variations that conventional feature processing operations struggle to capture accurately.
[CV-9] Let Your Video Listen to Your Music!
【Quick Read】: This paper addresses aligning the rhythm of visual motion in a video with a given music track, a task still under-explored in autonomous video editing. Existing methods rely on labor-intensive manual cutting, speed adjustment, or heuristic editing, while generative models that handle joint video-and-music generation tend to entangle the two modalities, limiting the flexibility to align video with musical beats while preserving the full visual content. The key of the proposed MVAA (Music-Video Auto-Alignment) framework is to modularize the task into two steps, aligning motion keyframes with audio beats and then performing rhythm-aware video inpainting, so that the video is edited automatically while the original visual content is preserved.
Link: https://arxiv.org/abs/2506.18881
Authors: Xinyu Zhang, Dong Gong, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Affiliations: The University of Adelaide; University of New South Wales
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: project page: this https URL
Abstract:Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video’s semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within 10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness.
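The first module's beat-to-keyframe mapping can be sketched with librosa's beat tracker; the frame-rounding rule below is our assumption, and the diffusion-based inpainting between keyframes is not shown.

```python
# pip install librosa
import librosa

def keyframe_times_from_beats(audio_path, video_fps=24):
    """Sketch of the beat-alignment step as we read it: detect musical
    beats and map them to the video frame indices where motion keyframes
    would be inserted."""
    y, sr = librosa.load(audio_path)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    keyframe_indices = [round(t * video_fps) for t in beat_times]
    return tempo, beat_times, keyframe_indices
```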
[CV-10] OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
【Quick Read】: This paper addresses audio-driven human animation, where facial motion dominates and it is hard to generate naturally synchronized, fluid full-body animation, and where fine-grained generation lacks precise prompt control. The key of the proposed OmniAvatar model is a pixel-wise multi-hierarchical audio-embedding strategy that better captures audio features in the latent space, improving lip-sync, together with a LoRA-based training approach that incorporates audio features effectively while preserving the base model's prompt-driven controllability.
Link: https://arxiv.org/abs/2506.18866
Authors: Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Project page: this https URL
Abstract:Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is this https URL.
[CV-11] TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
【Quick Read】: This paper targets the limited fine-grained spatio-temporal reasoning of existing multimodal large language models (MLLMs) on satellite image time series. The key of the solution is TAMMs, a temporal-aware multimodal model for satellite image change understanding and forecasting, which augments frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting, and introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively fuses high-level semantic reasoning with structural priors inside an enhanced ControlNet, enabling temporally consistent and semantically grounded image generation.
Link: https://arxiv.org/abs/2506.18862
Authors: Zhongbin Guo, Yuhao Wang, Ping Jian, Xinyue Chen, Wei Peng, Ertai E
Affiliations: Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to the 33rd ACM International Conference on Multimedia. Our dataset can be found at this https URL
Abstract:Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.
[CV-12] RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base IROS2025
【Quick Read】: This paper addresses 6D pose estimation, a key technique for precise object localization in robotic manipulation with important applications such as grasping. The key of the proposed retrieval-augmented framework RAG-6DPose is to integrate visual and geometric cues and use 3D computer-aided design (CAD) models as a knowledge base, improving accuracy and robustness: it builds a multi-modal CAD knowledge base, retrieves relevant CAD features through the ReSPC module, and refines pose predictions with retrieval-augmented decoding of the retrieved CAD information.
Link: https://arxiv.org/abs/2506.18856
Authors: Kuanning Wang, Yuqian Fu, Tianyu Wang, Yanwei Fu, Longfei Liang, Yu-Gang Jiang, Xiangyang Xue
Affiliations: Fudan University; INSAIT, Sofia University "St. Kliment Ohridski"; NeuhHelium Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IROS 2025
Abstract:Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: this https URL .
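Stage 2's retrieval step amounts to nearest-neighbor search over a bank of CAD-rendering descriptors. A minimal NumPy sketch follows, with cosine similarity as an assumed metric; the actual ReSPC module is learned and richer than this.

```python
import numpy as np

class CADFeatureBank:
    """Sketch: store L2-normalized descriptors of multi-view CAD renderings
    (each tagged with a payload such as view pose and attached 3D points)
    and return the entries closest to a query image descriptor."""
    def __init__(self):
        self.feats, self.payloads = [], []

    def add(self, feat, payload):
        self.feats.append(feat / np.linalg.norm(feat))
        self.payloads.append(payload)

    def retrieve(self, query_feat, k=5):
        f = np.stack(self.feats)                 # (N, D)
        q = query_feat / np.linalg.norm(query_feat)
        sims = f @ q                             # cosine similarity
        top = np.argsort(-sims)[:k]
        return [(self.payloads[i], float(sims[i])) for i in top]
```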
[CV-13] Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
【Quick Read】: This paper aims to fix the "copy-paste problem" in subject-to-video generation, i.e., the failure to faithfully follow text instructions that arises because the conventional in-pair training paradigm samples reference images from the same scene as the target video, entangling subject identity with background and contextual attributes. The key of the solution is Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, built with a three-stage pipeline: (1) a general, input-aligned subject detection module, (2) large-scale cross-context subject retrieval over more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Experiments show that training with Phantom-Data markedly improves prompt alignment and visual quality while keeping subject consistency on par with in-pair baselines.
Link: https://arxiv.org/abs/2506.18851
Authors: Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.
[CV-14] Reproducible Evaluation of Camera Auto-Exposure Methods in the Field: Platform, Benchmark and Lessons Learned
【Quick Read】: This paper addresses the limitations of standard datasets for evaluating automatic-exposure (AE) methods: because the input sensors are fixed in nature, methods that actively adjust sensor parameters to suit environmental conditions cannot be compared fairly. The key of the solution is an emulator able to generate images at any exposure time, built on the BorealHDR multi-exposure stereo dataset and its new extension, in which data was acquired along repeated trajectories at different times of day to assess the impact of changing illumination, enabling offline, reproducible benchmarking of AE methods. Experiments show the emulated images reach an RMSE below 1.78% relative to ground-truth images.
Link: https://arxiv.org/abs/2506.18844
Authors: Olivier Gamache, Jean-Michel Fortin, Matěj Boxan, François Pomerleau, Philippe Giguère
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures, pre-print version of the accepted paper for IEEE Transactions on Field Robotics (T-FR)
Abstract:Standard datasets often present limitations, particularly due to the fixed nature of input data sensors, which makes it difficult to compare methods that actively adjust sensor parameters to suit environmental conditions. This is the case with Automatic-Exposure (AE) methods, which rely on environmental factors to influence the image acquisition process. As a result, AE methods have traditionally been benchmarked in an online manner, rendering experiments non-reproducible. Building on our prior work, we propose a methodology that utilizes an emulator capable of generating images at any exposure time. This approach leverages BorealHDR, a unique multi-exposure stereo dataset, along with its new extension, in which data was acquired along a repeated trajectory at different times of the day to assess the impact of changing illumination. In total, BorealHDR covers 13.4 km over 59 trajectories in challenging lighting conditions. The dataset also includes lidar-inertial-odometry-based maps with pose estimation for each image frame, as well as Global Navigation Satellite System (GNSS) data for comparison. We demonstrate that by using images acquired at various exposure times, we can emulate realistic images with a Root-Mean-Square Error (RMSE) below 1.78% compared to ground truth images. Using this offline approach, we benchmarked eight AE methods, concluding that the classical AE method remains the field’s best performer. To further support reproducibility, we provide in-depth details on the development of our backpack acquisition platform, including hardware, electrical components, and performance specifications. Additionally, we share valuable lessons learned from deploying the backpack over more than 25 km across various environments. Our code and dataset are available online at this link: this https URL BorealHDR
[CV-15] LIGHTHOUSE: Fast and precise distance to shoreline calculations from anywhere on earth ICML2025
【Quick Read】: This paper addresses the limited precision of global shoreline-distance computation: existing global coastal datasets come only at coarse resolution (e.g., 1-4 km), which limits their utility. The key of the solution is to combine publicly available satellite imagery with computer vision to produce a global coastline dataset at 10 m resolution, a more than 100-fold precision improvement over existing data, and to handle the computational challenge of querying at this scale with a new library, Lighthouse, which is exceptionally fast and resource-efficient, achieving millisecond online inference with only 1 CPU and 2 GB of RAM and thus suiting real-time applications in resource-constrained environments.
Link: https://arxiv.org/abs/2506.18842
Authors: Patrick Beukema, Henry Herzog, Yawen Zhang, Hunter Pitelka, Favyen Bastani
Affiliations: Allen Institute for AI
Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures, 1 table, ICML 2025 ML4RS
Abstract:We introduce a new dataset and algorithm for fast and efficient coastal distance calculations from Anywhere on Earth (AoE). Existing global coastal datasets are only available at coarse resolution (e.g. 1-4 km) which limits their utility. Publicly available satellite imagery combined with computer vision enable much higher precision. We provide a global coastline dataset at 10 meter resolution, a 100+ fold improvement in precision over existing data. To handle the computational challenge of querying at such an increased scale, we introduce a new library: Layered Iterative Geospatial Hierarchical Terrain-Oriented Unified Search Engine (Lighthouse). Lighthouse is both exceptionally fast and resource-efficient, requiring only 1 CPU and 2 GB of RAM to achieve millisecond online inference, making it well suited for real-time applications in resource-constrained environments.
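Lighthouse's hierarchical search is not spelled out in this listing, but the query it accelerates can be emulated at toy scale with a KD-tree over shoreline points; this is a stand-in for illustration, not the Lighthouse algorithm.

```python
# pip install scipy numpy
import numpy as np
from scipy.spatial import cKDTree

def build_shoreline_index(shoreline_xy):
    """Index shoreline sample points for nearest-distance queries.
    Assumes points are already projected to a planar metric CRS;
    raw lat/lon would need geodesic handling instead."""
    return cKDTree(np.asarray(shoreline_xy))

def distance_to_shore(tree, query_xy):
    dist, _ = tree.query(np.asarray(query_xy))
    return dist

shore = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0)]   # hypothetical coastline samples (meters)
tree = build_shoreline_index(shore)
print(distance_to_shore(tree, (12.0, 3.0)))       # distance to nearest shoreline sample
```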
[CV-16] 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
【Quick Read】: This paper targets insufficient spatio-temporal fusion and low computational efficiency in 4D video generation and reconstruction. The key of the solution is a fused architecture built on a sparse attention pattern that handles spatial and temporal attention within a single layer: tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint, improving the model's grasp of spatio-temporal structure. A Gaussian head, a camera-token replacement algorithm, and additional dynamic layers further strengthen 3D reconstruction, establishing a new state of the art for 4D generation.
Link: https://arxiv.org/abs/2506.18839
Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
Affiliations: Snap Inc.; KAUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
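The sparse attention pattern described here has a simple mask formulation. The sketch below builds the boolean view-time mask under the assumption of one token per (view, timestamp); same-frame attention is subsumed by the same-view and same-time terms.

```python
import torch

def fused_view_time_mask(n_views, n_frames):
    """Sketch of the sparse pattern: a token may attend to tokens at the
    same timestamp (across views) or from the same viewpoint (across time),
    which includes its own frame."""
    V, T = n_views, n_frames
    view = torch.arange(V).repeat_interleave(T)   # view id per token
    time = torch.arange(T).repeat(V)              # timestamp per token
    same_view = view[:, None] == view[None, :]
    same_time = time[:, None] == time[None, :]
    return same_view | same_time                  # (V*T, V*T) boolean mask

print(fused_view_time_mask(2, 3).int())
```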
[CV-17] PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications
【Quick Read】: This paper targets real-time, on-device segmentation for latency-sensitive and privacy-aware applications such as smart glasses and IoT devices. The key of the solution is PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution that builds on a depthwise-separable U-Net and learns from the Segment Anything Model 2 (SAM2) through knowledge distillation and fixed-point prompt encoding. The model reaches 51.9% mIoU on COCO and 44.9% on LVIS, and the quantized model (1.22 MB) runs in 14.3 ms on the Sony IMX500, meeting both the memory and compute constraints of in-sensor deployment.
Link: https://arxiv.org/abs/2506.18807
Authors: Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qi, Michele Magno
Affiliations: ETH Zürich; IBM Research - Zürich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
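The distillation objective for a student mask decoder can be sketched generically; the BCE/MSE mix below is an assumed formulation for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def mask_distillation_loss(student_logits, teacher_logits, gt_mask, alpha=0.5):
    """Sketch of distilling a promptable segmenter: supervise the student's
    mask logits with both the ground-truth mask (BCE) and the teacher's
    soft mask probabilities (MSE on probabilities)."""
    hard = F.binary_cross_entropy_with_logits(student_logits, gt_mask.float())
    soft = F.mse_loss(torch.sigmoid(student_logits),
                      torch.sigmoid(teacher_logits).detach())
    return alpha * hard + (1 - alpha) * soft
```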
[CV-18] OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness
【Quick Read】: This paper addresses the challenges that occlusion and incomplete scene data pose to autonomous-driving perception, via semantic occupancy prediction (SOP), which jointly infers the geometry and semantic labels of a scene from images. Conventional camera-based methods treat all classes equally and rely mainly on local features, which leads to suboptimal predictions, especially for dynamic foreground objects. The key of the proposed Object-Centric SOP (OC-SOP) is to integrate high-level object-centric cues extracted by a detection branch into the SOP pipeline, which significantly improves prediction accuracy for foreground objects and achieves state-of-the-art performance across all categories on SemanticKITTI.
Link: https://arxiv.org/abs/2506.18798
Authors: Helin Cao, Sven Behnke
Affiliations: University of Bonn; Autonomous Intelligent Systems group; Computer Science Institute VI – Intelligent Systems and Robotics; Center for Robotics; Lamarr Institute for Machine Learning and Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: under review
Abstract:Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.
[CV-19] ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs
【Quick Read】: This paper targets dynamic novel-view synthesis from monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. The core of the solution is Video Diffusion-Aware Reconstruction (ViDAR), which leverages personalized diffusion models to synthesize a pseudo multi-view supervision signal for training a Gaussian-splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artifacts introduced by monocular ambiguity, and a diffusion-aware loss function together with a camera-pose optimization strategy aligns the synthetic views with the underlying scene geometry, addressing the spatio-temporal inconsistency of diffusion-based supervision.
Link: https://arxiv.org/abs/2506.18792
Authors: Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Zhensong Zhang, Gregory Slabaugh, Eduardo Pérez-Pellitero
Affiliations: Huawei Noah's Ark Lab; Queen Mary University of London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dynamic Novel View Synthesis aims to generate photorealistic views of moving subjects from arbitrary viewpoints. This task is particularly challenging when relying on monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. We introduce Video Diffusion-Aware Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages personalised diffusion models to synthesise a pseudo multi-view supervision signal for training a Gaussian splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. To address the spatio-temporal inconsistency of diffusion-based supervision, we propose a diffusion-aware loss function and a camera pose optimisation strategy that aligns synthetic views with the underlying scene geometry. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. We further highlight ViDAR’s strong improvement over baselines on dynamic regions and provide a new benchmark to compare performance in reconstructing motion-rich parts of the scene. Project page: this https URL
[CV-20] Focus Your Attention: Towards Data-Intuitive Lightweight Vision Transformers
【Quick Read】: This paper targets Vision Transformers' dependence on extensive computational and memory resources for pre-training and their difficulties with task-specific transfer learning, inefficiencies that stem mainly from the computation-intensive self-attention mechanism. The key of the solution is a Super-Pixel Based Patch Pooling (SPPP) technique that produces semantically rich patch embeddings, effectively reducing architectural complexity and improving efficiency, together with a Light Latent Attention (LLA) module that integrates latent tokens into the attention mechanism and markedly reduces the attention module's time and space complexity.
Link: https://arxiv.org/abs/2506.18791
Authors: Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary
Affiliations: University of Turku
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations coupled with energy inefficiencies mainly arise due to the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich, patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: this https URL).
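The pooling idea behind SPPP can be demonstrated with SLIC superpixels and per-superpixel averaging; SPPP's real embeddings are richer and context-aware, so treat this only as the pooling skeleton.

```python
# pip install scikit-image numpy
import numpy as np
from skimage.segmentation import slic

def superpixel_patch_pool(image, n_segments=196):
    """Sketch: segment the image into superpixels with SLIC, then average
    the pixels inside each superpixel to get one embedding per irregular
    'patch' instead of per fixed grid cell."""
    labels = slic(image, n_segments=n_segments, start_label=0)   # (H, W)
    n = labels.max() + 1
    flat_img = image.reshape(-1, image.shape[-1]).astype(np.float32)
    flat_lab = labels.reshape(-1)
    sums = np.zeros((n, image.shape[-1]), dtype=np.float32)
    counts = np.bincount(flat_lab, minlength=n)[:, None]
    np.add.at(sums, flat_lab, flat_img)
    return sums / np.maximum(counts, 1)            # (n_superpixels, channels)

img = np.random.rand(64, 64, 3)
print(superpixel_patch_pool(img).shape)
```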
[CV-21] 3D Arena: An Open Platform for Generative 3D Evaluation
【Quick Read】: This paper targets the misalignment between automated metrics and human-perceived quality when evaluating generative 3D models: existing benchmarks either rely on image-based metrics that ignore 3D structure or use geometric measures that fail to capture perceptual appeal and real-world utility. The key of the solution is the 3D Arena platform, which evaluates image-to-3D generation models through large-scale human preference collection via pairwise comparisons; it has gathered 123,243 votes across 19 state-of-the-art models, achieves 99.75% user-authenticity verification through statistical fraud detection, and uses an ELO-based ranking system to provide reliable model assessment.
Link: https://arxiv.org/abs/2506.18787
Authors: Dylan Ebert
Affiliations: Hugging Face
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures
Abstract:Evaluating Generative 3D models remains challenging due to misalignment between automated metrics and human perception of quality. Current benchmarks rely on image-based metrics that ignore 3D structure or geometric measures that fail to capture perceptual appeal and real-world utility. To address this gap, we present 3D Arena, an open platform for evaluating image-to-3D generation models through large-scale human preference collection using pairwise comparisons. Since launching in June 2024, the platform has collected 123,243 votes from 8,096 users across 19 state-of-the-art models, establishing the largest human preference evaluation for Generative 3D. We contribute the iso3d dataset of 100 evaluation prompts and demonstrate quality control achieving 99.75% user authenticity through statistical fraud detection. Our ELO-based ranking system provides reliable model assessment, with the platform becoming an established evaluation resource. Through analysis of this preference data, we present insights into human preference patterns. Our findings reveal preferences for visual presentation features, with Gaussian splat outputs achieving a 16.6 ELO advantage over meshes and textured models receiving a 144.1 ELO advantage over untextured models. We provide recommendations for improving evaluation methods, including multi-criteria assessment, task-oriented evaluation, and format-aware comparison. The platform's community engagement establishes 3D Arena as a benchmark for the field while advancing understanding of human-centered evaluation in Generative 3D.
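For reference, the ELO-style update behind such a ranking, in its textbook form (3D Arena's exact K-factor and tie handling are assumptions we do not make):

```python
def elo_update(r_a, r_b, winner, k=32.0):
    """Minimal ELO update for one pairwise preference vote.
    winner: 'a', 'b', or 'tie'."""
    exp_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    r_a_new = r_a + k * (score_a - exp_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return r_a_new, r_b_new

print(elo_update(1500.0, 1500.0, "a"))  # (1516.0, 1484.0)
```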
[CV-22] SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving
【Quick Read】: This paper addresses incomplete environmental information in autonomous driving caused by occlusion and data sparsity, inferring the occupancy state and semantics of unobserved regions via semantic occupancy prediction (SOP). The key of the solution is Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention computation, strengthening geometric awareness and delivering significant gains on both LiDAR- and camera-based SOP tasks.
Link: https://arxiv.org/abs/2506.18785
Authors: Helin Cao, Rafael Materla, Sven Behnke
Affiliations: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: under review
Abstract:Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.
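One common way to incorporate local spatial context into attention is a learned relative-position bias added to window-attention logits; the sketch below shows that generic mechanism, which may well differ from SWA's actual design.

```python
import torch
import torch.nn as nn

class SpatiallyAwareWindowAttention(nn.Module):
    """Sketch: window attention whose logits receive a learned bias indexed
    by the relative (dx, dy) offset between tokens, giving the attention an
    explicit spatial prior."""
    def __init__(self, dim, window=4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        n = 2 * window - 1
        self.rel_bias = nn.Parameter(torch.zeros(n * n))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window), torch.arange(window), indexing="ij")).flatten(1)
        rel = coords[:, :, None] - coords[:, None, :]            # (2, W*W, W*W)
        idx = (rel[0] + window - 1) * n + (rel[1] + window - 1)
        self.register_buffer("bias_idx", idx)                    # (W*W, W*W)

    def forward(self, x):                                        # x: (B, W*W, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-1, -2) * self.scale
        logits = logits + self.rel_bias[self.bias_idx]           # spatial prior
        return torch.softmax(logits, dim=-1) @ v

y = SpatiallyAwareWindowAttention(64)(torch.randn(2, 16, 64))    # (2, 16, 64)
```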
[CV-23] USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways IROS
【Quick Read】: This paper addresses object tracking in inland waterways in support of safe and cost-effective applications such as waterborne transportation, sightseeing tours, environmental monitoring, and surface rescue. The key of the solution is USVTrack, the first 4D radar-camera tracking dataset tailored to autonomous driving in new-generation waterborne transportation systems, together with a simple but effective radar-camera matching method (RCM) that can be plugged into popular two-stage association trackers, improving tracking accuracy and reliability in waterborne environments by fusing multi-sensor data.
Link: https://arxiv.org/abs/2506.18737
Authors: Shanliang Yao, Runwei Guan, Yi Ni, Sen Xu, Yong Yue, Xiaohui Zhu, Ryan Wen Liu
Affiliations: XJTLU AI University Research Centre; Jiangsu Province Engineering Research Centre of Data Science and Cognitive Computation at XJTLU; SIP AI innovation platform; The Hong Kong University of Science and Technology (Guangzhou); School of Advanced Technology, Xi'an Jiaotong-Liverpool University; School of Information Engineering, Yancheng Institute Technology; School of Navigation, Wuhan University of Technology; Hubei Key Laboratory of Inland Shipping Technology; State Key Laboratory of Maritime Technology and Safety
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IROS
Abstract:Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse waterways, varying times of day, and multiple weather and lighting conditions. Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is public at this https URL.
[CV-24] Deep CNN Face Matchers Inherently Support Revocable Biometric Templates
【Quick Read】: This paper tackles a core concern in biometric authentication: unlike a password, a compromised biometric cannot be revoked and reissued. The key of the solution is to exploit a property of modern deep-CNN face matchers to build a revocable biometric scheme: from a given backbone and training set one can generate an unlimited number of model instances that have similar recognition power yet strongly incompatible biometric templates, so a compromised template becomes worthless once revoked while the user simply re-enrolls with a new template and keeps using the system.
Link: https://arxiv.org/abs/2506.18731
Authors: Aman Bhatta, Michael C. King, Kevin W. Bowyer
Affiliations: University of Notre Dame; Florida Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:One common critique of biometric authentication is that if an individual’s biometric is compromised, then the individual has no recourse. The concept of revocable biometrics was developed to address this concern. A biometric scheme is revocable if an individual can have their current enrollment in the scheme revoked, so that the compromised biometric template becomes worthless, and the individual can re-enroll with a new template that has similar recognition power. We show that modern deep CNN face matchers inherently allow for a robust revocable biometric scheme. For a given state-of-the-art deep CNN backbone and training set, it is possible to generate an unlimited number of distinct face matcher models that have both (1) equivalent recognition power, and (2) strongly incompatible biometric templates. The equivalent recognition power extends to the point of generating impostor and genuine distributions that have the same shape and placement on the similarity dimension, meaning that the models can share a similarity threshold for a 1-in-10,000 false match rate. The biometric templates from different model instances are so strongly incompatible that the cross-instance similarity score for images of the same person is typically lower than the same-instance similarity score for images of different persons. That is, a stolen biometric template that is revoked is of less value in attempting to match the re-enrolled identity than the average impostor template. We also explore the feasibility of using a Vision Transformer (ViT) backbone-based face matcher in the revocable biometric system proposed in this work and demonstrate that it is less suitable compared to typical ResNet-based deep CNN backbones.
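The incompatibility property has a direct operational test: embed the same person's images with two different model instances and check that cross-instance similarity stays below the shared match threshold. A hedged sketch with assumed embedding functions:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def templates_incompatible(embed_a, embed_b, images_same_person, threshold):
    """embed_a / embed_b: embedding functions of two model instances
    (hypothetical callables, image -> 1-D feature vector). Returns True
    if a stolen template from instance A would not match the same person
    re-enrolled under instance B at the given threshold."""
    scores = [cosine(embed_a(img), embed_b(img)) for img in images_same_person]
    return all(s < threshold for s in scores)
```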
[CV-25] TDACloud: Point Cloud Recognition Using Topological Data Analysis
【Quick Read】: This paper addresses point-cloud-based object/place recognition, in particular the challenge of extracting matchable, meaningful local descriptors when the query point cloud is noisy or transformed (e.g., rotated). The key of the proposed TDACloud method is to use topological data analysis (TDA) to extract local descriptors from a point cloud without resource-intensive GPU-based machine-learning training: the ATOL vectorization turns a raw point cloud into a fixed-size TDA descriptor vector, which proves effective on real-world and realistic datasets for object and place recognition and remains accurate under noise and transformations.
Link: https://arxiv.org/abs/2506.18725
Authors: Anirban Ghosh, Ian Dahlin, Ayan Dutta
Affiliations: University of North Florida
Subjects: Robotics (cs.RO); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Point cloud-based object/place recognition remains a problem of interest in applications such as autonomous driving, scene reconstruction, and localization. Extracting meaningful local descriptors from a query point cloud that can be matched with the descriptors of the collected point clouds is a challenging problem. Furthermore, when the query point cloud is noisy or has been transformed (e.g., rotated), it adds to the complexity. To this end, we propose a novel methodology, named TDACloud, using Topological Data Analysis (TDA) for local descriptor extraction from a point cloud, which does not need resource-intensive GPU-based machine learning training. More specifically, we used the ATOL vectorization method to generate vectors for point clouds. Unlike voxelization, our proposed technique can take raw point clouds as inputs and outputs a fixed-size TDA-descriptor vector. To test the quality of the proposed TDACloud technique, we have implemented it on multiple real-world (e.g., Oxford RobotCar, KITTI-360) and realistic (e.g., ShapeNet) point cloud datasets for object and place recognition. We have also tested TDACloud on noisy and transformed test cases where the query point cloud has been scaled, translated, or rotated. Our results demonstrate high recognition accuracies in noisy conditions and large-scale real-world place recognition while outperforming the baselines by up to approximately 14%.
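With the gudhi library, the diagram-to-ATOL pipeline the abstract describes can be sketched as follows; the edge length, homology dimension, and number of ATOL centers are illustrative choices, and the sketch assumes each cloud yields a non-empty dimension-1 diagram.

```python
# pip install gudhi scikit-learn numpy
import numpy as np
import gudhi
from gudhi.representations import Atol
from sklearn.cluster import KMeans

def tda_descriptors(point_clouds, n_centers=4):
    """Sketch: build a Rips complex per point cloud, take its dimension-1
    persistence diagram, and vectorize all diagrams with ATOL into
    fixed-size descriptor vectors."""
    diagrams = []
    for pts in point_clouds:
        rips = gudhi.RipsComplex(points=pts, max_edge_length=2.0)
        st = rips.create_simplex_tree(max_dimension=2)
        st.compute_persistence()
        diagrams.append(st.persistence_intervals_in_dimension(1))
    atol = Atol(quantiser=KMeans(n_clusters=n_centers, n_init=10))
    atol.fit(diagrams)
    return atol.transform(diagrams)               # (n_clouds, n_centers)

clouds = [np.random.rand(100, 3) for _ in range(3)]
print(tda_descriptors(clouds).shape)
```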
[CV-26] Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition IJCNN
【Quick Read】: This paper addresses how conventional skeleton-based action recognition loses keypoint semantics in complex interactions, limiting its effectiveness for cobot-assisted assembly tasks in Industry 4.0. The key of the solution is to enrich the input representation with word embeddings, replacing one-hot encodings with semantic volumes so that the model can capture meaningful relationships between joints and objects.
Link: https://arxiv.org/abs/2506.18721
Authors: Dustin Aganian, Erik Franze, Markus Eisenbach, Horst-Michael Gross
Affiliations: Ilmenau University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: IEEE International Joint Conference on Neural Networks (IJCNN) 2025
Abstract:Effective human action recognition is widely used for cobots in Industry 4.0 to assist in assembly tasks. However, conventional skeleton-based methods often lose keypoint semantics, limiting their effectiveness in complex interactions. In this work, we introduce a novel approach to skeleton-based action recognition that enriches input representations by leveraging word embeddings to encode semantic information. Our method replaces one-hot encodings with semantic volumes, enabling the model to capture meaningful relationships between joints and objects. Through extensive experiments on multiple assembly datasets, we demonstrate that our approach significantly improves classification performance, and enhances generalization capabilities by simultaneously supporting different skeleton types and object classes. Our findings highlight the potential of incorporating semantic information to enhance skeleton-based action recognition in dynamic and diverse environments.
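The one-hot-to-embedding substitution is simple to show; the toy word vectors below are hypothetical, and any pretrained lookup (e.g., GloVe) could stand in.

```python
import numpy as np

def semantic_channels(entity_names, word_vectors):
    """Sketch: instead of identifying each joint/object channel by a
    one-hot vector, stack its word embedding, so channels for related
    entities (e.g. 'hand' and 'screwdriver') end up close together."""
    emb = np.stack([word_vectors[name] for name in entity_names])   # (N, D)
    one_hot = np.eye(len(entity_names))                             # (N, N) baseline
    return emb, one_hot

# hypothetical 4-dim toy embeddings
vecs = {"hand": np.array([0.9, 0.1, 0.0, 0.2]),
        "screwdriver": np.array([0.8, 0.2, 0.1, 0.3]),
        "elbow": np.array([0.1, 0.9, 0.0, 0.1])}
emb, oh = semantic_channels(list(vecs), vecs)
print(emb.shape, oh.shape)  # (3, 4) (3, 3)
```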
[CV-27] Matrix-Game: Interactive World Foundation Model
【Quick Read】: This paper addresses controllable game-world generation: achieving precise control over character actions, camera movements, and physical rules while maintaining high visual quality and temporal coherence. The key of the solution is Matrix-Game, an interactive world foundation model trained with a two-stage pipeline, large-scale unlabeled pretraining for environment understanding followed by action-labeled training for interactive video generation, which adopts a controllable image-to-world generation paradigm conditioned on a reference image, motion context, and user actions.
Link: https://arxiv.org/abs/2506.18701
Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou
Affiliations: Skywork AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report
Abstract:We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at this https URL.
zh
[CV-28] SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification
【速读】: This paper addresses the challenges that conventional 2D image-based classification methods face on digitized specimens, including heterogeneous backgrounds, interfering non-plant elements, and occlusions. The key to the solution is a pixel-to-point transformation that converts 2D object masks into 3D point cloud representations, fusing texture-based and geometric features to improve classification performance.
链接: https://arxiv.org/abs/2506.18683
作者: Youcef Sklab,Hanane Ariouat,Eric Chenin,Edi Prifti,Jean-Daniel Zucker
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 9 figures, 14 tables
Abstract:We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitized herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.
zh
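The pixel-to-point idea can be sketched with a pinhole back-projection. Note the assumption: SIM-Net infers the point cloud from the RGB image itself, whereas this toy uses a given depth map purely to illustrate how a 2D mask becomes a 3D point set for the PointNet branch.

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """depth, mask: (H, W) arrays -> (N, 3) points for mask==True pixels."""
    v, u = np.nonzero(mask)      # pixel coordinates inside the object mask
    z = depth[v, u]
    x = (u - cx) * z / fx        # pinhole camera back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

H, W = 64, 64
depth = np.full((H, W), 2.0)                     # toy constant depth
mask = np.zeros((H, W), dtype=bool)
mask[16:48, 16:48] = True                        # segmentation-derived mask
pts = mask_to_point_cloud(depth, mask, fx=60.0, fy=60.0, cx=W / 2, cy=H / 2)
print(pts.shape)  # (1024, 3), ready for a PointNet-style geometric encoder
```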
[CV-29] Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios
【速读】: This paper addresses the challenge of efficiently processing high-dimensional spectral data for autonomous driving (AD), particularly to improve environmental perception under complex weather and lighting conditions. The key to the solution is a Multi-scale Spectral Attention Module (MSAM) that strengthens spectral feature extraction through three parallel 1D convolutions with kernel sizes ranging from 1 to 11, coupled with an adaptive feature aggregation mechanism. Integrated into UNet's skip connections (UNet-SC), the resulting UNet-MSAM significantly improves semantic segmentation performance across multiple hyperspectral imaging (HSI) datasets.
链接: https://arxiv.org/abs/2506.18682
作者: Imad Ali Shah,Jiarong Li,Tim Brophy,Martin Glavin,Edward Jones,Enda Ward,Brian Deegan
机构: University of Galway (爱尔兰国立高威大学); Valeo Vision Systems (维宁韦来视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in autonomous driving (AD) have highlighted the potential of Hyperspectral Imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing its high-dimensional spectral data remains a significant challenge. This paper introduces a Multi-scale Spectral Attention Module (MSAM) that enhances spectral feature extraction through three parallel 1D convolutions with kernel sizes ranging from 1 to 11, coupled with an adaptive feature aggregation mechanism. By integrating MSAM into UNet’s skip connections (UNet-SC), our proposed UNet-MSAM achieves significant improvements in semantic segmentation performance across multiple HSI datasets: HyKo-VIS v2, HSI-Drive v2, and Hyperspectral City v2. Our comprehensive experiments demonstrate that with minimal computational overhead (on average 0.02% in parameters and 0.82% in GFLOPs), UNet-MSAM consistently outperforms UNet-SC, achieving average improvements of 3.61% in mean IoU and 3.80% in mF1 across the three datasets. Through extensive ablation studies, we have established that multi-scale kernel combinations perform better than single-scale configurations. These findings demonstrate the potential of HSI processing for AD and provide valuable insights into designing robust, multi-scale spectral feature extractors for real-world applications.
zh
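The MSAM design above maps naturally to a few lines of PyTorch. The sketch below is a hedged reconstruction, not the released module: kernel sizes 3/7/11 are illustrative picks from the stated 1-11 range, and "adaptive aggregation" is realized here as a learned 1x1 mixing convolution followed by a sigmoid gate over the spectral axis.

```python
import torch
import torch.nn as nn

class MSAM(nn.Module):
    def __init__(self, bands: int, kernels=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, 1, k, padding=k // 2, bias=False) for k in kernels
        )
        self.mix = nn.Conv1d(len(kernels), 1, 1)  # adaptive branch aggregation
        self.gate = nn.Sigmoid()

    def forward(self, x):                         # x: (B, C=bands, H, W)
        B, C, H, W = x.shape
        spec = x.permute(0, 2, 3, 1).reshape(-1, 1, C)      # per-pixel spectra
        multi = torch.cat([b(spec) for b in self.branches], dim=1)
        attn = self.gate(self.mix(multi))                    # (B*H*W, 1, C)
        attn = attn.reshape(B, H, W, C).permute(0, 3, 1, 2)
        return x * attn                                      # reweight the bands

x = torch.randn(2, 25, 8, 8)     # e.g., 25 spectral bands
print(MSAM(bands=25)(x).shape)   # torch.Size([2, 25, 8, 8])
```

A module of this shape adds almost no parameters relative to a UNet backbone, consistent with the sub-percent overhead the abstract reports.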
[CV-30] DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling SIGGRAPH2025
【速读】: This paper tackles generating interactive two-person dances from music, where the core challenge lies in the inherent complexity of duet interactions: the dancers must synchronize both with each other and with the music. The key to the solution is a two-stage framework that first encodes two-person motion into discrete tokens and then generates these tokens from music. To effectively capture intricate interactions, the two dancers' motions are represented as a unified whole, and a coarse-to-fine learning strategy is adopted in both stages, enabling high-quality dance generation.
链接: https://arxiv.org/abs/2506.18680
作者: Anindita Ghosh,Bing Zhou,Rishabh Dabral,Jian Wang,Vladislav Golyanik,Christian Theobalt,Philipp Slusallek,Chuan Guo
机构: DFKI (German Research Center for Artificial Intelligence); MPI for Informatics (Max Planck Institute for Informatics); SIC, Germany (Saarland Informatics Campus); Snap Inc.
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 11 pages, 7 figures, 2 tables, accepted in ACM Siggraph 2025 conference track
Abstract:We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers’ motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both the stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to learn to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination.
zh
[CV-31] MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation
【速读】: This paper addresses the shortcomings of conventional pixel-based medical image segmentation in enforcing topological constraints and in holistic structural awareness of anatomical regions. The key to the solution is to reformulate segmentation as a multi-agent cooperation task that generates topologically consistent object-level contours. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, adapting to the blurred edges and intricate morphologies common in medical images, and the iterative process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm with an Entropy Regularization Adjustment Mechanism (ERAM) to improve segmentation accuracy and robustness.
链接: https://arxiv.org/abs/2506.18679
作者: Ruicheng Zhang,Yu Sun,Zeyu Zhang,Jinai Li,Xiaofan Liu,Au Hoi Fan,Haowei Guo,Puxin Yan
机构: Sun Yat-sen University (中山大学); The Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods which can lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM) which dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making. Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential as an accurate and robust tool for clinical application.
zh
[CV-32] MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation
【速读】: This paper addresses the poor performance of existing neural implicit scene representations on large-scale scenes and long sequences, as well as the inability of NeRF-based multi-agent SLAM frameworks to meet communication bandwidth constraints. The key to the solution is the first distributed multi-agent collaborative neural SLAM framework with a hybrid scene representation, comprising distributed camera tracking, an intra-to-inter loop closure mechanism, and an online distillation method for fusing multiple submaps, together with a novel triplane-grid joint scene representation that improves reconstruction accuracy.
链接: https://arxiv.org/abs/2506.18678
作者: Tianchen Deng,Guole Shen,Xun Chen,Shenghai Yuan,Hongming Shen,Guohao Peng,Zhenyu Wu,Jingchuan Wang,Lihua Xie,Danwei Wang,Hesheng Wang,Weidong Chen
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios, and struggle with large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world Dense SLAM (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance research in SLAM, 3D reconstruction, and visual foundation models. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset and code will be open-sourced at this https URL.
zh
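To ground the "triplane-grid joint representation" above, here is one plausible form (our assumption, not the authors' implementation): per-point features gathered from three axis-aligned feature planes are summed and concatenated with features trilinearly sampled from a coarse voxel grid.

```python
import torch
import torch.nn.functional as F

C, R = 8, 32                                    # feature channels, plane resolution
planes = {ax: torch.randn(1, C, R, R) for ax in ("xy", "xz", "yz")}
grid3d = torch.randn(1, C, 16, 16, 16)          # coarse dense voxel grid

def sample_plane(plane, uv):                    # uv in [-1, 1], shape (N, 2)
    g = uv.view(1, -1, 1, 2)
    return F.grid_sample(plane, g, align_corners=True).squeeze(-1)[0].T

def triplane_grid_features(p):                  # p: (N, 3) points in [-1, 1]^3
    f = (sample_plane(planes["xy"], p[:, [0, 1]])
         + sample_plane(planes["xz"], p[:, [0, 2]])
         + sample_plane(planes["yz"], p[:, [1, 2]]))
    g = F.grid_sample(grid3d, p.view(1, -1, 1, 1, 3),
                      align_corners=True).view(C, -1).T
    return torch.cat([f, g], dim=-1)            # (N, 2C) feature per query point

pts = torch.rand(100, 3) * 2 - 1
print(triplane_grid_features(pts).shape)        # torch.Size([100, 16])
```

Factorizing the volume into planes keeps memory near O(R^2) instead of O(R^3), which is also what makes submap exchange under tight communication budgets plausible.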
[CV-33] Reconstructing Tornadoes in 3D with Gaussian Splatting
【速读】: This paper addresses the key problem of accurately reconstructing the 3D structure of tornadoes, which is critical for understanding and preparing for this highly destructive weather phenomenon. The current lack of a controlled tornado dataset has limited the development and validation of 3D scene reconstruction techniques such as 3D Gaussian splatting (3DGS). The key to the solution is to capture and release a multiview dataset of a small lab-based tornado and to demonstrate that 3DGS can effectively reconstruct and visualize its 3D structure.
链接: https://arxiv.org/abs/2506.18677
作者: Adam Yang,Nadula Kadawedduwa,Tianfu Wang,Maria Molina,Christopher Metzler
机构: University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately reconstructing the 3D structure of tornadoes is critically important for understanding and preparing for this highly destructive weather phenomenon. While modern 3D scene reconstruction techniques, such as 3D Gaussian splatting (3DGS), could provide a valuable tool for reconstructing the 3D structure of tornadoes, at present we are critically lacking a controlled tornado dataset with which to develop and validate these tools. In this work we capture and release a novel multiview dataset of a small lab-based tornado. We demonstrate that one can effectively reconstruct and visualize the 3D structure of this tornado using 3DGS.
zh
[CV-34] TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography
【速读】: This paper addresses three main issues in group dance generation: multi-dancer collisions, single-dancer foot sliding, and abrupt position swapping in long group dance generation. The key to the solution is the proposed TCDiff++ framework, which introduces a dancer positioning embedding to maintain relative positions among dancers together with a distance-consistency loss that keeps inter-dancer distances within plausible ranges; reduces foot sliding via a swap mode embedding that indicates swapping patterns and a Footwork Adaptor that refines raw motion; and, for long group dance generation, adopts a long group diffusion sampling strategy combined with a Sequence Decoder layer to better handle long sequences, enabling high-quality and coherent group dance generation.
链接: https://arxiv.org/abs/2506.18671
作者: Yuqin Dai,Wanlu Zhu,Ronghui Li,Xiu Li,Zhenyu Zhang,Jun Li,Jian Yang
机构: Nanjing University of Science and Technology (南京理工大学); Tsinghua University (清华大学)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
备注:
Abstract:Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding and abrupt swapping in the generation of long group dance. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to better maintain the relative positioning among dancers. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address the issue of single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. Furthermore, we integrate a Sequence Decoder layer to enhance the model’s ability to selectively process long sequences. Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, ensuring high-quality and coherent group dance generation.
zh
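The distance-consistency loss mentioned above admits a simple reading (a hedged sketch, not the released objective): penalize any pair of dancers whose root distance leaves a plausible [d_min, d_max] band, discouraging both collisions and drifting apart.

```python
import torch

def distance_consistency_loss(roots, d_min=0.5, d_max=4.0):
    """roots: (T, N, 3) root-joint trajectories of N dancers over T frames."""
    diff = roots[:, :, None, :] - roots[:, None, :, :]    # (T, N, N, 3)
    dist = diff.norm(dim=-1)                              # pairwise distances
    too_close = (d_min - dist).clamp(min=0.0)             # collision penalty
    too_far = (dist - d_max).clamp(min=0.0)               # drift-apart penalty
    mask = ~torch.eye(roots.shape[1], dtype=torch.bool)   # drop self-pairs
    return ((too_close + too_far) ** 2)[:, mask].mean()

roots = torch.randn(60, 4, 3)   # 60 frames, 4 dancers
print(distance_consistency_loss(roots))
```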
[CV-35] MedSeg-R: Medical Image Segmentation with Clinical Reasoning
【速读】: This paper addresses the difficulty of segmenting small lesions in medical images caused by overlapping anatomies with ambiguous boundaries and a severe imbalance between foreground and background classes. The key to the solution is MedSeg-R, a lightweight dual-stage framework inspired by clinical reasoning: a cognitive stage converts medical reports into structured semantic priors (location, texture, shape) fused via a transformer module, and a perceptual stage uses these priors to modulate the Segment Anything Model (SAM) backbone through spatial attention, dynamic convolution, and deformable sampling, improving generalization to low-contrast or overlapping targets.
链接: https://arxiv.org/abs/2506.18669
作者: Hao Shao,Qibin Hou
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image segmentation is challenging due to overlapping anatomies with ambiguous boundaries and a severe imbalance between the foreground and background classes, which particularly affects the delineation of small lesions. Existing methods, including encoder-decoder networks and prompt-driven variants of the Segment Anything Model (SAM), rely heavily on local cues or user prompts and lack integrated semantic priors, thus failing to generalize well to low-contrast or overlapping targets. To address these issues, we propose MedSeg-R, a lightweight, dual-stage framework inspired by clinical reasoning. Its cognitive stage interprets the medical report into structured semantic priors (location, texture, shape), which are fused via a transformer block. In the perceptual stage, these priors modulate the SAM backbone: spatial attention highlights likely lesion regions, dynamic convolution adapts feature filters to expected textures, and deformable sampling refines spatial support. By embedding this fine-grained guidance early, MedSeg-R disentangles inter-class confusion and amplifies minority-class cues, greatly improving sensitivity to small lesions. On challenging benchmarks, MedSeg-R produces large Dice improvements in overlapping and ambiguous structures, demonstrating plug-and-play compatibility with SAM-based systems.
zh
[CV-36] Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping
【速读】: This paper addresses how to effectively evaluate histopathology foundation models (FMs) as patch-level feature extractors within a multiple instance learning (MIL) framework in computational pathology. The key to the solution is a novel benchmark for histopathology foundation models together with the Foundation Model - Silhouette Index (FM-SI), a new metric that measures model consistency under distribution shifts, enabling a more accurate assessment of robustness and generalization in real-world use.
链接: https://arxiv.org/abs/2506.18668
作者: Pablo Meseguer,Rocío del Amor,Valery Naranjo
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for oral presentation at Medical Image Understanding and Analysis (MIUA) 2025
Abstract:Pretraining on large-scale, in-domain datasets grants histopathology foundation models (FM) the ability to learn task-agnostic data representations, enhancing transfer learning on downstream tasks. In computational pathology, automated whole slide image analysis requires multiple instance learning (MIL) frameworks due to the gigapixel scale of the slides. The diversity among histopathology FMs has highlighted the need to design real-world challenges for evaluating their effectiveness. To bridge this gap, our work presents a novel benchmark for evaluating histopathology FMs as patch-level feature extractors within a MIL classification framework. For that purpose, we leverage the AI4SkIN dataset, a multi-center cohort encompassing slides with challenging cutaneous spindle cell neoplasm subtypes. We also define the Foundation Model - Silhouette Index (FM-SI), a novel metric to measure model consistency against distribution shifts. Our experimentation shows that extracting less biased features enhances classification performance, especially in similarity-based MIL classifiers.
zh
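The abstract names FM-SI but does not spell out its formula, so the following is only one natural reading, offered as an assumption: score patch features by how separable they are with respect to acquisition-center labels using the silhouette index; a value near zero then indicates center-invariant (less biased) features.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
feats_a = rng.normal(0.0, 1.0, size=(200, 64))   # center A patch features
feats_b = rng.normal(0.3, 1.0, size=(200, 64))   # center B (shifted scanner)
X = np.vstack([feats_a, feats_b])
center_id = np.array([0] * 200 + [1] * 200)

score = silhouette_score(X, center_id)  # in [-1, 1]; near 0 => center-invariant
print(f"silhouette w.r.t. center labels: {score:.3f}")
```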
[CV-37] Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation
【速读】: This paper addresses two key problems in automated pathology report generation from Whole Slide Images (WSIs): the lack of semantic content in visual features and the inherent information redundancy of WSIs. The key to the solution is BiGen, a historical-report-guided bi-modal concurrent learning framework that retrieves rich semantic content via a knowledge retrieval mechanism and dynamically extracts key visual features and retrieved knowledge via a bi-modal concurrent learning strategy, in which weight-shared layers enable cross-modal alignment between visual features and knowledge features, so that comprehensive diagnostic reports can be generated.
链接: https://arxiv.org/abs/2506.18658
作者: Ling Zhang,Boxiang Yun,Qingli Li,Yan Wang
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose BiGen, a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation emulating pathologists’ diagnostic reasoning, consisting of: (1) A knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches and (2) A bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework’s superiority, achieving state-of-the-art performance with 7.4% relative improvement in NLP metrics and 19.1% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method’s ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at this https URL.
zh
[CV-38] RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
【速读】: This paper addresses the difficulty of faithfully reproducing real-world physics in video generation, despite notable progress in visual quality. The key to the solution is Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real videos: a pretrained generator reverse-samples real video sequences to automatically construct preference pairs that are statistically distinguishable in terms of physical correctness, and a multi-stage iterative training schedule guides the generator to obey physical laws increasingly well, improving the action coherence and physical realism of generated videos.
链接: https://arxiv.org/abs/2506.18655
作者: Wenxu Qian,Chaoyue Wang,Hou Peng,Zhiyu Tan,Hao Li,Anxiang Zeng
机构: Fudan University (复旦大学); Shopee Inc (虾皮科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures
Abstract:Video generation techniques have achieved remarkable advancements in visual quality, yet faithfully reproducing real-world physics remains elusive. Preference-based model post-training may improve physical consistency, but requires costly human-annotated datasets or reward models that are not yet feasible. To address these challenges, we present Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real-world videos. Specifically, the proposed RDPO reverse-samples real video sequences with a pre-trained generator to automatically build preference pairs that are statistically distinguishable in terms of physical correctness. A multi-stage iterative training schedule then guides the generator to obey physical laws increasingly well. Benefiting from the dynamic information explored from real videos, our proposed RDPO significantly improves the action coherence and physical realism of the generated videos. Evaluations on multiple benchmarks and human evaluations have demonstrated that RDPO achieves improvements across multiple dimensions. The source code and demonstration of this paper are available at: this https URL
zh
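RDPO's exact objective is not reproduced above; as a hedged illustration of how automatically built preference pairs could be consumed, here is a standard DPO-style loss over per-video sequence log-likelihoods from the trainable generator and a frozen reference copy, where the "winner" is the physically plausible clip and the "loser" its reverse-sampled counterpart.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """All inputs: (B,) sequence log-probs; minimizing prefers the winners."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

logp_w, logp_l = torch.randn(8), torch.randn(8) - 1.0   # generator log-probs
ref_w, ref_l = torch.randn(8), torch.randn(8)           # frozen reference copy
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```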
[CV-39] BulletGen: Improving 4D Reconstruction with Bullet-Time Generation
【速读】: This paper tackles the highly ill-posed problem of turning casually captured monocular videos into fully immersive dynamic scenes, whose main challenges include reconstructing unseen regions and handling the ambiguity of monocular depth estimation. The key to the solution is BulletGen, which uses generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation: the output of a diffusion-based video generation model is aligned with the 4D reconstruction at a single frozen "bullet-time" step, and the generated frames then supervise the optimization of the 4D Gaussian model, seamlessly blending generative content with both static and dynamic scene components.
链接: https://arxiv.org/abs/2506.18601
作者: Denys Rozumnyi,Jonathon Luiten,Numair Khan,Johannes Schönberger,Peter Kontschieder
机构: Meta Reality Labs (Meta 沉浸式现实实验室)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation. In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen “bullet-time” step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis, and 2D/3D tracking tasks.
zh
[CV-40] SpaNN: Detecting Multiple Adversarial Patches on CNNs by Spanning Saliency Thresholds
【速读】: This paper addresses the vulnerability of current convolutional neural network models for object detection and image classification to physically realizable adversarial perturbations such as patch attacks. Existing defenses mainly target single-patch attacks and fail to handle multi-patch scenarios effectively, falling short in either computational efficiency or detection performance. The key to the solution is SpaNN, which applies a set of saliency thresholds to the neural activations of the victim model's first convolutional layer to build an ensemble of binarized feature maps, clusters the ensemble, and uses the cluster features as input to a classifier for attack detection. Unlike existing methods, SpaNN does not rely on a fixed saliency threshold to identify adversarial regions, which strengthens its robustness against white-box adversarial attacks.
链接: https://arxiv.org/abs/2506.18591
作者: Mauricio Byrd Victorica,György Dán,Henrik Sandberg
机构: KTH Royal Institute of Technology (KTH皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML2025)
Abstract:State-of-the-art convolutional neural network models for object detection and image classification are vulnerable to physically realizable adversarial perturbations, such as patch attacks. Existing defenses have focused, implicitly or explicitly, on single-patch attacks, leaving their sensitivity to the number of patches as an open question or rendering them computationally infeasible or inefficient against attacks consisting of multiple patches in the worst cases. In this work, we propose SpaNN, an attack detector whose computational complexity is independent of the expected number of adversarial patches. The key novelty of the proposed detector is that it builds an ensemble of binarized feature maps by applying a set of saliency thresholds to the neural activations of the first convolutional layer of the victim model. It then performs clustering on the ensemble and uses the cluster features as the input to a classifier for attack detection. Contrary to existing detectors, SpaNN does not rely on a fixed saliency threshold for identifying adversarial regions, which makes it robust against white box adversarial attacks. We evaluate SpaNN on four widely used data sets for object detection and classification, and our results show that SpaNN outperforms state-of-the-art defenses by up to 11 and 27 percentage points in the case of object detection and the case of image classification, respectively. Our code is available at this https URL.
zh
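The detector front end described above can be sketched directly. This is an assumption-level toy (thresholds, clustering parameters, and cluster features are ours): binarize first-layer activations at several saliency levels, keep pixels that are salient at multiple levels, cluster them, and feed simple cluster statistics to a downstream classifier.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_features(act, thresholds=(0.5, 1.0, 1.5)):
    """act: (H, W) channel-mean of first conv-layer activations."""
    votes = sum((act > t).astype(int) for t in thresholds)  # ensemble of maps
    ys, xs = np.nonzero(votes >= 2)          # salient at >= 2 threshold levels
    if len(xs) == 0:
        return np.zeros(3)
    labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(np.c_[ys, xs])
    sizes = [np.sum(labels == k) for k in set(labels) if k != -1]
    # e.g., cluster count, largest cluster, salient mass -> classifier input
    return np.array([len(sizes), max(sizes, default=0), len(xs)], dtype=float)

act = np.zeros((32, 32))
act[4:9, 4:9] = 2.0           # two compact high-saliency blobs, as a
act[20:24, 22:26] = 2.0       # two-patch attack might produce
print(cluster_features(act))  # [2., 25., 41.]
```

Because the whole ensemble is produced in one pass over fixed thresholds, the cost does not grow with the number of patches, which is the property the abstract emphasizes.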
[CV-41] Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing ICML2025
【速读】: This paper addresses the scarcity of labeled data and the abundance of unlabeled data in remote sensing time series by using contrastive self-supervised pretraining to exploit large quantities of unlabeled Satellite Image Time Series (SITS). The key to the solution is a resampling-based augmentation strategy that generates positive pairs by upsampling a time series and extracting disjoint subsequences while preserving temporal coverage, which effectively improves contrastive learning.
链接: https://arxiv.org/abs/2506.18587
作者: Antoine Saget,Baptiste Lafabregue,Antoine Cornuéjols,Pierre Gançarski
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures, accepted at 42nd International Conference on Machine Learning (ICML 2025) Terrabytes workshop
Abstract:Given the abundance of unlabeled Satellite Image Time Series (SITS) and the scarcity of labeled data, contrastive self-supervised pretraining emerges as a natural tool to leverage this vast quantity of unlabeled data. However, designing effective data augmentations for contrastive learning remains challenging for time series. We introduce a novel resampling-based augmentation strategy that generates positive pairs by upsampling time series and extracting disjoint subsequences while preserving temporal coverage. We validate our approach on multiple agricultural classification benchmarks using Sentinel-2 imagery, showing that it outperforms common alternatives such as jittering, resizing, and masking. Further, we achieve state-of-the-art performance on the S2-Agri100 dataset without employing spatial information or temporal encodings, surpassing more complex masked-based SSL frameworks. Our method offers a simple, yet effective, contrastive learning augmentation for remote sensing time series.
zh
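The augmentation itself is easy to state in code. The sketch below is our reading of the description above: linearly upsample the series, then split the upsampled time steps into two disjoint index sets that each still span the full season, yielding a positive pair.

```python
import numpy as np

def resample_positive_pair(series, factor=2, rng=None):
    """series: (T, C) one pixel's SITS -> two (T, C) augmented views."""
    rng = rng or np.random.default_rng()
    T, C = series.shape
    grid = np.linspace(0, T - 1, num=factor * T)           # upsampled time grid
    up = np.stack([np.interp(grid, np.arange(T), series[:, c])
                   for c in range(C)], axis=1)             # (factor*T, C)
    idx = rng.permutation(factor * T)                      # split indices into
    a, b = np.sort(idx[:T]), np.sort(idx[T:2 * T])         # two disjoint sets
    return up[a], up[b]                                    # coverage preserved

sits = np.random.rand(36, 10)     # e.g., 36 Sentinel-2 dates, 10 bands
v1, v2 = resample_positive_pair(sits)
print(v1.shape, v2.shape)         # (36, 10) (36, 10)
```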
[CV-42] 2D Triangle Splatting for Direct Differentiable Mesh Training
【速读】: This paper addresses the shortcomings of differentiable rendering with 3D Gaussian primitives relative to mesh-based models in rendering speed and advanced rendering effects such as relighting and shadow rendering. The key to the solution is 2D Triangle Splatting (2DTS), which replaces 3D Gaussian primitives with 2D triangle facelets; this representation naturally forms a discrete mesh-like structure while retaining the benefits of continuous volumetric modeling, and introducing a compactness parameter into the triangle primitives enables direct training of photorealistic meshes.
链接: https://arxiv.org/abs/2506.18575
作者: Kaifeng Sheng,Zheng Zhou,Yingliang Peng,Qianwei Wang
机构: Alibaba Inc. (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures
Abstract:Differentiable rendering with 3D Gaussian primitives has emerged as a powerful method for reconstructing high-fidelity 3D scenes from multi-view images. While it offers improvements over NeRF-based methods, this representation still encounters challenges with rendering speed and advanced rendering effects, such as relighting and shadow rendering, compared to mesh-based models. In this paper, we propose 2D Triangle Splatting (2DTS), a novel method that replaces 3D Gaussian primitives with 2D triangle facelets. This representation naturally forms a discrete mesh-like structure while retaining the benefits of continuous volumetric modeling. By incorporating a compactness parameter into the triangle primitives, we enable direct training of photorealistic meshes. Our experimental results demonstrate that our triangle-based method, in its vanilla version (without compactness tuning), achieves higher fidelity compared to state-of-the-art Gaussian-based methods. Furthermore, our approach produces reconstructed meshes with superior visual quality compared to existing mesh reconstruction methods.
zh
[CV-43] VisualChef: Generating Visual Aids in Cooking via Mask Inpainting
【速读】: This paper addresses the lack of consistent visual guidance during cooking, particularly for visualizing action execution and changes in object appearance. Existing methods rely on detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and extra annotations, making the process complex and cumbersome. The key to the solution is VisualChef, which simplifies alignment through mask-based visual grounding: it identifies and classifies action-relevant objects, enabling targeted modifications that reflect the intended action and outcome while keeping the environment consistent. In addition, an automated pipeline extracts high-quality initial, action, and final state frames.
链接: https://arxiv.org/abs/2506.18569
作者: Oleh Kuzyk,Zuoyue Li,Marc Pollefeys,Xi Wang
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action’s execution and the resulting appearance of the object, while preserving the initial frame’s environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.
zh
[CV-44] VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning
【速读】: This paper addresses the challenges of quality assessment for AI-generated content (AIGC) videos, including limited generalization, lack of temporal awareness, heavy reliance on large-scale annotated data, and insufficient effective interaction with generation models. The key to the solution is VQ-Insight, a novel reasoning-style vision-language model (VLM) framework that improves both generalization and specialization in video quality assessment through a progressive video quality learning scheme and multi-dimension scoring rewards.
链接: https://arxiv.org/abs/2506.18564
作者: Xuanyu Zhang,Weiqi Li,Shijie Zhao,Junlin Li,Li Zhang,Jian Zhang
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
Abstract:Recent advances in AI-generated content (AIGC) have led to the emergence of powerful text-to-video generation models. Despite these successes, evaluating the quality of AIGC-generated videos remains challenging due to limited generalization, lack of temporal awareness, heavy reliance on large-scale annotated datasets, and the lack of effective interaction with generation models. Most current approaches rely on supervised finetuning of vision-language models (VLMs), which often require large-scale annotated datasets and tend to decouple understanding and generation. To address these shortcomings, we propose VQ-Insight, a novel reasoning-style VLM framework for AIGC video quality assessment. Our approach features: (1) a progressive video quality learning scheme that combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model; (2) the design of multi-dimension scoring rewards, preference comparison rewards, and temporal modeling rewards to enhance both generalization and specialization in video quality evaluation. Extensive experiments demonstrate that VQ-Insight consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring, bringing significant improvements for video generation tasks.
zh
[CV-45] Object-aware Sound Source Localization via Audio-Visual Scene Understanding CVPR2025
【速读】: This paper addresses the difficulty of accurately localizing sound-making objects in complex scenes for audio-visual sound source localization, especially when visually similar silent objects coexist. Existing methods rely mainly on simple audio-visual correspondence and fail to capture fine-grained semantic differences between sound-making and silent objects. The key to the solution is to leverage Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes sound-making foreground objects from silent background objects, and to introduce two novel loss functions, the Object-aware Contrastive Alignment (OCA) loss and the Object Region Isolation (ORI) loss, to integrate this detailed information effectively.
链接: https://arxiv.org/abs/2506.18557
作者: Sung Jin Um,Dongjin Kim,Sangmin Lee,Jung Uk Kim
机构: Kyung Hee University (庆熙大学); KAIST AI (KAIST人工智能); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
Abstract:Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, significantly outperforming existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: this https URL.
zh
[CV-46] Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection
【速读】: This paper tackles the difficulty of detecting logical anomalies in unsupervised anomaly detection: compared with structural anomalies, the local features of logical anomalies often resemble normal semantics while their global semantics deviate significantly from normal patterns, so existing encoder-decoder methods struggle to suppress their propagation. The key to the solution is a normality-prior-guided multi-semantic fusion network that extracts abstract global semantics of normal samples with a pretrained vision-language network, constructs learnable semantic codebooks that store representative feature vectors of normal samples via vector quantization, and fuses these multi-semantic features as the decoder input to guide the reconstruction of anomalies toward normality.
链接: https://arxiv.org/abs/2506.18544
作者: Muhao Xu,Xueying Zhou,Xizhan Gao,Weiye Song,Guang Feng,Sijie Niu
机构: Shandong Key Laboratory of Ubiquitous Intelligent Computing, University of Jinan; Department of Mechanical Engineering, Shandong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, detecting logical anomalies is becoming a more challenging task compared to detecting structural ones. Existing encoder-decoder-based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty because, while their local features often resemble normal semantics, their global semantics deviate significantly from normal patterns. Thanks to the generalisation capabilities inherent in neural networks, these abnormal semantic features can propagate through low-dimensional bottlenecks. This ultimately allows the decoder to reconstruct anomalous images with misleading fidelity. To tackle the above challenge, we propose a novel normality prior guided multi-semantic fusion network for unsupervised anomaly detection. Instead of feeding the compressed bottlenecks to the decoder directly, we introduce the multi-semantic features of normal samples into the reconstruction process. To this end, we first extract abstract global semantics of normal cases by a pre-trained vision-language network, then the learnable semantic codebooks are constructed to store representative feature vectors of normal samples by vector quantisation. Finally, the above multi-semantic features are fused and employed as input to the decoder to guide the reconstruction of anomalies to approximate normality. Extensive experiments are conducted to validate the effectiveness of our proposed method, and it achieves the SOTA performance on the MVTec LOCO AD dataset with improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC. The source code is available at this https URL.
zh
[CV-47] Geometry-aware Distance Measure for Diverse Hierarchical Structures in Hyperbolic Spaces
【速读】: This paper addresses the problem that the fixed distance measures used in conventional hyperbolic learning cannot adapt to the diverse hierarchical structures found in real-world data, limiting the model's ability to capture complex data. The key to the solution is a geometry-aware distance measure in hyperbolic spaces that generates tailored projections and curvatures for each pair of data points, dynamically adapting to varying hierarchical structures and effectively mapping the data into an appropriate hyperbolic space.
链接: https://arxiv.org/abs/2506.18533
作者: Pengxiang Li,Yuwei Wu,Zhi Gao,Xiaomeng Fan,Wei Wu,Zhipeng Lu,Yunde Jia,Mehrtash Harandi
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages
Abstract:Learning in hyperbolic spaces has attracted increasing attention due to its superior ability to model hierarchical structures of data. Most existing hyperbolic learning methods use fixed distance measures for all data, assuming a uniform hierarchy across all data points. However, real-world hierarchical structures exhibit significant diversity, making this assumption overly restrictive. In this paper, we propose a geometry-aware distance measure in hyperbolic spaces, which dynamically adapts to varying hierarchical structures. Our approach derives the distance measure by generating tailored projections and curvatures for each pair of data points, effectively mapping them to an appropriate hyperbolic space. We introduce a revised low-rank decomposition scheme and a hard-pair mining mechanism to mitigate the computational cost of pair-wise distance computation without compromising accuracy. We present an upper bound on the low-rank approximation error using Talagrand’s concentration inequality, ensuring theoretical robustness. Extensive experiments on standard image classification (MNIST, CIFAR-10 and CIFAR-100), hierarchical classification (5-level CIFAR-100), and few-shot learning tasks (mini-ImageNet, tiered-ImageNet) demonstrate the effectiveness of our method. Our approach consistently outperforms learning methods that use fixed distance measures, with notable improvements on few-shot learning tasks, where it achieves over 5% gains on mini-ImageNet. The results reveal that adaptive distance measures better capture diverse hierarchical structures, with visualization showing clearer class boundaries and improved prototype separation in hyperbolic spaces.
zh
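A worked sketch helps make "tailored projections and curvatures per pair" concrete. In the Poincare ball of curvature -c, the geodesic distance is d_c(x, y) = (2/sqrt(c)) * atanh(sqrt(c) * ||(-x) (+)_c y||) with Mobius addition (+)_c; below, a small network predicts c for each pair. How the paper actually produces projections and curvatures is not specified above, so the pairwise MLP is an assumption.

```python
import torch

def mobius_add(x, y, c):
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def poincare_dist(x, y, c):
    """Geodesic distance in a Poincare ball of curvature -c (c > 0)."""
    sc = c.sqrt()
    norm = mobius_add(-x, y, c).norm(dim=-1, keepdim=True)
    arg = (sc * norm).clamp(max=1 - 1e-5)      # keep atanh finite
    return (2.0 / sc) * torch.atanh(arg)

# Pair-dependent curvature: a tiny network maps a point pair to c > 0.
curv_net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                               torch.nn.Linear(16, 1), torch.nn.Softplus())
x, y = torch.rand(5, 4) * 0.4, torch.rand(5, 4) * 0.4  # points near the origin
c = curv_net(torch.cat([x, y], dim=-1)) + 1e-3          # (5, 1), positive
print(poincare_dist(x, y, c).squeeze(-1))
```

Larger c bends the space more strongly, so pairs drawn from deep hierarchies can be compared under higher curvature than pairs from flatter regions of the data.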
[CV-48] A Set-to-Set Distance Measure in Hyperbolic Space
【速读】: This paper addresses the difficulty of comparing the similarity of sets in hyperbolic space, particularly in the many practical applications where both the local and global structures of the sets carry crucial semantic information. The key to the solution is the Hyperbolic Set-to-Set Distance (HS2SD) measure, which combines global structure (geodesic distances between the Einstein midpoints of hyperbolic sets) with local structure (topological characteristics of the two sets) to characterize inter-set relationships more finely. To compute topological differences efficiently, the authors prove that a finite Thue-Morse sequence of degree and adjacency matrices serves as a robust approximation of a set's topological structure.
链接: https://arxiv.org/abs/2506.18529
作者: Pengxiang Li,Wei Wu,Zhi Gao,Xiaomeng Fan,Peilin Yu,Yuwei Wu,Zhipeng Lu,Yunde Jia,Mehrtash Harandi
机构: Beijing Key Laboratory of Intelligent Information Technology; School of Computer Science & Technology, Beijing Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; Department of Electrical and Computer System Engineering, Monash University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 24 pages
Abstract:We propose a hyperbolic set-to-set distance measure for computing dissimilarity between sets in hyperbolic space. While point-to-point distances in hyperbolic space effectively capture hierarchical relationships between data points, many real-world applications require comparing sets of hyperbolic data points, where the local structure and the global structure of the sets carry crucial semantic information. The proposed hyperbolic set-to-set distance measure (HS2SD) integrates both global and local structural information: global structure through geodesic distances between Einstein midpoints of hyperbolic sets, and local structure through topological characteristics of the two sets. To efficiently compute topological differences, we prove that a finite Thue-Morse sequence of degree and adjacency matrices can serve as a robust approximation to capture the topological structure of a set. By considering these topological differences, HS2SD provides a more nuanced understanding of the relationships between two hyperbolic sets. Empirical evaluation on entity matching, standard image classification, and few-shot image classification demonstrates that our distance measure outperforms existing methods by effectively modeling the hierarchical and complex relationships inherent in hyperbolic sets.
zh
[CV-49] Auto-Regressively Generating Multi-View Consistent Images
【速读】: This paper addresses the problem of generating consistent multi-view images from human instructions, whose main challenges are maintaining consistency across views and effectively synthesizing shapes and textures under diverse conditions. The key to the solution is the Multi-View Auto-Regressive (MV-AR) method, which uses an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts, exploiting next-token prediction to improve multi-view synthesis, handling various input conditions through condition injection modules and a progressive training strategy, and mitigating overfitting caused by limited data with a "Shuffle View" data augmentation technique.
链接: https://arxiv.org/abs/2506.18527
作者: JiaKui Hu,Yuxiao Yang,Jialun Liu,Jinbo Wu,Chen Zhao,Yanye Lu
机构: Baidu VIS; Peking University; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture designing and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the “Shuffle View” data augmentation technique, thus significantly expanding the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of our MV-AR, which consistently generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at this https URL.
zh
[CV-50] Multi-Scale Representation of Follicular Lymphoma Pathology Images in a Single Hyperbolic Space
【速读】: This paper addresses the unified representation of malignant lymphoma pathology images, from high-resolution cell nuclei to low-resolution tissue images, aiming to model morphological changes across scales in a single hyperbolic space via self-supervised learning. The key to the solution is to embed tissue images and their corresponding nucleus images close to each other based on inclusion relationships, using the Poincaré ball as the feature space so that the hierarchical structure is effectively encoded and the learned representations capture both disease state and cell type variations.
链接: https://arxiv.org/abs/2506.18523
作者: Kei Taguchi,Kazumasa Ohara,Tatsuya Yokota,Hiroaki Miyoshi,Noriaki Hashimoto,Ichiro Takeuchi,Hidekata Hontani
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures
Abstract:We propose a method for representing malignant lymphoma pathology images, from high-resolution cell nuclei to low-resolution tissue images, within a single hyperbolic space using self-supervised learning. To capture morphological changes that occur across scales during disease progression, our approach embeds tissue and corresponding nucleus images close to each other based on inclusion relationships. Using the Poincaré ball as the feature space enables effective encoding of this hierarchical structure. The learned representations capture both disease state and cell type variations.
zh
[CV-51] Enhancing Image Restoration Transformer via Adaptive Translation Equivariance
【速读】: This paper addresses the lack of translation equivariance in attention-based transformers for modern image restoration, a deficiency that harms both training convergence and generalization. The key to the solution is two strategies for incorporating translation equivariance: slide indexing, which maintains operator responses at fixed positions (sliding window attention being a notable example), and component stacking, which arranges translation-equivariant operators in parallel or sequentially to build complex architectures while preserving translation equivariance. To further balance computational efficiency against receptive field, an adaptive sliding indexing mechanism is designed to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs.
链接: https://arxiv.org/abs/2506.18520
作者: JiaKui Hu,Zhengjian Yao,Lujia Jin,Hangzhou He,Yanye Lu
机构: Peking University Health Science Center (北京大学医学部); Peking University (北京大学); College of Future Technology (未来技术学院); National Biomedical Imaging Center (国家生物医学成像中心); China Mobile Research Institute (中国移动研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
zh
[CV-52] MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
【速读】: This paper addresses how to effectively exploit heterogeneous multimodal medical data for accurate and interpretable multi-disease diagnosis; current approaches often rely on single-modality data, limiting comprehensive understanding of complex diseases. The key to the solution is MedTVT-R1, a novel multimodal large language model (MLLM) framework that introduces a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions, combined with Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning using a Jaccard reward function to strengthen diagnostic reasoning.
链接: https://arxiv.org/abs/2506.18512
作者: Yuting Zhang,Kaishen Yuan,Hao Lu,Yutao Yue,Jintai Chen,Kaishun Wu
机构: The Hong Kong University of Science & Technology (Guangzhou)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1’s superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at this https URL.
zh
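The Jaccard reward is simple enough to state exactly in its generic form; how MedTVT-R1 parses diagnoses out of generated text is not shown above, so treat the set extraction as an assumed preprocessing step.

```python
def jaccard_reward(predicted: set, truth: set) -> float:
    """Overlap between predicted and ground-truth disease sets, in [0, 1]."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

pred = {"heart failure", "anemia"}
gold = {"heart failure", "chronic kidney disease"}
print(jaccard_reward(pred, gold))  # 1/3 ~= 0.333
```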
[CV-53] Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey
【速读】: This paper addresses the performance degradation of vision-language models (VLMs) on domain-specific or specialized generalization tasks, focusing on how to effectively transfer or generalize the rich knowledge embedded in VLMs to various downstream applications. The key lies in adapting VLMs through different transfer modules, categorized into prompt-based, parameter-based, and feature-based methods, and in revisiting typical transfer learning (TL) settings to offer a fresh understanding of transfer learning in the VLM era.
链接: https://arxiv.org/abs/2506.18504
作者: Xinyao Li,Jingjing Li,Fengling Li,Lei Zhu,Yang Yang,Heng Tao Shen
机构: University of Electronic Science and Technology of China (电子科技大学); University of Technology Sydney (悉尼科技大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities, resulting in powerful vision-language models (VLMs). Leveraging web-scale pretraining data, these models exhibit strong zero-shot capabilities. However, their performance often deteriorates when confronted with domain-specific or specialized generalization tasks. To address this, a growing body of research focuses on transferring or generalizing the rich knowledge embedded in VLMs to various downstream applications. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking, and results in the VLM literature. Delving into the typical VLM structures, the current literature is categorized into prompt-based, parameter-based, and feature-based methods according to the transferred modules. The differences and characteristics of each category are further summarized and discussed by revisiting the typical transfer learning (TL) settings, providing novel interpretations for TL in the era of VLMs. Popular benchmarks for VLM generalization are further introduced with thorough performance comparisons among the reviewed methods. Following the advances in large-scale generalizable pretraining, this survey also discusses the relations and differences between VLMs and up-to-date multimodal large language models (MLLMs), e.g., DeepSeek-VL. By systematically reviewing the surging literature in vision-language research from a novel and practical generalization perspective, this survey contributes to a clear landscape of current and future multimodal researches.
zh
[CV-54] Biased Teacher, Balanced Student
【速读】: This paper addresses the performance degradation of knowledge distillation (KD) under long-tailed data distributions, particularly when the teacher model is biased toward head classes and provides little effective guidance for tail classes. The key to the solution is to decompose the standard KD objective into inter-group and intra-group Kullback-Leibler (KL) divergences, corresponding to the prediction distributions across and within class groups (head, medium, tail), which quantifies the sources of teacher bias, and then to introduce a rebalanced inter-group loss together with a uniform intra-group loss to improve knowledge transfer under class imbalance.
链接: https://arxiv.org/abs/2506.18496
作者: Seonghak Kim
机构: Agency for Defense Development (国防发展局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures. This work has been submitted to the IEEE for possible publication
Abstract:Knowledge Distillation (KD) is a widely adopted model compression technique where a compact student model learns from the output of a larger, pre-trained teacher. While effective in balanced settings, conventional KD suffers significantly when applied to long-tailed data distributions, as the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. In this paper, we propose Long-Tailed Knowledge Distillation (LTKD), a novel framework tailored for class-imbalanced scenarios. We begin by reformulating the standard KD objective into two components: inter-group and intra-group Kullback-Leibler (KL) divergence, corresponding to the prediction distributions across and within class groups (head, medium, tail), respectively. This decomposition allows us to identify and quantify the sources of teacher bias. To address them, we introduce (1) a rebalanced inter-group loss that calibrates the teacher’s group-level predictions and (2) a uniform intra-group loss that ensures equal contribution from all groups during distillation. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods, achieving significant gains in both overall accuracy and tail-class performance. Our results demonstrate that LTKD enables effective knowledge transfer even from biased teachers, making it a strong candidate for real-world deployment in resource-constrained and imbalanced settings.
zh
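The decomposition above follows the chain rule for KL divergence: the KL between full class distributions equals the KL between group marginals (inter-group) plus the teacher-group-weighted KL between within-group conditionals (intra-group). A minimal sketch, with an assumed 3/4/3 head/medium/tail split and an illustrative temperature:

```python
import torch
import torch.nn.functional as F

def group_kd_terms(t_logits, s_logits, groups, T=4.0):
    """groups: list of LongTensors holding class indices per group."""
    p_t = F.softmax(t_logits / T, dim=-1)
    p_s = F.softmax(s_logits / T, dim=-1)
    gt = torch.stack([p_t[:, g].sum(-1) for g in groups], dim=-1)  # (B, G)
    gs = torch.stack([p_s[:, g].sum(-1) for g in groups], dim=-1)
    inter = (gt * (gt.log() - gs.log())).sum(-1).mean()
    intra = 0.0
    for i, g in enumerate(groups):            # within-group conditionals
        qt = p_t[:, g] / gt[:, i:i + 1]
        qs = p_s[:, g] / gs[:, i:i + 1]
        intra = intra + (gt[:, i] * (qt * (qt.log() - qs.log())).sum(-1)).mean()
    return inter, intra

groups = [torch.arange(0, 3), torch.arange(3, 7), torch.arange(7, 10)]
t, s = torch.randn(4, 10), torch.randn(4, 10)
inter, intra = group_kd_terms(t, s, groups)
print(inter.item(), intra.item())
```

LTKD's rebalanced and uniform losses would then reweight these two terms rather than distilling the teacher's raw distribution as-is.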
[CV-55] ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation
【速读】: This paper addresses image customization in controllable image generation: in single-concept generation, preserving identity while staying aligned with the prompt is challenging, and in multi-concept scenarios, relying on a single prompt without additional conditions (such as layout boxes or semantic masks) leads to identity loss and concept omission. The key to the solution is the ShowFlow framework: ShowFlow-S improves single-concept generation with a KronA-WED adapter that integrates a Kronecker adapter with weight and embedding decomposition, together with a disentangled learning approach and a novel attention regularization objective; ShowFlow-M directly reuses the model learned by ShowFlow-S to support multi-concept generation without extra conditions, incorporating Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as plug-and-play modules.
链接: https://arxiv.org/abs/2506.18493
作者: Trong-Vu Hoang,Quang-Binh Nguyen,Thanh-Toan Do,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
机构: HCMUS(胡志明市科技大学); Monash University (莫纳什大学); University of Dayton (亚当斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as the plug-and-play module. Extensive experiments and user studies validate ShowFlow’s effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.
zh
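A Kronecker adapter of the kind named above can be sketched in a few lines; the weight-and-embedding decomposition that distinguishes KronA-WED is omitted here, so this is only the KronA core under our assumptions.

```python
import torch
import torch.nn as nn

class KronAdapter(nn.Module):
    def __init__(self, base: nn.Linear, a_shape=(8, 8)):
        super().__init__()
        out_f, in_f = base.weight.shape
        ao, ai = a_shape
        assert out_f % ao == 0 and in_f % ai == 0
        self.base = base.requires_grad_(False)      # frozen backbone layer
        self.A = nn.Parameter(torch.zeros(ao, ai))  # zero init: starts as no-op
        self.B = nn.Parameter(torch.randn(out_f // ao, in_f // ai) * 0.02)

    def forward(self, x):
        delta = torch.kron(self.A, self.B)          # (out_f, in_f) update
        return self.base(x) + x @ delta.T

layer = KronAdapter(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)              # torch.Size([2, 64])
```

The Kronecker factorization stores ao*ai + (out_f/ao)*(in_f/ai) parameters instead of out_f*in_f, yet the update's rank can reach rank(A)*rank(B), which is the usual argument for KronA over plain low-rank adapters.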
[CV-56] GANs vs. Diffusion Models for virtual staining with the HER2match dataset
【速读】: This paper addresses the challenges of HE-to-HER2 stain transfer in virtual staining, where the lack of sufficient public datasets has hindered progress despite a rising number of publications. The key to the solution is the introduction of HER2match, the first publicly available dataset containing the same breast cancer tissue sections stained with both HE and HER2, providing a high-quality basis for model training and evaluation. The paper also compares the performance of several generative adversarial networks (GANs) and diffusion models (DMs) and implements a novel Brownian Bridge Diffusion Model (BBDM) for HE-HER2 translation, with results highlighting the important role of data alignment in improving model performance.
链接: https://arxiv.org/abs/2506.18484
作者: Pascal Klöckner,José Teixeira,Diana Montezuma,Jaime S. Cardoso,Hugo M. Horlings,Sara P. Oliveira
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Virtual staining is a promising technique that uses deep generative models to recreate histological stains, providing a faster and more cost-effective alternative to traditional tissue chemical staining. Specifically for HE-HER2 staining transfer, despite a rising trend in publications, the lack of sufficient public datasets has hindered progress in the topic. Additionally, it is currently unclear which model frameworks perform best for this particular task. In this paper, we introduce the HER2match dataset, the first publicly available dataset with the same breast cancer tissue sections stained with both HE and HER2. Furthermore, we compare the performance of several Generative Adversarial Networks (GANs) and Diffusion Models (DMs), and implement a novel Brownian Bridge Diffusion Model for HE-HER2 translation. Our findings indicate that, overall, GANs perform better than DMs, with only the BBDM achieving comparable results. Furthermore, we emphasize the importance of data alignment, as all models trained on HER2match produced vastly improved visuals compared to the widely used consecutive-slide BCI dataset. This research provides a new high-quality dataset ([available upon publication acceptance]), improving both model training and evaluation. In addition, our comparison of frameworks offers valuable guidance for researchers working on the topic.
zh
[CV-57] Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding ICME2025
【速读】:该论文试图解决半监督视频段落定位(Semi-Supervised Video Paragraph Grounding, SSVPG)问题,即在仅有少量时间标注的情况下,从未剪辑的视频中定位一段文本中的多个句子。现有方法主要依赖教师-学生一致性学习和视频级对比损失,但忽略了通过扰动查询上下文生成强监督信号的重要性。该论文提出的解决方案的关键在于引入一种新的上下文一致性学习(Context Consistency Learning, CCL)框架,该框架统一了一致性正则化和伪标签的范式,通过让学生模型在移除句子的强增强样本上学习教师模型提供的强监督信号,并利用原始视图与增强视图预测之间的相互一致性作为标签置信度进行模型重训练,从而提升半监督学习的效果。
链接: https://arxiv.org/abs/2506.18476
作者: Yaokun Zhong,Siyu Jiang,Jian Zhu,Jian-Fang Hu
机构: Sun Yat-sen University (中山大学); Guangdong Province Key Laboratory of Information Security Technology (广东省信息安全技术重点实验室); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education (教育部机器智能与先进计算重点实验室); Guangdong University of Foreign Studies (广东外语外贸大学); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME2025
Abstract:Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context Consistency Learning (CCL) framework that unifies the paradigms of consistency regularization and pseudo-labeling to enhance semi-supervised learning. Specifically, we first conduct teacher-student learning where the student model takes as inputs strongly-augmented samples with sentences removed and is enforced to learn from the adequately strong supervisory signals from the teacher model. Afterward, we conduct model retraining based on the generated pseudo labels, where the mutual agreement between the original and augmented views’ predictions is utilized as the label confidence. Extensive experiments show that CCL outperforms existing methods by a large margin.
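CCL 的强增强核心是"移除句子"来扰动查询上下文,再让学生模型对齐教师模型的预测。以下是一个非常简化的示意:句子以特征向量表示,被移除的句子此处用置零代替,一致性损失取 MSE;这些具体选择均为假设,仅用于说明思路,并非论文实现。

```python
import random
import torch
import torch.nn.functional as F

def remove_sentences(sent_feats, drop_ratio=0.3):
    """强增强示意: 随机移除段落中的部分句子(此处以置零代替),扰动查询上下文。
    sent_feats: (N, D) 的句子特征。"""
    n = sent_feats.size(0)
    drop = random.sample(range(n), max(1, int(n * drop_ratio)))
    aug = sent_feats.clone()
    aug[drop] = 0.0
    return aug

def consistency_loss(student_pred, teacher_pred):
    """学生在强增强输入上的预测,向教师在原始输入上的预测对齐。"""
    return F.mse_loss(student_pred, teacher_pred.detach())
```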
zh
[CV-58] AViLA: Asynchronous Vision-Language Agent for Streaming Multimodal Data Interaction
【速读】:该论文旨在解决视觉-语言代理在处理流数据时面临的查询与证据异步性(Query-Evidence Asynchrony)问题,即用户查询与其支持证据通常在流式环境中异步到达,代理需要基于历史数据、当前观察和未来数据进行响应。解决方案的关键在于提出AViLA,一个针对流数据交互的异步视频-语言代理,其核心包含三个模块:全面的记忆保持、证据识别和基于证据的触发机制,以实现对查询的及时且时间感知的响应。
链接: https://arxiv.org/abs/2506.18472
作者: Gengyuan Zhang,Tanveer Hannan,Hermine Kleiner,Beste Aydemir,Xinyu Xie,Jian Lan,Thomas Seidl,Volker Tresp,Jindong Gu
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint version; 23 pages (including references and appendix)
Abstract:An ideal vision-language agent serves as a bridge between the human users and their surrounding physical world in real-world applications like autonomous driving and embodied agents, and proactively provides accurate and timely responses given user intents. An intriguing challenge arises when agents interact with the world as a dynamic data stream and ad-hoc queries from users: supporting knowledge for queries, namely evidence, usually appears asynchronously with the arrival time of queries, and agents need to ground their responses in historical data, present observations, and even future streams. We frame this challenge as Query-Evidence Asynchrony, where user queries and their supporting evidence typically arrive asynchronously in the streaming setting. This setting requires not only strong reasoning capabilities but also the ability to retain past observations and respond to queries with temporal awareness. In this paper, we introduce a diagnostic benchmark that evaluates Multimodal Large Language Models (MLLMs) on their ability to handle interaction with streaming data. Further, we present AViLA, Asynchronous Video-Language Agent for streaming data interaction that can handle ad-hoc queries and give time-aware responses. For this purpose, AViLA consists of three key modules: comprehensive memory retention, evidence identification, and evidence-grounded trigger, that are designed to maintain a general-purpose memory and respond readily and timely to queries. Our experiments show that existing models often fail to respond at appropriate times, while AViLA significantly improves both accuracy and temporal awareness. Our code and dataset will be publicly available.
zh
[CV-59] DIP: Unsupervised Dense In-Context Post-training of Visual Representations
【速读】:该论文旨在解决大规模预训练视觉编码器在上下文场景理解任务中密集图像表示不足的问题。其解决方案的关键在于提出一种名为DIP的新型无监督后训练方法,该方法通过伪任务显式模拟下游上下文场景,借鉴元学习原理来训练视觉编码器,从而提升其密集表示能力。DIP利用预训练扩散模型和视觉编码器自身结合生成上下文任务,实现了无需标注数据的高效后训练,具有简单、无监督和计算效率高的特点。
链接: https://arxiv.org/abs/2506.18463
作者: Sophia Sirko-Galouchenko,Spyros Gidaris,Antonin Vobecky,Andrei Bursuc,Nicolas Thome
机构: Valeo.ai; Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; FEE CTU; CIIRC CTU Prague
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: this https URL
zh
[CV-60] Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation
【速读】:该论文旨在解决在高度动态场景下,对于敏捷机器人(如特技飞行器)实现可靠自我运动速度估计的问题。传统机器人传感器难以及时且清晰地响应高度动态的运动,常导致测量模糊、失真和延迟。论文提出的解决方案关键在于采用不依赖惯性测量单元(IMU)且无需特征关联的框架,结合事件相机和毫米波雷达两种外源传感器,直接从瞬时原始事件和多普勒测量中推导出旋转和线性速度,并通过连续时间状态空间模型融合基于时间与事件的测量数据,以固定滞后平滑方式估计自我运动速度,从而提高了在无纹理、无结构环境中的鲁棒性和计算效率。
链接: https://arxiv.org/abs/2506.18443
作者: Yang Lyu,Zhenghao Zou,Yanfeng Li,Chunhui Zhao,Quan Pan
机构: Northwestern Polytechnical University (西北工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Achieving reliable ego motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond timely and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors, an event camera and a millimeter wave radar. First, we used instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lagged smoother fashion. In the end, we validate our velometer framework extensively in self-collected experiment datasets. The results indicate that our IMU-free and association-free ego motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video and dataset are available at this https URL.
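由多普勒测量直接解算平移速度是毫米波雷达自运动估计中的经典做法:静止目标的径向速度满足 r_i = -d_i·v。下面给出该最小二乘解法的一个通用示意(非论文源码,基于事件相机的旋转速度部分未包含)。

```python
import numpy as np

def ego_velocity_from_doppler(directions, radial_speeds):
    """由雷达点的单位方向 d_i 与径向速度 r_i 求解传感器自身速度 v:
    静止目标满足 r_i = -d_i · v,最小二乘解 v = argmin ||D v + r||^2。"""
    D = np.asarray(directions)          # (N, 3) 单位方向向量
    r = np.asarray(radial_speeds)       # (N,) 各点的多普勒径向速度
    v, *_ = np.linalg.lstsq(D, -r, rcond=None)
    return v

dirs = np.random.randn(100, 3)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
v_true = np.array([2.0, 0.5, 0.0])
doppler = -dirs @ v_true                # 模拟静止场景下的多普勒测量
print(ego_velocity_from_doppler(dirs, doppler))   # 约等于 [2.0, 0.5, 0.0]
```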
zh
[CV-61] CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing
【速读】:该论文试图解决在文本到图像扩散模型中使用文本描述编辑自然图像的挑战,特别是在保持生成一致性以及处理复杂非刚性物体方面的问题。现有方法在保留纹理和身份、减少微调需求以及在编辑特定空间区域或对象的同时保持背景细节方面存在局限性。解决方案的关键在于提出一种名为Context-Preserving Adaptive Manipulation (CPAM) 的零样本框架,其核心是通过保留适应模块调整自注意力机制,以有效保留并独立控制对象和背景,同时利用掩码引导技术确保对象形状、纹理和身份在编辑过程中不被破坏,背景也保持不变。此外,还引入了局部提取模块和多种掩码引导策略,以提升图像编辑的灵活性和效果。
链接: https://arxiv.org/abs/2506.18438
作者: Dinh-Khoi Vo,Thanh-Toan Do,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
机构: HCMUS (Ho Chi Minh City University of Science); Monash University (莫纳什大学); University of Dayton (代顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects’ shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.
zh
[CV-62] Frequency-Domain Fusion Transformer for Image Inpainting
【速读】:该论文旨在解决传统图像修复方法在处理复杂纹理和大范围遮挡时效果不佳,以及基于Transformer的方法因自注意力机制的低通特性导致高频细节丢失和计算成本高的问题。其解决方案的关键在于引入一种结合小波变换与Gabor滤波的注意力机制,以增强多尺度结构建模和细节保留能力,并设计一种基于快速傅里叶变换的可学习频域滤波器,以替代前馈网络实现自适应噪声抑制和细节保持。
链接: https://arxiv.org/abs/2506.18437
作者: Sijin He,Guangfeng Lin,Tao Li,Yajun Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image inpainting plays a vital role in restoring missing image regions and supporting high-level vision tasks, but traditional methods struggle with complex textures and large occlusions. Although Transformer-based approaches have demonstrated strong global modeling capabilities, they often fail to preserve high-frequency details due to the low-pass nature of self-attention and suffer from high computational costs. To address these challenges, this paper proposes a Transformer-based image inpainting method incorporating frequency-domain fusion. Specifically, an attention mechanism combining wavelet transform and Gabor filtering is introduced to enhance multi-scale structural modeling and detail preservation. Additionally, a learnable frequency-domain filter based on the fast Fourier transform is designed to replace the feedforward network, enabling adaptive noise suppression and detail retention. The model adopts a four-level encoder-decoder structure and is guided by a novel loss strategy to balance global semantics and fine details. Experimental results demonstrate that the proposed method effectively improves the quality of image inpainting by preserving more high-frequency information.
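文中"以基于快速傅里叶变换的可学习频域滤波器替代前馈网络"的思路,可用下述最小示意理解:特征先做 FFT,与可学习的复数滤波器逐元素相乘,再做逆 FFT。滤波器的形状与初始化为假设,与论文实现不必一致。

```python
import torch
import torch.nn as nn

class LearnableFreqFilter(nn.Module):
    """可学习频域滤波示意: FFT -> 与可学习复数滤波器逐元素相乘 -> 逆 FFT。
    h, w 为特征图尺寸;滤波器实部初始化为 1(近似恒等映射)。"""
    def __init__(self, channels, h, w):
        super().__init__()
        w_half = w // 2 + 1                       # rfft2 输出的最后一维长度
        self.filt = nn.Parameter(
            torch.complex(torch.ones(channels, h, w_half),
                          torch.zeros(channels, h, w_half)))

    def forward(self, x):                          # x: (B, C, H, W)
        X = torch.fft.rfft2(x, norm="ortho")
        X = X * self.filt                          # 自适应增强/抑制各频率分量
        return torch.fft.irfft2(X, s=x.shape[-2:], norm="ortho")

layer = LearnableFreqFilter(16, 32, 32)
print(layer(torch.randn(2, 16, 32, 32)).shape)     # torch.Size([2, 16, 32, 32])
```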
zh
[CV-63] Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging
【速读】:该论文旨在解决在医学影像中利用人工智能(Artificial Intelligence, AI)进行预后预测的挑战,特别是针对新冠患者临床结局预测中模型的可迁移性问题。其解决方案的关键在于构建一个结构化的基准测试平台,用于评估和比较卷积神经网络(Convolutional Neural Networks)和基础模型(Foundation Models)在不同微调策略下的表现,包括传统方法和参数高效微调方法,并在多种学习范式下进行验证,以适应真实临床场景中的数据稀缺和类别不平衡问题。
链接: https://arxiv.org/abs/2506.18434
作者: Filippo Ruffini,Elena Mulero Ayllon,Linlin Shen,Paolo Soda,Valerio Guarrasi
机构: UniCampus(罗马生物医学校园大学); SZU(深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Intelligence (AI) holds significant promise for improving prognosis prediction in medical imaging, yet its effective application remains challenging. In this work, we introduce a structured benchmark explicitly designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models in predicting clinical outcomes in COVID-19 patients, leveraging diverse publicly available Chest X-ray datasets. Our experimental methodology extensively explores a wide set of fine-tuning strategies, encompassing traditional approaches such as Full Fine-Tuning and Linear Probing, as well as advanced Parameter-Efficient Fine-Tuning methods including Low-Rank Adaptation, BitFit, VeRA, and IA3. The evaluations were conducted across multiple learning paradigms, including both extensive full-data scenarios and more clinically realistic Few-Shot Learning settings, which are critical for modeling rare disease outcomes and rapidly emerging health threats. By implementing a large-scale comparative analysis involving a diverse selection of pretrained models, including general-purpose architectures pretrained on large-scale datasets such as CLIP and DINOv2, to biomedical-specific models like MedCLIP, BioMedCLIP, and PubMedCLIP, we rigorously assess each model’s capacity to effectively adapt and generalize to prognosis tasks, particularly under conditions of severe data scarcity and pronounced class imbalance. The benchmark was designed to capture critical conditions common in prognosis tasks, including variations in dataset size and class distribution, providing detailed insights into the strengths and limitations of each fine-tuning strategy. This extensive and structured evaluation aims to inform the practical deployment and adoption of robust, efficient, and generalizable AI-driven solutions in real-world clinical prognosis prediction workflows.
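基准中比较的 LoRA 是最常用的参数高效微调方法之一:冻结原权重,仅训练一个低秩增量。以下是通用的最小 LoRA 线性层示意(秩 r 与缩放 α 为假设取值);BitFit、VeRA、IA3 等方法同理,只训练极少量的附加参数。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA 示意: y = W x + (alpha / r) * B(A x),W 冻结,仅训练 A、B。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # 冻结预训练权重
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)               # 初始增量为 0,从原模型出发
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(128, 128))
print(layer(torch.randn(4, 128)).shape)             # torch.Size([4, 128])
```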
zh
[CV-64] Latent Space Analysis for Melanoma Prevention
【速读】:该论文试图解决黑色素瘤(melanoma)早期诊断中缺乏可解释性工具的问题,当前大多数深度学习模型仅提供二分类输出,难以为临床提供足够的诊断洞察。其解决方案的关键在于引入一种基于条件变分自编码器(Conditional Variational Autoencoder)的新方法,该方法通过学习一个结构化的潜在空间来捕捉病变之间的语义关系,从而实现对形态学差异的连续、细致评估,并结合支持向量机(SVM)有效区分良性痣与黑色素瘤,同时支持通过空间邻近性进行恶性风险的可视化和几何解释。
链接: https://arxiv.org/abs/2506.18414
作者: Ciro Listone,Aniello Murano
机构: University of Naples Federico II (那不勒斯腓特烈二世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, under review
Abstract:Melanoma represents a critical health risk due to its aggressive progression and high mortality, underscoring the need for early, interpretable diagnostic tools. While deep learning has advanced in skin lesion classification, most existing models provide only binary outputs, offering limited clinical insight. This work introduces a novel approach that extends beyond classification, enabling interpretable risk modelling through a Conditional Variational Autoencoder. The proposed method learns a structured latent space that captures semantic relationships among lesions, allowing for a nuanced, continuous assessment of morphological differences. An SVM is also trained on this representation effectively differentiating between benign nevi and melanomas, demonstrating strong and consistent performance. More importantly, the learned latent space supports visual and geometric interpretation of malignancy, with the spatial proximity of a lesion to known melanomas serving as a meaningful indicator of risk. This approach bridges predictive performance with clinical applicability, fostering early detection, highlighting ambiguous cases, and enhancing trust in AI-assisted diagnosis through transparent and interpretable decision-making.
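条件变分自编码器通过"重构误差 + KL 散度"的目标学习结构化潜在空间,这正是本文可解释风险建模的基础。以下给出该目标与重参数化采样的通用示意(网络结构与条件注入方式均为假设,非论文实现)。

```python
import torch
import torch.nn.functional as F

def cvae_loss(x, x_recon, mu, logvar, beta=1.0):
    """CVAE 目标示意: 重构误差 + beta * KL(q(z|x,c) || N(0, I))。
    KL 项把潜在空间整理为平滑、可插值的结构,便于度量病变之间的距离。"""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def reparameterize(mu, logvar):
    """重参数化采样: z = mu + sigma * eps,使采样操作对编码器可导。"""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```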
zh
[CV-65] What You Think Is What You Get: Bridge User Intent and Transfer Function Design through Multimodal Large Language Models
【速读】:该论文旨在解决直接体积渲染(Direct Volume Rendering, DVR)中传递函数(Transfer Function, TF)设计的挑战,特别是由于用户意图与TF参数空间之间的语义鸿沟导致的直观性不足问题。现有方法仍面临探索空间庞大、泛化能力弱两大局限。论文提出的解决方案关键在于引入基于多模态大语言模型(Multi-modal Large Language Models, MLLMs)的“你所想即你所见”(What You Think is What You Get, WYTWYG)框架,通过结合基于进化算法的TF空间探索器和基于MLLM的体积渲染质量评估器,实现更有效且具有泛化能力的TF优化。
链接: https://arxiv.org/abs/2506.18407
作者: Yiyao Wang,Bo Pan,Ke Wang,Han Liu,Jinyuan Mao,Yuxin Liu,Minfeng Zhu,Bo Zhang,Weifeng Chen,Xiuqi Huang,Wei Chen
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Direct volume rendering (DVR) is a fundamental technique for visualizing volumetric data, with transfer functions (TFs) playing a crucial role in extracting meaningful structures. However, designing effective TFs remains unintuitive due to the semantic gap between user intent and TF parameter space. Researchers have developed numerous TF optimization methods to bridge this gap. However, existing methods still face two challenges: large exploration space and weak generalizability. To address these issues, we propose the What You Think is What You Get (WYTWYG) framework, which leverages Multi-modal Large Language Models (MLLMs) to guide the TF optimization based on user intent. Specifically, we first introduce a novel TF optimization approach comprising two core components: (1) an evolution-based explorer for effective exploration of the TF space, and (2) a volume rendering quality evaluator based on MLLMs to provide generalizable visual guidance. We further propose a TF interactive design system based on this approach. We demonstrate the general applicability of our framework through three case studies, and validate the effectiveness of each component through extensive experiments. Our code is available at: this https URL.
zh
[CV-66] Distributed Poisson multi-Bernoulli filtering via generalised covariance intersection
【速读】:该论文旨在解决分布式多目标跟踪中的融合问题,具体是通过改进的广义协方差交集(GCI)融合规则实现多个传感器或滤波器之间的有效信息融合。其解决方案的关键在于对泊松多伯努利(PMB)密度进行合理近似,将PMB密度的幂次近似为非归一化PMB密度,从而得到一个可计算的融合形式,最终生成泊松多伯努利混合(PMBM)模型,该模型在数学上具有闭合形式,并且在预测和更新步骤中能够保持其结构,便于后续融合。
链接: https://arxiv.org/abs/2506.18397
作者: Ángel F. García-Fernández,Giorgio Battistelli
机构: Universidad Politécnica de Madrid (马德里理工大学); Università di Firenze (佛罗伦萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)
备注:
Abstract:This paper presents the distributed Poisson multi-Bernoulli (PMB) filter based on the generalised covariance intersection (GCI) fusion rule for distributed multi-object filtering. Since the exact GCI fusion of two PMB densities is intractable, we derive a principled approximation. Specifically, we approximate the power of a PMB density as an unnormalised PMB density, which corresponds to an upper bound of the PMB density. Then, the GCI fusion rule corresponds to the normalised product of two unnormalised PMB densities. We show that the result is a Poisson multi-Bernoulli mixture (PMBM), which can be expressed in closed form. Future prediction and update steps in each filter preserve the PMBM form, which can be projected back to a PMB density before the next fusion step. Experimental results show the benefits of this approach compared to other distributed multi-object filters.
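GCI 融合规则的标准形式是两份多目标密度的归一化加权几何平均。为便于对照论文的近似思路,这里用 LaTeX 写出(ω 为融合权重,δX 表示集合积分):

```latex
% GCI 融合规则: 两个多目标密度的归一化加权几何平均
f_{\omega}(X) \;=\; \frac{f_{1}(X)^{\omega}\, f_{2}(X)^{1-\omega}}
{\int f_{1}(X)^{\omega}\, f_{2}(X)^{1-\omega}\, \delta X},
\qquad \omega \in (0,1).
% 论文的关键近似: 将 PMB 密度的幂 f(X)^{\omega} 近似为未归一化的 PMB 密度
% (对应原密度的一个上界),使上式的乘积具有闭式的 PMBM 形式。
```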
zh
[CV-67] InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在空间推理能力上的不足,当前公开资源在规模、视觉多样性及指令表达性方面存在局限。其解决方案的关键在于提出InternSpatial数据集及其对应的评估基准InternSpatial-Bench,该数据集包含1200万对问答对,覆盖单视角和多视角设置,并支持19种指令格式,同时引入了新颖的旋转角度预测任务以扩展多视角推理的评估范围。实验结果表明,基于InternSpatial训练的模型在多个基准上均表现出显著提升。
链接: https://arxiv.org/abs/2506.18385
作者: Nianchen Deng,Lixin Gu,Shenglong Ye,Yinan He,Zhe Chen,Songze Li,Haomin Wang,Xingguang Wei,Tianshuo Yang,Min Dou,Tong He,Wenqi Shao,Kaipeng Zhang,Yi Wang,Botian Shi,Yanting Zhang,Jifeng Dai,Yu Qiao,Hongjie Zhang,Wenhai Wang
机构: Shanghai AI Laboratory; The Chinese University of Hong Kong; University of Science and Technology of China; Shanghai Jiao Tong University; Donghua University; Nanjing University; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve 12.1% improvement on InternSpatial-Bench and 10.7% on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.
zh
[CV-68] OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
【速读】:该论文试图解决事件中心的视觉-语言理解问题,旨在推动对复杂现实事件的深度推理能力。与传统图像描述和检索数据集不同,OpenEvents V1通过两个主要任务:生成具有事件感知的丰富图像描述以及根据叙事风格的文本查询检索相关事件图像,强调上下文和时间定位。其关键在于构建一个大规模基准数据集,包含超过20万篇新闻文章和40万张来自CNN和《卫报》的关联图片,为多模态模型提供标准化评估协议和基线结果,从而促进对真实世界事件的深入分析与理解。
链接: https://arxiv.org/abs/2506.18372
作者: Hieu Nguyen,Phuc-Tan Nguyen,Thien-Phuc Tran,Minh-Quang Nguyen,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
机构: University of Science, VNU-HCM, Vietnam; University of Dayton, Ohio, US
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce OpenEvents V1, a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that emphasize surface-level descriptions, OpenEvents V1 focuses on contextual and temporal grounding through two primary tasks: (1) generating rich, event-aware image captions and (2) retrieving event-relevant images based on narrative-style textual queries. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for both tasks. OpenEvents V1 establishes a robust foundation for developing multimodal models capable of deep reasoning over complex real-world events. The dataset is available at this https URL
zh
[CV-69] RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
【速读】:该论文试图解决多模态大语言模型(MLLMs)在生成个性化图像描述时存在的局限性,尤其是在使用高质量描述进行训练后仍难以生成忠实描述的问题。现有基于后训练的MLLM个性化方法在实际场景中表现不佳,如多概念图像描述任务。为应对这一问题,论文提出了一种基于强化学习(RL)的后训练框架,这是首个针对个性化图像描述的RL-based后训练方法。该解决方案的关键在于通过强化学习提升模型的视觉识别和个性化生成能力,从而在复杂任务中取得优于传统监督微调(SFT)基线的效果。
链接: https://arxiv.org/abs/2506.18369
作者: Yeongtak Oh,Jisoo Mok,Dohyun Chung,Juhyeon Shin,Sangha Park,Johan Barthelemy,Sungroh Yoon
机构: Seoul National University (首尔国立大学); NVIDIA (NVIDIA)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task.
zh
[CV-70] Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection
【速读】:该论文旨在解决在安全关键应用中检测异常人类行为的问题,特别是通过分析人体骨骼序列来识别异常姿势。其解决方案的关键在于提出SeeKer方法,该方法通过在关键点层面进行自回归分解来建模骨骼序列的密度,利用条件分布表示给定先前骨骼运动的关键点位置可能性,并将考虑的骨骼联合分布建模为跨关键点的条件高斯因果预测。当骨骼的关键点位置使模型产生意外(即接收低密度值)时,该骨骼被标记为异常。
链接: https://arxiv.org/abs/2506.18368
作者: Anja Delić,Matej Grcić,Siniša Šegvić
机构: University of Zagreb (萨格勒布大学); Faculty of Electrical Engineering and Computing (电气工程与计算学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected with unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e. receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.
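SeeKer 的异常分数是各关键点对数条件密度的加权和:score = -Σ_k w_k · log p(x_k | ·),权重取关键点检测器的置信度。以下给出该打分方式的极简示意;条件高斯参数本应由自回归网络依据历史骨骼运动预测,此处用占位张量代替。

```python
import torch

def seeker_anomaly_score(keypoints, mus, log_sigmas, confidences):
    """异常分数示意: score = -sum_k w_k * log N(x_k; mu_k, sigma_k^2)。
    keypoints: (K, 2) 观测关键点坐标;mus/log_sigmas: 条件高斯参数(占位);
    confidences: (K,) 关键点置信度,作为加权系数。分数越高越异常。"""
    dist = torch.distributions.Normal(mus, log_sigmas.exp())
    log_cond = dist.log_prob(keypoints).sum(dim=-1)   # 每个关键点的对数条件密度
    return -(confidences * log_cond).sum()

K = 17                                                # 假设每个骨骼 17 个关键点
score = seeker_anomaly_score(torch.randn(K, 2), torch.zeros(K, 2),
                             torch.zeros(K, 2), torch.ones(K))
print(float(score))
```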
zh
[CV-71] Spatial frequency information fusion network for few-shot learning
【速读】:该论文旨在解决小样本学习(Few-shot learning)中由于类别图像数量较少而导致的过拟合和泛化性能差的问题。现有许多小样本分类模型过于关注空间域信息而忽略了频率域信息,而频率域信息包含更多特征信息,忽略该部分会限制模型对特征信息的充分挖掘,从而影响分类性能。为了解决这一问题,本文提出了一种基于传统数据增强的SFIFNet方法,其关键在于通过将频率域信息与空间域信息相结合,提升图像特征表示的准确性。
链接: https://arxiv.org/abs/2506.18364
作者: Wenqing Zhao,Guojia Xie,Han Pan,Biao Yang,Weichuan Zhang
机构: School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology (电子信息与人工智能学院,陕西科技大学); Society of Entrepreneurs and Ecology (SEE) Foundation (企业家与生态协会(SEE)基金会); Key Laboratory of Southwest China Wildlife Resources Conservation (Ministry of Education) (西南野生动物资源保护重点实验室(教育部))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The objective of few-shot learning is to fully leverage limited data resources to explore latent correlations within the data and train a model whose performance can adequately meet the demands of practical applications. In practical applications, the number of images in each category is usually far smaller than in traditional deep learning, which can lead to over-fitting and poor generalization performance. Currently, many few-shot classification models pay more attention to spatial domain information while neglecting frequency domain information, which contains more feature information. Ignoring frequency domain information will prevent the model from fully exploiting feature information, which would affect the classification performance. Based on conventional data augmentation, this paper proposes an SFIFNet with innovative data preprocessing. The key of this method is enhancing the accuracy of image feature representation by integrating frequency domain information with spatial domain information. The experimental results demonstrate the effectiveness of this method in enhancing classification performance.
zh
[CV-72] BSMamba: Brightness and Semantic Modeling for Long-Range Interaction in Low-Light Image Enhancement
【速读】:该论文旨在解决低光照图像增强(LLIE)方法在同时提升亮度、保持语义一致性、细节信息以及计算效率方面存在的显著局限性。其解决方案的关键在于提出一种新型视觉状态空间模型——BSMamba,该模型包含两个专门设计的组件:亮度状态空间(Brightness Mamba)和语义状态空间(Semantic Mamba)。亮度状态空间通过优先连接亮度相似的远距离令牌,实现基于亮度引导的选择性注意力机制,从而有效提升亮度恢复效果;语义状态空间则通过优先连接语义相似的令牌,维持图像的上下文一致性,确保语义层次结构在增强过程中得以保留。BSMamba通过基于亮度和语义相似性的智能令牌建模,突破了传统固定扫描模式的限制,同时遵循因果建模原则,实现了更优的LLIE性能。
链接: https://arxiv.org/abs/2506.18346
作者: Tongshun Zhang,Pingping Liu,Mengen Cai,Zijian Zhang,Yubing Lu,Qiuzhan Zhou
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current low-light image enhancement (LLIE) methods face significant limitations in simultaneously improving brightness while preserving semantic consistency, fine details, and computational efficiency. With the emergence of state-space models, particularly Mamba, image restoration has achieved remarkable performance, yet existing visual Mamba approaches flatten 2D images into 1D token sequences using fixed scanning rules, critically limiting interactions between distant tokens with causal relationships and constraining their ability to capture meaningful long-range dependencies. To address these fundamental limitations, we propose BSMamba, a novel visual Mamba architecture comprising two specially designed components: Brightness Mamba and Semantic Mamba. The Brightness Mamba revolutionizes token interaction patterns by prioritizing connections between distant tokens with similar brightness levels, effectively addressing the challenge of brightness restoration in LLIE tasks through brightness-guided selective attention. Complementing this, the Semantic Mamba establishes priority interactions between tokens sharing similar semantic meanings, allowing the model to maintain contextual consistency by connecting semantically related regions across the image, thus preserving the hierarchical nature of image semantics during enhancement. By intelligently modeling tokens based on brightness and semantic similarity rather than arbitrary scanning patterns, BSMamba transcends the constraints of conventional token sequencing while adhering to the principles of causal modeling. Extensive experiments demonstrate that BSMamba achieves state-of-the-art performance in LLIE while preserving semantic consistency.
zh
[CV-73] Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention CVPR
【速读】:该论文旨在解决医学图像分割中由于染色和形态变化导致的特征提取受限问题,以及在数据样本有限的情况下,端到端方法性能不足的问题。其解决方案的关键在于提出一种能够捕捉多尺度局部与全局上下文信息的架构,以及一种新型解码器设计,该设计能有效整合编码器特征、强调重要通道和区域,并重建空间维度以提升分割精度。
链接: https://arxiv.org/abs/2506.18335
作者: Saad Wazir,Daeyoung Kim
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 30861-30871
Abstract:Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer and CNN based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. Specifically, our method achieves absolute performance gains of 2.76% on MoNuSeg, 3.12% on DSB, 2.87% on Electron Microscopy, and 4.03% on TNBC datasets compared to existing SOTA methods. Code: this https URL
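标题中的 "Depth-to-Space Restoration" 即用通道到空间的重排(PyTorch 中的 PixelShuffle)来恢复空间分辨率,以替代转置卷积或插值上采样。以下是一个解码器上采样块的最小示意,通道数与上采样倍率为假设值,论文中的残差线性注意力未包含在内。

```python
import torch
import torch.nn as nn

class DepthToSpaceUp(nn.Module):
    """Depth-to-Space 上采样示意: 先用 1x1 卷积扩充通道,
    再用 PixelShuffle 将通道维重排为空间维: (C*r^2, H, W) -> (C, H*r, W*r)。"""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.proj(x))

up = DepthToSpaceUp(64, 32)
print(up(torch.randn(1, 64, 16, 16)).shape)   # torch.Size([1, 32, 32, 32])
```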
zh
[CV-74] Geometry-Aware Preference Learning for 3D Texture Generation
【速读】:该论文试图解决3D生成模型生成的内容与人类主观偏好或任务特定标准不一致的问题,以及3D纹理生成领域中现有方法依赖于2D文本到图像生成模型所带来的对3D结构理解不足的问题。解决方案的关键在于提出一种端到端的可微分偏好学习框架,该框架通过整个3D生成流程反向传播由可微分奖励函数表示的人类偏好,从而使生成过程具备固有的几何感知能力。
链接: https://arxiv.org/abs/2506.18331
作者: AmirHossein Zamani,Tianhao Xie,Amir G. Aghdam,Tiberiu Popa,Eugene Belilovsky
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D generative models have achieved impressive results but 3D contents generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.
zh
[CV-75] NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation
【速读】:该论文试图解决文本到图像(Text-to-Image, T2I)模型在生成有害内容(如色情、暴力、歧视性内容)方面的安全问题,这一问题与T2I技术的伦理目标相悖,并阻碍其可持续发展。解决方案的关键在于提出一种无需修改模型架构且不降低生成能力的去毒方法——NSFW-Classifier Guided Prompt Sanitization (PromptSan),其核心思想是利用NSFW分类器对输入提示进行净化,包括通过迭代替换有害标记(PromptSan-Modify)或训练优化后缀标记序列以中和有害意图(PromptSan-Suffix),从而有效减少有害内容的生成并平衡安全性与可用性。
链接: https://arxiv.org/abs/2506.18325
作者: Yu Xie,Chengjie Zeng,Lingyun Zhang,Yanwei Fu
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by “jailbreak” attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
zh
[CV-76] A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement
【速读】:该论文试图解决低光照图像增强(low-light image enhancement)问题,尤其是在缺乏成对训练数据的情况下。其解决方案的关键在于提出一种新颖的零样本学习框架LucentVisionNet,该框架结合了多尺度空间注意力机制与深度曲线估计网络,实现了细粒度增强的同时保持语义和感知保真度,并通过递归增强策略和复合损失函数优化模型,提升了泛化能力与图像质量。
链接: https://arxiv.org/abs/2506.18323
作者: Muhammad Azeem Aslam,Hassan Khalid,Nisar Ahmed
机构: Xi’an Eurasia University (西安欧亚大学); University of Engineering and Technology Lahore (拉合尔工程与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-light image enhancement remains a challenging task, particularly in the absence of paired training data. In this study, we present LucentVisionNet, a novel zero-shot learning framework that addresses the limitations of traditional and deep learning-based enhancement methods. The proposed approach integrates multi-scale spatial attention with a deep curve estimation network, enabling fine-grained enhancement while preserving semantic and perceptual fidelity. To further improve generalization, we adopt a recurrent enhancement strategy and optimize the model using a composite loss function comprising six tailored components, including a novel no-reference image quality loss inspired by human visual perception. Extensive experiments on both paired and unpaired benchmark datasets demonstrate that LucentVisionNet consistently outperforms state-of-the-art supervised, unsupervised, and zero-shot methods across multiple full-reference and no-reference image quality metrics. Our framework achieves high visual quality, structural consistency, and computational efficiency, making it well-suited for deployment in real-world applications such as mobile photography, surveillance, and autonomous navigation.
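"深度曲线估计"一脉(如 Zero-DCE)的增强公式为逐像素迭代二次曲线 LE(x) = x + α·x·(1-x)。以下按该公开公式给出示意:α 图本应由曲线估计网络逐像素预测,此处用常数占位,迭代次数亦为假设,未必与 LucentVisionNet 的实现一致。

```python
import torch

def apply_enhancement_curves(x, alphas):
    """迭代曲线增强示意: 每步 x <- x + alpha_i * x * (1 - x)。
    x: (B, 3, H, W),取值范围 [0, 1];alphas: 各迭代步的逐像素曲线参数。"""
    for a in alphas:
        x = x + a * x * (1.0 - x)
    return x.clamp(0.0, 1.0)

img = torch.rand(1, 3, 64, 64) * 0.2            # 模拟低光照输入
alphas = [torch.full_like(img, 0.6)] * 8         # 假设 8 次迭代、常数曲线参数
print(apply_enhancement_curves(img, alphas).mean())  # 整体亮度被提升
```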
zh
[CV-77] Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
【速读】:该论文试图解决多模态大视觉语言模型(LVLMs)在预训练过程中可能产生的虚假相关性问题,即非关键特征与目标标签之间的错误关联。传统基准测试通常采用人为设计的场景和狭窄任务,难以全面反映真实世界中的复杂情况。本文的关键解决方案是构建一个名为SpuriVerse的新基准,通过收集GPT-4o在真实世界视觉问答(VQA)基准中的错误,结合LVLM与人类标注及合成反事实评估,筛选出由虚假相关性导致的错误样本。该基准包含124种不同的虚假相关性类型,每种类型包含1个真实样本和10个合成样本,共计1364道多选题。实验表明,即使最先进的闭源模型在该基准上的准确率也仅达到37.1%,而通过合成样本进行微调可显著提升至78.40%,表明模型能够从多样化的虚假模式中学习并泛化到未见过的情境。
链接: https://arxiv.org/abs/2506.18322
作者: Yiwei Yang,Chung Peng Lee,Shangbin Feng,Dora Zhao,Bingbing Wen,Anthony Z. Liu,Yulia Tsvetkov,Bill Howe
机构: University of Washington (华盛顿大学); Stanford University (斯坦福大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Finetuning can cause spurious correlations to arise between non-essential features and the target labels, but benchmarks to study these effects involve contrived settings and narrow tasks. In contrast, we consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprised of 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples for a total of 1364 multiple choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 37.1% accuracy. Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid “shortcuts” and attend to the overall image context.
zh
[CV-78] Attention-Based Ensemble Learning for Crop Classification Using Landsat 8-9 Fusion
【速读】:该论文试图解决 irrigated agricultural regions 中作物覆盖识别的准确性问题,旨在通过整合遥感数据与先进建模技术提高作物分类精度。其解决方案的关键在于采用分阶段的数据采集方法,包括通过实地调查确定目标作物并进行地理编码,以及利用Landsat 8-9影像构建标注数据集;同时对卫星影像进行了辐射校准、大气校正和几何校正等预处理,并应用图像融合技术增强光谱信息,结合植被指数和原始反射率值进行分类建模,最终通过特征选择优化分类学习效果。
链接: https://arxiv.org/abs/2506.18321
作者: Zeeshan Ramzan,Nisar Ahmed,Qurat-ul-Ain Akram,Shahzad Asif,Muhammad Shahbaz,Rabin Chakrabortty,Ahmed F. Elaksher
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review in Earth Systems and Environment
Abstract:Remote sensing offers a highly effective method for obtaining accurate information on total cropped area and crop types. The study focuses on crop cover identification for irrigated regions of Central Punjab. Data collection was executed in two stages: the first involved identifying and geocoding six target crops through field surveys conducted in January and February 2023. The second stage involved acquiring Landsat 8-9 imagery for each geocoded field to construct a labelled dataset. The satellite imagery underwent extensive pre-processing, including radiometric calibration for reflectance values, atmospheric correction, and georeferencing verification to ensure consistency within a common coordinate system. Subsequently, image fusion techniques were applied to combine Landsat 8 and 9 spectral bands, creating a composite image with enhanced spectral information, followed by contrast enhancement. During data acquisition, farmers were interviewed, and fields were meticulously mapped using GPS instruments, resulting in a comprehensive dataset of 50,835 data points. This dataset facilitated the extraction of vegetation indices such as NDVI, SAVI, RECI, and NDRE. These indices and raw reflectance values were utilized for classification modeling using conventional classifiers, ensemble learning, and artificial neural networks. A feature selection approach was also incorporated to identify the optimal feature set for classification learning. This study demonstrates the effectiveness of combining remote sensing data and advanced modeling techniques to improve crop classification accuracy in irrigated agricultural regions.
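摘要中的植被指数可直接由波段反射率算出。以下给出 NDVI 与 SAVI 的标准公式示意(SAVI 的土壤调节系数 L 取常用值 0.5;示例反射率为虚构数据)。

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """归一化植被指数: NDVI = (NIR - Red) / (NIR + Red)。"""
    return (nir - red) / (nir + red + eps)

def savi(nir, red, L=0.5, eps=1e-8):
    """土壤调节植被指数: SAVI = (1 + L)(NIR - Red) / (NIR + Red + L)。"""
    return (1.0 + L) * (nir - red) / (nir + red + L + eps)

nir = np.array([0.45, 0.50, 0.30])   # 近红外波段反射率(示例值)
red = np.array([0.10, 0.08, 0.20])   # 红光波段反射率(示例值)
print(ndvi(nir, red), savi(nir, red))
```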
zh
[CV-79] Rapeseed population point cloud completion network (RP-PCN) with dynamic graph convolution for 3D reconstruction of crop canopy occlusion architecture
【速读】:该论文旨在解决作物冠层结构完整三维重建中的遮挡问题,从而更准确地评估作物光合作用和产量以指导理想型设计。其解决方案的关键在于提出了一种基于多视角成像的点云补全模型(RP-PCN),该模型结合了虚拟-现实融合(VRI)模拟方法和遮挡点检测算法来生成标注数据,并采用多分辨率动态图卷积编码器(MRDG)和点金字塔解码器(PPD)来预测遮挡点,同时引入动态图卷积特征提取器(DGCFE)以捕捉生长周期中的结构变化,从而提升点云补全的精度和产量预测的准确性。
链接: https://arxiv.org/abs/2506.18292
作者: Ziyue Guo(1 and 2),Xin Yang(1 and 2),Yutao Shen(1 and 2),Yang Zhu(3),Lixi Jiang(3),Haiyan Cen(1 and 2) ((1) College of Biosystems Engineering and Food Science, Zhejiang University, (2) Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, (3) Institute of Crop Science, Zhejiang University)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Quantitative descriptions of complete canopy architecture are crucial for evaluating crop photosynthesis and yield to guide ideotype design. Although three-dimensional (3D) sensing technologies have been developed for plant and canopy reconstruction, severe occlusion and complex architectures hinder accurate canopy descriptions. In this study, we propose a point cloud completion model for 3D reconstruction of rapeseed populations from seeding to silique stages using multi-view imaging. A complete point cloud generation framework was developed with the virtual-real integration (VRI) simulation method and occlusion point detection algorithm to annotate the training dataset by distinguishing surface from occluded points. The rapeseed population point cloud completion network (RP-PCN) was designed with a multi-resolution dynamic graph convolutional encoder (MRDG) and point pyramid decoder (PPD) to predict occluded points based on input surface point clouds. A dynamic graph convolutional feature extractor (DGCFE) was introduced to capture structural variations across the growth period. The effectiveness of point cloud completion was validated by predicting yield using architectural indicators from complete point clouds of rapeseed population. The results demonstrated that RP-PCN achieved chamfer distance (CD) values of 3.35 cm, 3.46 cm, 4.32 cm, and 4.51 cm at the seedling, bolting, flowering, and silique stages, respectively. Ablation studies showed the effectiveness of the MRDG and DGCFE modules, reducing CD values by 10% and 23%, respectively. The silique efficiency index (SEI) from RP-PCN improved yield prediction accuracy by 11.2% compared to incomplete point clouds. The RP-PCN pipeline proposed in this study has the potential to be extended to other crops, significantly enhancing the analysis of population canopy architectures in field environments.
zh
[CV-80] Selective Social-Interaction via Individual Importance for Fast Human Trajectory Prediction
【速读】:该论文试图解决在预测主要人员轨迹时如何有效选择重要邻近人员的问题。解决方案的关键在于提出了一种名为 Importance Estimator 的人员选择模块,该模块能够输出每个邻近人员对预测主要人员未来轨迹的重要性。为了解决基于重要性采样时非可微操作导致的梯度阻塞问题,研究者采用了 Gumbel Softmax 进行训练,从而实现了高效的轨迹预测。
链接: https://arxiv.org/abs/2506.18291
作者: Yota Urano,Hiromu Taketsugu,Norimichi Ukita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: MIRU 2025
Abstract:This paper presents an architecture for selecting important neighboring people to predict the primary person’s trajectory. To achieve effective neighboring people selection, we propose a people selection module called the Importance Estimator which outputs the importance of each neighboring person for predicting the primary person’s future trajectory. To prevent gradients from being blocked by non-differentiable operations when sampling surrounding people based on their importance, we employ the Gumbel Softmax for training. Experiments conducted on the JRDB dataset show that our method speeds up the process with competitive prediction accuracy.
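用 Gumbel Softmax 让"按重要性采样邻居"这一离散操作可微是标准技巧:前向给出近似 one-hot 的离散选择,反向用软化梯度(直通估计)。以下为通用示意,温度与打分的形状均为假设,并非论文的 Importance Estimator 实现。

```python
import torch
import torch.nn.functional as F

def select_neighbors(importance_logits, tau=1.0):
    """对每个邻居做二元(选/不选)的可微采样。
    importance_logits: (N, 2),第 1 列为"选中"的打分。
    hard=True 时前向输出离散 one-hot,反向沿软化分布传播梯度。"""
    samples = F.gumbel_softmax(importance_logits, tau=tau, hard=True)
    return samples[:, 1]               # (N,) 每个邻居的 0/1 选择掩码

logits = torch.randn(5, 2, requires_grad=True)
mask = select_neighbors(logits)
mask.sum().backward()                  # 梯度可回传到重要性估计器
print(mask)
```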
zh
[CV-81] Open Set Recognition for Endoscopic Image Classification: A Deep Learning Approach on the Kvasir Dataset
【速读】:该论文试图解决传统闭集分类框架在开放世界临床环境中存在的局限性,即面对未见过的病理情况时模型可靠性可能受到威胁的问题。解决方案的关键在于应用开放集识别(Open Set Recognition, OSR)技术,通过在Kvasir数据集上评估和比较多种深度学习架构(如ResNet-50、Swin Transformer及混合模型)的OSR能力,并采用OpenMax作为基准方法,以检验模型区分已知类别与未知类别的性能。
链接: https://arxiv.org/abs/2506.18284
作者: Kasra Moazzami,Seoyoun Son,John Lin,Sun Min Lee,Daniel Son,Hayeon Lee,Jeongho Lee,Seongji Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 3 tables
Abstract:Endoscopic image classification plays a pivotal role in medical diagnostics by identifying anatomical landmarks and pathological findings. However, conventional closed-set classification frameworks are inherently limited in open-world clinical settings, where previously unseen conditions can arise and compromise model reliability. To address this, we explore the application of Open Set Recognition (OSR) techniques on the Kvasir dataset, a publicly available and diverse endoscopic image collection. In this study, we evaluate and compare the OSR capabilities of several representative deep learning architectures, including ResNet-50, Swin Transformer, and a hybrid ResNet-Transformer model, under both closed-set and open-set conditions. OpenMax is adopted as a baseline OSR method to assess the ability of these models to distinguish known classes from previously unseen categories. This work represents one of the first efforts to apply open set recognition to the Kvasir dataset and provides a foundational benchmark for evaluating OSR performance in medical image analysis. Our results offer practical insights into model behavior in clinically realistic settings and highlight the importance of OSR techniques for the safe deployment of AI systems in endoscopy.
zh
[CV-82] ReFrame: Rectification Framework for Image Explaining Architectures
【速读】:该论文旨在解决图像解释过程中存在的不一致性和不完整性问题,即现有方法常常会生成图像中不存在的对象(hallucination)或未能识别图像中的所有对象。其解决方案的关键在于提出一种可解释的框架,该框架可以集成到多种图像解释系统(如图像描述、视觉问答和基于提示的AI)之上,通过修正错误或缺失的对象来增强这些系统的解释能力。
链接: https://arxiv.org/abs/2506.18272
作者: Debjyoti Das Adhikary,Aritra Hazra,Partha Pratim Chakrabarti
机构: Indian Institute of Technology, Kharagpur(印度理工学院,卡哈格普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CODS-COMAD December 2024
Abstract:Image explanation has been one of the key research interests in the Deep Learning field. Throughout the years, several approaches have been adopted to explain an input image fed by the user. From detecting an object in a given image to explaining it in human understandable sentence, to having a conversation describing the image, this problem has seen an immense change throughout the years, However, the existing works have been often found to (a) hallucinate objects that do not exist in the image and/or (b) lack identifying the complete set of objects present in the image. In this paper, we propose a novel approach to mitigate these drawbacks of inconsistency and incompleteness of the objects recognized during the image explanation. To enable this, we propose an interpretable framework that can be plugged atop diverse image explaining frameworks including Image Captioning, Visual Question Answering (VQA) and Prompt-based AI using LLMs, thereby enhancing their explanation capabilities by rectifying the incorrect or missing objects. We further measure the efficacy of the rectified explanations generated through our proposed approaches leveraging object based precision metrics, and showcase the improvements in the inconsistency and completeness of image explanations. Quantitatively, the proposed framework is able to improve the explanations over the baseline architectures of Image Captioning (improving the completeness by 81.81% and inconsistency by 37.10%), Visual Question Answering(average of 9.6% and 37.10% in completeness and inconsistency respectively) and Prompt-based AI model (0.01% and 5.2% for completeness and inconsistency respectively) surpassing the current state-of-the-art by a substantial margin.
zh
[CV-83] Adaptive Mask-guided K-space Diffusion for Accelerated MRI Reconstruction
【速读】:该论文试图解决磁共振成像(Magnetic Resonance Imaging, MRI)重建中传统方法未充分考虑k空间不同频率区域重要性的问题,从而影响重建图像质量。其解决方案的关键在于引入一种基于自适应掩码的扩散模型(Adaptive Mask Diffusion Model, AMDM),通过根据k空间数据动态调整频率分布,构建混合掩码机制,实现高频与低频成分的有效分离,并利用k空间频率分布生成自适应掩码,引导闭环扩散过程,从而提升MRI重建质量。
链接: https://arxiv.org/abs/2506.18270
作者: Qinrong Cai,Yu Guan,Zhibo Chen,Dong Liang,Qiuyun Fan,Qiegen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures
Abstract:As the deep learning revolution marches on, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training, and has demonstrated exceptional performance in multiple fields. Magnetic Resonance Imaging (MRI) reconstruction is a critical task in medical imaging that seeks to recover high-quality images from under-sampled k-space data. However, previous MRI reconstruction strategies usually optimized the entire image domain or k-space, without considering the importance of different frequency regions in the k-space. This work introduces a diffusion model based on adaptive masks (AMDM), which utilizes the adaptive adjustment of frequency distribution based on k-space data to develop a hybrid mask mechanism that adapts to different k-space inputs. This enables the effective separation of high-frequency and low-frequency components, producing diverse frequency-specific representations. Additionally, the k-space frequency distribution informs the generation of adaptive masks, which, in turn, guide a closed-loop diffusion process. Experimental results verified the ability of this method to learn specific frequency information and thereby improve the quality of MRI reconstruction, providing a flexible framework for optimizing k-space data using masks in the future.
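"分离 k 空间高低频成分"可以用一个简单的径向低通掩码直观理解:对图像做 FFT 并中心化,掩码内为低频、掩码外为高频。以下为示意实现;AMDM 的掩码是由 k 空间分布自适应生成的,此处的固定半径阈值仅为演示假设。

```python
import torch

def split_kspace(image, radius=0.1):
    """k 空间高低频分离示意: 对图像做 FFT 并中心化,
    按到中心的归一化距离构造低通掩码,低频取掩码内、高频取掩码外。"""
    k = torch.fft.fftshift(torch.fft.fft2(image))          # 中心化 k 空间
    h, w = image.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    low_mask = (yy ** 2 + xx ** 2).sqrt() <= radius
    k_low, k_high = k * low_mask, k * (~low_mask)
    to_img = lambda kk: torch.fft.ifft2(torch.fft.ifftshift(kk)).real
    return to_img(k_low), to_img(k_high)

img = torch.rand(1, 128, 128)
low, high = split_kspace(img)      # low: 平滑结构;high: 边缘与细节
```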
zh
[CV-84] hermalLoc: A Vision Transformer-Based Approach for Robust Thermal Camera Relocalization in Large-Scale Environments IROS2025
【速读】:该论文试图解决热成像图像中相机重定位(thermal image relocalization)的问题,即在热成像条件下如何准确恢复相机的绝对位姿。由于热成像与可见光成像在数据获取机制上的根本差异,传统针对可见光图像设计的视觉重定位方法无法直接应用于热成像场景。论文提出的解决方案关键在于引入ThermalLoc,这是一种端到端的深度学习方法,通过融合EfficientNet与Transformer架构,有效提取热图像的局部和全局特征,并利用两个MLP网络进行绝对位姿回归,从而实现了更精确和鲁棒的热成像重定位。
链接: https://arxiv.org/abs/2506.18268
作者: Yu Liu,Yangtao Meng,Xianfei Pan,Jie Jiang,Changhao Chen
机构: National University of Defense Technology (国防科技大学); China Academy of Launch Vehicle Technology (中国运载火箭技术研究院); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures, accepted to IROS 2025
Abstract:Thermal cameras capture environmental data through heat emission, a fundamentally different mechanism compared to visible light cameras, which rely on pinhole imaging. As a result, traditional visual relocalization methods designed for visible light images are not directly applicable to thermal images. Despite significant advancements in deep learning for camera relocalization, approaches specifically tailored for thermal camera-based relocalization remain underexplored. To address this gap, we introduce ThermalLoc, a novel end-to-end deep learning method for thermal image relocalization. ThermalLoc effectively extracts both local and global features from thermal images by integrating EfficientNet with Transformers, and performs absolute pose regression using two MLP networks. We evaluated ThermalLoc on both the publicly available thermal-odometry dataset and our own dataset. The results demonstrate that ThermalLoc outperforms existing representative methods employed for thermal camera relocalization, including AtLoc, MapNet, PoseNet, and RobustLoc, achieving superior accuracy and robustness.
zh
[CV-85] YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos
【速读】:该论文旨在解决3D语义占据预测中对精确几何关系依赖过高的问题,特别是在复杂室内环境中,由于数据采集设备的复杂性和隐私问题,大规模、细粒度标注的数据收集变得不切实际。其解决方案的关键在于利用仅包含室内互联网数据(如YouTube房屋游览视频)进行3D空间精准训练,无需任何相机内参或外参的先验知识,并通过构建一个完全自监督的模型,将可获取的2D先验知识迁移至3D占据网络,具体通过将相似像素分组为超像素来蒸馏2D区域级知识,从而实现强大的室内3D感知能力。
链接: https://arxiv.org/abs/2506.18266
作者: Haoming Chen,Lichen Yuan,TianFang Sun,Jingyu Gong,Xin Tan,Zhizhong Zhang,Yuan Xie
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D semantic occupancy prediction in the past was considered to require precise geometric relationships in order to enable effective training. However, in complex indoor environments, the large-scale and widespread collection of data, along with the necessity for fine-grained annotations, becomes impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without the need for any pre-knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Upon this web dataset, we establish a fully self-supervised model to leverage accessible 2D prior knowledge for reaching powerful 3D indoor perception. Specifically, we harness the advantages of the prosperous vision foundation models, distilling the 2D region-level knowledge into the occupancy network by grouping similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet).
zh
[CV-86] Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain
【速读】:该论文试图解决弱监督时间动作定位(Weakly supervised temporal action localization)问题,即在训练过程中仅提供视频级别的标注信息的情况下,如何准确地定位视频中各个动作的起止时间。解决方案的关键在于提出一种两阶段的方法,充分利用时间域中的多分辨率信息,并基于外观和运动流生成高质量的帧级伪标签。第一阶段通过初始标签生成模块(ILG)生成可靠的初始帧级伪标签,第二阶段通过渐进式时间标签优化框架(PTLR)迭代优化伪标签,并利用高置信度的选定帧训练神经网络,从而提升每个帧的动作类别得分预测性能。
链接: https://arxiv.org/abs/2506.18261
作者: Rui Su,Dong Xu,Luping Zhou,Wanli Ouyang
机构: The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
Abstract:Weakly supervised temporal action localization is a challenging task as only the video-level annotation is available during the training process. To address this problem, we propose a two-stage approach to fully exploit multi-resolution information in the temporal domain and generate high quality frame-level pseudo labels based on both appearance and motion streams. Specifically, in the first stage, we generate reliable initial frame-level pseudo labels, and in the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. Specifically, in order to obtain reliable initial frame-level pseudo labels, in the first stage, we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high quality class activation sequences (CASs), which consist of a number of sequences with each sequence measuring how likely each video frame belongs to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In our PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, the multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each stream (i.e., the OTS/RTS stream) by exploiting the refined pseudo labels from another stream (i.e., the RTS/OTS stream).
zh
[CV-87] Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models ICML2025
【速读】:该论文试图解决扩散模型在生成过程中计算效率低的问题,旨在实现无损加速。解决方案的关键在于提出Morse框架,通过结合快速跳跃采样和自适应残差反馈策略,重新构建从噪声到数据的迭代生成过程。该框架包含两个交互的模型:Dash模型作为预训练扩散模型,在跳跃采样模式下运行以提升采样效率;Dot模型则快速生成残差反馈,用于修正噪声估计以匹配Dash模型的下一步估计,从而在不牺牲生成质量的前提下提高整体运行效率。
链接: https://arxiv.org/abs/2506.18251
作者: Chao Li,Jiawei Fan,Anbang Yao
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted to ICML 2025. The project page: this https URL
Abstract:In this paper, we present Morse, a simple dual-sampling framework for accelerating diffusion models losslessly. The key insight of Morse is to reformulate the iterative generation (from noise to data) process via taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model, which is learnt to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. Our method shows a lossless speedup of 1.78X to 3.31X on average over a wide range of sampling step budgets relative to 9 baseline diffusion models on 6 image generation tasks. Furthermore, we show that our method can be also generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with consistency distillation technique) tailored for few-step text-to-image synthesis. The code and models are available at this https URL.
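下面给出一段示意性的 Python 草图,帮助理解 Dash/Dot 两模型按时间交错运行的双采样流程;`dash`、`dot` 的调用签名与末尾的更新规则均为本文为演示而作的假设,并非论文官方实现。

```python
import torch

def morse_style_sample(dash, dot, x, timesteps, jump=4):
    """Hedged sketch of a Morse-style dual-sampling loop: the pre-trained
    `dash` denoiser is queried only at jump points, while the cheap `dot`
    model adds residual feedback in between. Signatures and the update
    rule are assumptions made for illustration."""
    for i in range(0, len(timesteps) - 1, jump):
        eps = dash(x, timesteps[i])              # expensive call at a jump point
        for j in range(1, jump):                 # cheap residual corrections
            t_mid = timesteps[min(i + j, len(timesteps) - 1)]
            eps = eps + dot(x, t_mid, eps)       # lift eps toward the next-step estimate
        x = x - eps                              # placeholder for the scheduler update
    return x
```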
zh
[CV-88] Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability
【速读】:该论文旨在解决生成式对抗攻击中,现有方法未能充分利用生成模型的表征能力来保留和利用语义信息的问题,特别是生成器中间激活层中编码的丰富语义特征(如物体边界和粗略形状)未被充分挖掘,从而限制了扰动与目标显著区域的对齐,影响了对抗迁移性。其解决方案的关键在于引入一种基于均值教师(Mean Teacher)的语义结构感知攻击框架,通过时间平滑的特征参考引导学生模型与语义丰富的教师模型在早期层激活之间的语义一致性,并将扰动生成锚定在生成器中具有语义显著性的中间块上,从而提升对抗扰动在关键区域的生成效果。
链接: https://arxiv.org/abs/2506.18248
作者: Jongoh Jeong,Hunmin Yang,Jaeseok Jeong,Kuk-Jin Yoon
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features–object boundaries and coarse shapes–that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).
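摘要中的"均值教师 + 特征蒸馏"可以用如下最小草图来理解:教师是学生权重的指数滑动平均(EMA),学生早期层激活向教师对齐;具体动量值与 MSE 形式为常见做法,非论文细节。

```python
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    # The teacher starts as a deep copy of the student and is never trained directly.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad = False
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Temporally smoothed reference: exponential moving average of student weights.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def feature_distill_loss(student_feats, teacher_feats):
    # Align early-layer activations (lists of hooked tensors) with the teacher's.
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))
```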
zh
[CV-89] Referring Expression Instance Retrieval and A Strong End-to-End Baseline
【速读】:该论文试图解决在现实场景中同时需要实例级检索和跨大规模图库的定位任务,传统文本-图像检索(TIR)在精度上不足,而指代表达理解(REC)在可扩展性上存在局限的问题。解决方案的关键在于提出一个新的任务——指代表达实例检索(REIR),并构建了大规模基准数据集REIRCOCO,同时设计了CLARE方法,其核心是采用双流架构与基于关系专家的混合模块(MORE),结合目标检测、REC预训练及对比语言-实例对齐(CLIA)实现端到端优化,从而有效提升模型在REIR任务中的性能及对TIR和REC的泛化能力。
链接: https://arxiv.org/abs/2506.18246
作者: Xiangzhao Hao,Kuan Zhu,Hongyu Guo,Haiyun Guo,Ming Tang,JinQiao Wang
机构: Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Natural language querying of visual content underpins many vision-language tasks, typically categorized by text granularity and visual search scope. Text-Image Retrieval (TIR) retrieves whole images using coarse descriptions, while Referring Expression Comprehension (REC) localizes objects using fine-grained expressions within a single image. However, real-world scenarios often require both instance-level retrieval and localization across large galleries – tasks where TIR lacks precision and REC lacks scalability. To address this gap, we propose a new task: Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization. We introduce REIRCOCO, a large-scale benchmark constructed by prompting vision-language models to generate fine-grained expressions for MSCOCO and RefCOCO instances. We also present a baseline method, CLARE, featuring a dual-stream architecture with a Mix of Relation Experts (MORE) module for capturing inter-instance relationships. CLARE integrates object detection and REC pretraining with Contrastive Language-Instance Alignment (CLIA) for end-to-end optimization. Experiments show that CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC, highlighting its effectiveness and versatility.
zh
[CV-90] Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
【速读】:该论文试图解决视觉语言模型 (Vision-Language Models, VLMs) 在自动驾驶(AD)中从场景推理到运动规划的衔接问题,具体包括两个关键挑战:一是VLMs倾向于依赖历史输入信息而学习“捷径”,从而在未真正理解视觉输入的情况下获得看似强大的规划结果;二是链式思维(COT)推理过程与运动规划结果存在偏差,如何有效利用复杂的推理能力来提升规划效果仍缺乏深入研究。解决方案的关键在于提出Drive-R1模型,该模型通过在包含长短两类COT数据的精细数据集上进行监督微调,引导其从视觉输入逐步推理至最终规划决策,并在强化学习框架下训练,以奖励机制激励发现对规划更具信息量的推理路径,从而实现更优的运动规划性能。
链接: https://arxiv.org/abs/2506.18234
作者: Yue Li,Meng Tian,Dechang Zhu,Jiangtong Zhu,Zhenyu Lin,Zhiwei Xiong,Xinhai Zhao
机构: University of Science and Technology of China (中国科学技术大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Large vision-language models (VLMs) for autonomous driving (AD) are evolving beyond perception and cognition tasks toward motion planning. However, we identify two critical challenges in this direction: (1) VLMs tend to learn shortcuts by relying heavily on history input information, achieving seemingly strong planning results without genuinely understanding the visual inputs; and (2) the chain-of-thought (COT) reasoning processes are always misaligned with the motion planning outcomes, and how to effectively leverage the complex reasoning capability to enhance planning remains largely underexplored. In this paper, we start from a small-scale domain-specific VLM and propose Drive-R1, designed to bridge scenario reasoning and motion planning for AD. Drive-R1 first undergoes supervised fine-tuning on an elaborate dataset containing both long and short COT data. Drive-R1 is encouraged to reason step-by-step from visual input to final planning decisions. Subsequently, Drive-R1 is trained within a reinforcement learning framework that incentivizes the discovery of reasoning paths that are more informative for planning, guided by rewards based on predicted trajectories and meta actions. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate that Drive-R1 achieves superior performance compared to existing state-of-the-art VLMs. We believe that Drive-R1 presents a promising direction for bridging reasoning and planning in AD, offering methodological insights for future research and applications.
zh
[CV-91] Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation
【速读】:该论文旨在解决自回归条件图像生成模型在推理过程中因过长上下文导致的内存开销大和计算延迟问题。其解决方案的关键在于提出一种无需训练的上下文优化方法——自适应动态稀疏注意力(Adaptive Dynamic Sparse Attention, ADSA),该方法通过动态识别维持局部纹理一致性和全局语义连贯性所需的历史标记,从而高效地简化注意力计算,并结合针对ADSA设计的动态KV-cache更新机制,显著降低了推理过程中的GPU内存消耗。
链接: https://arxiv.org/abs/2506.18226
作者: Xunzhi Xiang,Qi Fan
机构: Nanjing University, School of Intelligent Science and Technology (南京大学智能科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately 50%. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
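作为理解辅助,下面是一个动态稀疏注意力式 token 选择的最小草图:按近期查询对历史 KV 的注意力质量排序并保留 top-k;打分规则与 `keep_ratio` 均为本文假设,并非 ADSA 的官方准则。

```python
import torch

def select_context_tokens(q_recent, k_cache, keep_ratio=0.5):
    """Rank cached keys by their total attention mass from the latest
    queries and keep the top fraction. A minimal stand-in for dynamic
    sparse context selection; the scoring rule is an assumption."""
    scores = (q_recent @ k_cache.transpose(-2, -1)).softmax(dim=-1)  # (Q, N)
    importance = scores.sum(dim=0)                                    # (N,)
    k = max(1, int(keep_ratio * k_cache.shape[0]))
    keep = importance.topk(k).indices.sort().values
    return keep  # indices into the KV-cache to retain
```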
zh
[CV-92] Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano
【速读】:该论文旨在解决低资源环境中难以获取可靠诊断设备,从而导致视网膜疾病早期准确识别困难的问题。其解决方案的关键在于开发一种轻量级、可部署于边缘设备的疾病分类器,通过跨架构知识蒸馏(cross-architecture knowledge distillation)将高性能的视觉Transformer(ViT)教师模型的知识迁移至基于卷积神经网络(CNN)的学生模型中,以实现在资源受限条件下的高效诊断。
链接: https://arxiv.org/abs/2506.18220
作者: Berk Yilmaz,Aniruddh Aiyengar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 10 figures. Berk Yilmaz and Aniruddh Aiyengar contributed equally to this work
Abstract:Early and accurate identification of retinal ailments is crucial for averting ocular decline; however, access to dependable diagnostic devices is not often available in low-resourced settings. This project proposes to solve that by developing a lightweight, edge-device deployable disease classifier using cross-architecture knowledge distillation. We first train a high-capacity vision transformer (ViT) teacher model, pre-trained using I-JEPA self-supervised learning, to classify fundus images into four classes: Normal, Diabetic Retinopathy, Glaucoma, and Cataract. We kept an Internet of Things (IoT) focus when compressing to a CNN-based student model for deployment in resource-limited conditions, such as the NVIDIA Jetson Nano. This was accomplished using a novel framework which included a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a multi-view robust training method. The teacher model has 97.4 percent more parameters than the student model, which achieves 89 percent classification accuracy while retaining roughly 93 percent of the teacher's diagnostic performance. The retention of clinical classification behavior supports our method's initial aim: compression of the ViT while retaining accuracy. Our work serves as an example of a scalable, AI-driven triage solution for retinal disorders in under-resourced areas.
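下面给出标准知识蒸馏损失与一个线性投影器的最小示意(Hinton 式软标签 KD);论文实际使用的 PCA/GL 投影器结构更复杂,此处仅作概念演示,维度均为假设值。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProjector(nn.Module):
    # Maps CNN student features into the ViT teacher's embedding space;
    # a plain stand-in for the paper's PCA/GL projectors.
    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)

    def forward(self, f):
        return self.proj(f)

def kd_loss(student_logits, teacher_logits, T=4.0, alpha=0.7, labels=None):
    # Soft-label KD: KL between temperature-scaled distributions, scaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    if labels is None:
        return soft
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```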
zh
[CV-93] Shape from Polarization of Thermal Emission and Reflection
【速读】:该论文试图解决透明物体的形状估计问题,这一问题由于其复杂的光传输特性而具有挑战性。解决方案的关键在于利用长波红外(LWIR)波段的偏振信息进行形状恢复(SfP),并提出了一种考虑发射与反射共同影响的偏振模型,以克服以往研究中因缺乏准确偏振建模而导致的显著误差。此外,研究还结合了基于物理的合成数据集训练的神经网络方法,以及对系统性误差的建模,从而提高了形状估计的准确性与适用性。
链接: https://arxiv.org/abs/2506.18217
作者: Kazuma Kitazawa,Tsuyoshi Takatani
机构: University of Tsukuba (筑波大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCP2025
Abstract:Shape estimation for transparent objects is challenging due to their complex light transport. To circumvent these difficulties, we leverage the Shape from Polarization (SfP) technique in the Long-Wave Infrared (LWIR) spectrum, where most materials are opaque and emissive. While a few prior studies have explored LWIR SfP, these attempts suffered from significant errors due to inadequate polarimetric modeling, particularly the neglect of reflection. Addressing this gap, we formulated a polarization model that explicitly accounts for the combined effects of emission and reflection. Based on this model, we estimated surface normals using not only a direct model-based method but also a learning-based approach employing a neural network trained on a physically-grounded synthetic dataset. Furthermore, we modeled the LWIR polarimetric imaging process, accounting for inherent systematic errors to ensure accurate polarimetry. We implemented a prototype system and created ThermoPol, the first real-world benchmark dataset for LWIR SfP. Through comprehensive experiments, we demonstrated the high accuracy and broad applicability of our method across various materials, including those transparent in the visible spectrum.
zh
[CV-94] Deep Learning-based Alignment Measurement in Knee Radiographs MICCAI2025
【速读】:该论文旨在解决传统放射学膝关节对线(Radiographic Knee Alignment, KA)测量方法存在手动操作、耗时且需要长腿X光片的问题。其关键解决方案是提出一种基于深度学习的方法,通过自动定位膝关节解剖标志点来实现KA的测量,该方法采用沙漏网络(hourglass networks)并结合注意力门结构,以增强鲁棒性和关注关键解剖特征,从而在术前和术后图像中实现高精度的膝关节角度测量。
链接: https://arxiv.org/abs/2506.18209
作者: Zhisen Hu,Dominic Cullen,Peter Thompson,David Johnson,Chang Bian,Aleksei Tiulpin,Timothy Cootes,Claudia Lindner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to MICCAI 2025
Abstract:Radiographic knee alignment (KA) measurement is important for predicting joint health and surgical outcomes after total knee replacement. Traditional methods for KA measurements are manual, time-consuming and require long-leg radiographs. This study proposes a deep learning-based method to measure KA in anteroposterior knee radiographs via automatically localized knee anatomical landmarks. Our method builds on hourglass networks and incorporates an attention gate structure to enhance robustness and focus on key anatomical features. To our knowledge, this is the first deep learning-based method to localize over 100 knee anatomical landmarks to fully outline the knee shape while integrating KA measurements on both pre-operative and post-operative images. It provides highly accurate and reliable anatomical varus/valgus KA measurements using the anatomical tibiofemoral angle, achieving mean absolute differences ~1° when compared to clinical ground truth measurements. Agreement between automated and clinical measurements was excellent pre-operatively (intra-class correlation coefficient (ICC) = 0.97) and good post-operatively (ICC = 0.86). Our findings demonstrate that KA assessment can be automated with high accuracy, creating opportunities for digitally enhanced clinical workflows.
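摘要提到在沙漏网络中加入注意力门结构;下面给出经典加性注意力门(Attention U-Net 风格,Oktay 等)的最小实现作为参考,通道数等超参为假设值,并非论文的具体配置。

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Standard additive attention gate (Oktay et al.), the kind of block
    an hourglass network can adopt to focus on key anatomical features."""
    def __init__(self, ch_x, ch_g, ch_mid):
        super().__init__()
        self.wx = nn.Conv2d(ch_x, ch_mid, 1)   # transform skip features
        self.wg = nn.Conv2d(ch_g, ch_mid, 1)   # transform gating signal
        self.psi = nn.Conv2d(ch_mid, 1, 1)     # collapse to a spatial mask

    def forward(self, x, g):
        # x: skip features, g: gating signal at the same spatial size
        a = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * a  # suppress irrelevant regions, keep landmark-bearing ones
```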
zh
[CV-95] Limitations of NERF with pre-trained Vision Features for Few-Shot 3D Reconstruction
【速读】:该论文试图解决在极少数样本(extreme few-shot)情况下,基于神经辐射场(Neural Radiance Fields, NeRF)的3D场景重建性能不足的问题。其解决方案的关键在于评估和比较不同类型的预训练视觉特征(如DINO特征)对重建效果的影响,包括冻结DINO特征、LoRA微调特征以及多尺度特征融合等方法。然而,实验结果表明,所有DINO增强的NeRF模型均表现劣于基线NeRF,这提示预训练视觉特征可能在少样本重建任务中存在特征-任务不匹配、过拟合以及集成困难等问题,从而质疑了当前领域中关于预训练特征有益性的普遍假设。
链接: https://arxiv.org/abs/2506.18208
作者: Ankit Sanjyal
机构: Fordham University (福特汉姆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 table, 2 figures. First submission. Code available at: this https URL
Abstract:Neural Radiance Fields (NeRF) have revolutionized 3D scene reconstruction from sparse image collections. Recent work has explored integrating pre-trained vision features, particularly from DINO, to enhance few-shot reconstruction capabilities. However, the effectiveness of such approaches remains unclear, especially in extreme few-shot scenarios. In this paper, we present a systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Surprisingly, our experiments reveal that all DINO variants perform worse than the baseline NeRF, achieving PSNR values around 12.9 to 13.0 compared to the baseline’s 14.71. This counterintuitive result suggests that pre-trained vision features may not be beneficial for few-shot 3D reconstruction and may even introduce harmful biases. We analyze potential causes including feature-task mismatch, overfitting to limited data, and integration challenges. Our findings challenge common assumptions in the field and suggest that simpler architectures focusing on geometric consistency may be more effective for few-shot scenarios.
zh
[CV-96] Multimodal Fusion SLAM with Fourier Attention
【速读】:该论文旨在解决在噪声、光照变化和黑暗等复杂环境下视觉同步定位与建图(Visual SLAM)的挑战。其关键解决方案是提出FMF-SLAM,一种高效的多模态融合SLAM方法,通过快速傅里叶变换(FFT)提升算法效率,并引入基于傅里叶的自注意力和交叉注意力机制以提取RGB和深度信号的特征,同时结合跨模态的多尺度知识蒸馏增强多模态特征交互,从而实现实时且实用的性能。
链接: https://arxiv.org/abs/2506.18204
作者: Youjie Zhou,Guofeng Mei,Yiming Wang,Yi Wan,Fabio Poiesi
机构: Shandong University (山东大学); Fondazione Bruno Kessler (布鲁诺·凯塞尔基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational cost. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes fast Fourier transform (FFT) to enhance the algorithm efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it with a security robot, fusing it with a global positioning module GNSS-RTK and global Bundle Adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. The code and datasets are available at this https URL.
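论文的 Fourier 注意力细节未在摘要中展开;下面用 FNet 风格的 FFT token 混合器做一个概念性替代示意,说明"用 FFT 代替二次复杂度注意力进行 token 混合"的思路,并非 FMF-SLAM 的官方模块。

```python
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    """FNet-style stand-in for Fourier-based attention: mix tokens with a
    2D FFT instead of quadratic attention, then refine channel-wise."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, tokens):  # tokens: (B, N, C)
        mixed = torch.fft.fft2(tokens.float()).real  # FFT over (N, C), keep real part
        tokens = tokens + mixed                      # residual token mixing
        return tokens + self.mlp(self.norm(tokens))  # channel refinement
```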
zh
[CV-97] DExNet: Combining Observations of Domain Adapted Critics for Leaf Disease Classification with Limited Data
【速读】:该论文旨在解决在样本数量有限的情况下,植物叶片疾病分类模型难以达到满意性能的问题。其关键解决方案是提出一种基于少样本学习的框架——领域自适应专家网络(Domain-adapted Expert Network, DExNet),通过结合多个专家评论者的观察来弥补训练数据不足的问题。该方法首先从九个预训练的卷积神经网络(CNN)架构中提取特征嵌入作为“观察”,并通过一个公开可用的无重叠类别的叶部疾病数据集对这些评论者进行领域适应,随后将这些观察输入到特征融合模块和包含双向长短期记忆(Bi-LSTM)层的分类器网络中,从而实现高效的分类性能。
链接: https://arxiv.org/abs/2506.18173
作者: Sabbir Ahmed,Md. Bakhtiar Hasan,Tasnim Ahmed,Md. Hasanul Kabir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACPR Springer, 15 pages, 1 Figure, 7 Tables, and lots of efforts :)
Abstract:While deep learning-based architectures have been widely used for correctly detecting and classifying plant diseases, they require large-scale datasets to learn generalized features and achieve state-of-the-art performance. This poses a challenge for such models to obtain satisfactory performance in classifying leaf diseases with limited samples. This work proposes a few-shot learning framework, Domain-adapted Expert Network (DExNet), for plant disease classification that compensates for the lack of sufficient training data by combining observations of a number of expert critics. It starts with extracting the feature embeddings as ‘observations’ from nine ‘critics’ that are state-of-the-art pre-trained CNN-based architectures. These critics are ‘domain adapted’ using a publicly available leaf disease dataset having no overlapping classes with the specific downstream task of interest. The observations are then passed to the ‘Feature Fusion Block’ and finally to a classifier network consisting of Bi-LSTM layers. The proposed pipeline is evaluated on the 10 classes of tomato leaf images from the PlantVillage dataset, achieving promising accuracies of 89.06%, 92.46%, and 94.07%, respectively, for 5-shot, 10-shot, and 15-shot classification. Furthermore, an accuracy of 98.09±0.7% has been achieved in 80-shot classification, which is only 1.2% less than state-of-the-art, allowing a 94.5% reduction in the training data requirement. The proposed pipeline also outperforms existing works on leaf disease classification with limited data in both laboratory and real-life conditions in single-domain, mixed-domain, and cross-domain scenarios.
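按摘要描述,九个评论者的特征嵌入经融合后送入 Bi-LSTM 分类器;下面给出一个结构草图,其中各维度、融合块的具体形式均为本文假设。

```python
import torch
import torch.nn as nn

class DExNetHead(nn.Module):
    """Sketch of a DExNet-style classifier: the nine critics' embeddings
    are treated as a 9-step sequence fed to a Bi-LSTM. Dimensions and the
    fusion block are assumptions, not the paper's exact design."""
    def __init__(self, d_obs=512, d_hidden=256, n_classes=10):
        super().__init__()
        self.fuse = nn.Linear(d_obs, d_obs)   # stand-in for the Feature Fusion Block
        self.lstm = nn.LSTM(d_obs, d_hidden, bidirectional=True, batch_first=True)
        self.cls = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, observations):          # (B, 9, d_obs), one row per critic
        x = torch.relu(self.fuse(observations))
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[-2], h[-1]], dim=-1)  # final forward/backward hidden states
        return self.cls(h)
```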
zh
[CV-98] STACT-Time: Spatio-Temporal Cross Attention for Cine Thyroid Ultrasound Time Series Classification
【速读】:该论文旨在解决甲状腺结节细针穿刺活检(FNA)中因良性结节误诊导致的过度活检问题,从而减少患者的不适和焦虑。其解决方案的关键在于提出一种名为STACT-Time的模型,该模型通过整合超声 cine 影像的时空特征与分割掩码特征,利用自注意力和交叉注意力机制捕捉超声动态信息中的丰富时空上下文,提升恶性肿瘤预测的准确性。
链接: https://arxiv.org/abs/2506.18172
作者: Irsyad Adam,Tengyue Zhang,Shrayes Raman,Zhuyu Qiu,Brandon Taraku,Hexiang Feng,Sile Wang,Ashwath Radhachandran,Shreeram Athreya,Vedrana Ivezic,Peipei Ping,Corey Arnold,William Speier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Thyroid cancer is among the most common cancers in the United States. Thyroid nodules are frequently detected through ultrasound (US) imaging, and some require further evaluation via fine-needle aspiration (FNA) biopsy. Despite its effectiveness, FNA often leads to unnecessary biopsies of benign nodules, causing patient discomfort and anxiety. To address this, the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS) has been developed to reduce benign biopsies. However, such systems are limited by interobserver variability. Recent deep learning approaches have sought to improve risk stratification, but they often fail to utilize the rich temporal and spatial context provided by US cine clips, which contain dynamic global information and surrounding structural changes across various views. In this work, we propose the Spatio-Temporal Cross Attention for Cine Thyroid Ultrasound Time Series Classification (STACT-Time) model, a novel representation learning framework that integrates imaging features from US cine clips with features from segmentation masks automatically generated by a pretrained model. By leveraging self-attention and cross-attention mechanisms, our model captures the rich temporal and spatial context of US cine clips while enhancing feature representation through segmentation-guided learning. Our model improves malignancy prediction compared to state-of-the-art models, achieving a cross-validation precision of 0.91 ± 0.02 and an F1 score of 0.89 ± 0.02. By reducing unnecessary biopsies of benign nodules while maintaining high sensitivity for malignancy detection, our model has the potential to enhance clinical decision-making and improve patient outcomes.
zh
[CV-99] CDG-MAE: Learning Correspondences from Diffusion Generated Views
【速读】:该论文试图解决在学习密集对应关系(dense correspondences)过程中因依赖繁琐且不可扩展的人工标注而带来的挑战,特别是在视频标签传播等应用中。其解决方案的关键在于提出一种基于掩码自编码器(MAE)的自监督方法——CDG-MAE,该方法通过图像条件扩散模型生成多样化的合成视图,这些视图在姿态和视角上具有显著变化,从而提供了丰富的训练信号,克服了传统视频数据集和图像裁剪作为锚点的局限性。
链接: https://arxiv.org/abs/2506.18164
作者: Varun Belagali,Pierre Marza,Srikar Yellapragada,Zilinghan Li,Tarak Nath Nandi,Ravi K Madduri,Joel Saltz,Stergios Christodoulidis,Maria Vakalopoulou,Dimitris Samaras
机构: Stony Brook University; MICS, CentraleSupélec, Université Paris-Saclay; Argonne National Laboratory; University of Chicago; Archimedes/Athena RC
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning dense correspondences, critical for application such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge - collecting diverse video datasets is difficult and costly, while simple image crops lack necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to evaluate local and global consistency of generated images, discussing their use for cross-view self-supervised pretraining. Furthermore, we enhance the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of the pretext task. CDG-MAE significantly outperforms state-of-the-art MAE methods reliant only on images and substantially narrows the performance gap to video-based approaches.
zh
[CV-100] Pitfalls of Conformal Predictions for Medical Image Classification
【速读】:该论文试图解决医疗分类任务中可靠不确定性估计的问题,特别是针对基于统计框架的合规预测(conformal predictions)在医学等安全关键领域应用时存在的局限性。解决方案的关键在于揭示合规预测在输入和标签变量分布偏移下的不可靠性,以及其在子集数据(如特定类别或患者属性)上的不稳定性,同时指出在类别数量较少的分类场景中,合规预测的实际应用价值有限。
链接: https://arxiv.org/abs/2506.18162
作者: Hendrik Mehrtens,Tabea Bucher,Titus J. Brinker
机构: German Cancer Research Center (DKFZ)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable uncertainty estimation is one of the major challenges for medical classification tasks. While many approaches have been proposed, recently the statistical framework of conformal predictions has gained a lot of attention, due to its ability to provide provable calibration guarantees. Nonetheless, the application of conformal predictions in safety-critical areas such as medicine comes with pitfalls, limitations and assumptions that practitioners need to be aware of. We demonstrate through examples from dermatology and histopathology that conformal predictions are unreliable under distributional shifts in input and label variables. Additionally, conformal predictions should not be used for selecting predictions to improve accuracy and are not reliable for subsets of the data, such as individual classes or patient attributes. Moreover, in classification settings with a small number of classes, which are common in medical image classification tasks, conformal predictions have limited practical value.
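为便于理解论文讨论的对象,下面给出标准分裂式合规预测(split conformal prediction)构造预测集的最小实现;其覆盖率保证依赖可交换性假设,而这正是摘要指出的分布偏移所破坏的前提。

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Plain split conformal prediction with the 1 - p(y) nonconformity
    score. Coverage guarantees assume exchangeability between calibration
    and test data, which distribution shift breaks."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]       # nonconformity scores
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    q = np.quantile(scores, q_level, method="higher")
    return [np.where(1.0 - p <= q)[0] for p in test_probs]   # one label set per sample
```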
zh
[CV-101] Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation
【速读】:该论文旨在解决GUI代理在复杂和长任务中难以准确理解任务状态以及缺乏有效机制存储关键信息的问题。解决方案的关键在于提出一种名为Chain-of-Memory (CoM)的新方法,通过显式建模短期和长期记忆,结合操作描述、任务相关屏幕信息,并维护专用的记忆模块来存储和管理信息,从而提升GUI代理对任务状态的理解能力和关键历史信息的持久保留。
链接: https://arxiv.org/abs/2506.18158
作者: Xinzge Gao,Chuanrui Hu,Bin Chen,Teng Li
机构: Anhui University (安徽大学); Qihoo360 (奇虎360)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) are attracting growing attention in the development of Graphical User Interface (GUI) agents. Existing approaches often rely on historical screenshots or actions to implicitly represent the task state. This reliance poses challenges for GUI agents in accurately understanding task states and underscores the absence of effective mechanisms to store critical information in complex and lengthy cross-app tasks. To address these challenges, we propose Chain-of-Memory (CoM), a novel approach for explicitly modeling short-term and long-term memory in GUI agents. CoM achieves this by capturing action descriptions, integrating task-relevant screen information, and maintaining a dedicated memory module to store and manage this information. By leveraging explicit memory representations, CoM enables GUI agents to better understand task states and retain critical historical information persistently. To equip GUI agents with memory management capabilities and evaluate the effectiveness of CoM, we developed the GUI Odyssey-CoM, a dataset comprising 111k screen-action pairs annotated with Chain-of-Memory. Experimental results demonstrate that CoM significantly improves GUI agents’ performance in cross-application tasks. Additionally, GUI Odyssey-CoM enables 7B models to achieve memory management capabilities comparable to 72B models. The dataset and code will be open-sourced.
zh
[CV-102] Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry
【速读】:该论文试图解决在非聚焦粒子追踪测速(DPTV)中对分散两相流进行相分离的问题,特别是如何在单相机设置下同时实现示踪粒子和分散相粒子的三维定位。解决方案的关键在于利用粒子图像在失焦状态下的模式差异,这些差异源于示踪粒子与气泡或液滴不同的光散射行为,并通过卷积神经网络(Convolutional Neural Networks, CNNs)进行检测与分类,其中引入了基于生成对抗网络(Generative Adversarial Network, GAN)的框架以生成更贴近实验视觉特性的自动标注数据集。
链接: https://arxiv.org/abs/2506.18157
作者: Christian Sax,Jochen Kriegseis
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Institute of Applied and Numerical Mathematics (应用与数值数学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph); Fluid Dynamics (physics.flu-dyn)
备注:
Abstract:This work investigates the feasibility of a post-processing-based approach for phase separation in defocusing particle tracking velocimetry for dispersed two-phase flows. The method enables simultaneous 3D localization of both tracer particles and particles of the dispersed phase, using a single-camera setup. The distinction between phases is based on pattern differences in defocused particle images, which arise from distinct light scattering behaviors of tracer particles and bubbles or droplets. Convolutional neural networks, including Faster R-CNN and YOLOv4 variants, are trained to detect and classify particle images based on these pattern features. To generate large, labeled training datasets, a generative adversarial network-based framework is introduced, allowing the generation of auto-labeled data that more closely reflects experiment-specific visual appearance. Evaluation across six datasets, comprising synthetic two-phase and real single- and two-phase flows, demonstrates high detection precision and classification accuracy (95-100%), even under domain shifts. The results confirm the viability of using CNNs for robust phase separation in dispersed two-phase DPTV, particularly in scenarios where traditional wavelength-, size-, or ensemble correlation-based methods are impractical.
zh
[CV-103] See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis
【速读】:该论文试图解决医学影像诊断中因疾病模拟正常解剖结构及患者间显著变异而导致的挑战,现有医学视觉语言模型(VLMs)主要关注单图或单序列分析,缺乏显式的对比推理机制,而通用VLM虽具备多图对比推理能力,但缺乏必要的医学领域知识。解决方案的关键在于引入临床启发的对比分析方法,通过利用参考图像并结合临床指导的对比提示,提升诊断准确性,实验表明在监督微调后,该方法在多个医学视觉问答任务中显著优于单图基线。
链接: https://arxiv.org/abs/2506.18140
作者: Ruinan Jin,Gexin Huang,Xinwei Shen,Qiong Zhang,Yan Shuo Tan,Xiaoxiao Li
机构: The University of British Columbia (不列颠哥伦比亚大学); ETH Zurich (苏黎世联邦理工学院); Renmin University of China (中国人民大学); National University of Singapore (新加坡国立大学); Vector Institute (向量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, four figures
Abstract:Medical imaging diagnosis presents inherent challenges due to diseases that mimic normal anatomy and exhibit significant inter-patient variability. Clinicians routinely employ comparative reasoning-using reference images from healthy controls or previous patient examinations-to discern subtle yet diagnostically critical abnormalities. However, existing medical vision-language models (VLMs) focus primarily on single-image or single-series analyses and lack explicit mechanisms for comparative reasoning. Conversely, general-purpose VLMs demonstrate strong multi-image comparative reasoning capabilities but lack essential medical-domain knowledge to identify nuanced clinical differences. This work aims to bridge this gap by exploring clinically-inspired comparative analysis within VLMs, leveraging reference images to enhance diagnostic accuracy. Through extensive empirical analysis, we show that providing general-purpose VLMs with query and normative matched reference images, accompanied by clinically-informed comparative prompts, significantly improves diagnostic outcomes compared to single-image baselines, especially after supervised finetuning (SFT). Our contributions highlight the clinical relevance of comparative analysis, introduce novel strategies for leveraging reference images in VLMs, empirically demonstrate enhanced performance across multiple medical visual question answering (VQA) tasks, and provide theoretical insights into the efficacy of comparative image analysis in medical diagnosis.
zh
[CV-104] Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection MICCAI2025
【速读】:该论文旨在解决结直肠癌筛查中息肉检测模型因数据规模和多样性不足而导致的性能限制问题,特别是现有方法在数据增强中过度关注息肉多样性而忽视了假阳性(false positives)的合成。其解决方案的关键在于提出一种对抗扩散框架,通过引入区域噪声匹配策略和Detector-guided Adversarial Diffusion Attacker (DADA)模块,实现高价值假阳性的合成,从而提升检测器的鲁棒性和临床可靠性。
链接: https://arxiv.org/abs/2506.18134
作者: Quan Zhou,Gan Luo,Qiang Hu,Qingyong Zhang,Jinhua Zhang,Yinjiao Tian,Qiang Li,Zhiwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early Accepted by MICCAI 2025
Abstract:Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector’s decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over the current state-of-the-arts, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Codes are at this https URL.
zh
[CV-105] Enhancing VICReg: Random-Walk Pairing for Improved Generalization and Better Global Semantics Capturing
【速读】:该论文试图解决自监督学习(SSL)方法VICReg在面对未见过的数据时可能存在的泛化能力不足问题,其根源在于模型对训练数据的过度依赖。解决方案的关键在于提出SAG-VICReg(Stable and Generalizable VICReg),该方法通过引入新的训练技术,增强了模型捕捉数据全局语义的能力,并提升了模型的泛化性能。实验表明,SAG-VICReg在保持局部评估指标竞争力的同时,在衡量全局语义理解的指标上表现出更优性能,并提出了一种无需标签即可评估嵌入全局结构的新独立评价指标。
链接: https://arxiv.org/abs/2506.18104
作者: Idan Simai,Ronen Talmon,Uri Shaham
机构: Bar-Ilan University (巴伊兰大学); Technion (技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In this paper, we argue that viewing VICReg, a popular self-supervised learning (SSL) method, through the lens of spectral embedding reveals a potential source of sub-optimality: it may struggle to generalize robustly to unseen data due to overreliance on the training data. This observation invites a closer look at how well this method achieves its goal of producing meaningful representations of images outside of the training set as well. Here, we investigate this issue and introduce SAG-VICReg (Stable and Generalizable VICReg), a method that builds on VICReg by incorporating new training techniques. These enhancements improve the model's ability to capture global semantics within the data and strengthen the generalization capabilities. Experiments demonstrate that SAG-VICReg effectively addresses the generalization challenge while matching or surpassing diverse state-of-the-art SSL baselines. Notably, our method exhibits superior performance on metrics designed to evaluate global semantic understanding, while simultaneously maintaining competitive results on local evaluation metrics. Furthermore, we propose a new standalone evaluation metric for embeddings that complements the standard evaluation methods and accounts for the global data structure without requiring labels, a key issue when tagged data is scarce or not available.
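SAG-VICReg 建立在 VICReg 之上;下面给出标准 VICReg 目标(不变性 + 方差 + 协方差三项,Bardes 等)的最小实现作为背景参考,权重取原论文的常用默认值。

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """The standard VICReg objective that SAG-VICReg builds on:
    invariance + variance + covariance terms (Bardes et al.)."""
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)                       # invariance between views

    def var_term(z):                               # keep per-dim std above 1
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()

    def cov_term(z):                               # decorrelate dimensions
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off = cov - torch.diag(torch.diag(cov))
        return (off ** 2).sum() / d

    return (sim_w * inv + var_w * (var_term(z1) + var_term(z2))
            + cov_w * (cov_term(z1) + cov_term(z2)))
```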
zh
[CV-106] ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
【速读】:该论文旨在解决当前主流图像生成系统(如GPT-4o-Image)因专有性和不可访问性而限制了研究和应用的问题,通过构建一个公开可用的数据集来推动高质量、指令对齐的图像生成技术的发展。其解决方案的关键在于创建ShareGPT-4o-Image数据集,该数据集包含45K文本到图像和46K文本与图像到图像的数据,所有数据均通过GPT-4o生成,用于蒸馏其先进的图像生成能力,进而训练出能够支持两种生成任务的多模态大语言模型Janus-4o。
链接: https://arxiv.org/abs/2506.18095
作者: Junying Chen,Zhenyang Cai,Pengcheng Chen,Shunian Chen,Ke Ji,Xidong Wang,Yunjin Yang,Benyou Wang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized using GPT-4o's image generation capabilities, distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
zh
[CV-107] TEM3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving
【速读】:该论文旨在解决多任务学习(Multi-task Learning, MTL)在辅助驾驶中面临的两个关键问题:单一模态约束导致的场景理解不全面以及低效架构影响实时部署。其解决方案的关键在于提出了一种名为TEM^3-Learning(Time-Efficient Multimodal Multi-task Learning)的框架,该框架通过两阶段架构联合优化驾驶员情绪识别、驾驶员行为识别、交通环境识别和车辆行为识别。核心创新包括基于Mamba的多视角时空特征提取子网络(MTS-Mamba)和基于MTL的门控多模态特征整合器(MGMI),前者通过时序扫描机制和全局-局部空间注意力高效提取特征,后者通过任务特定的多门控模块自适应增强相关模态特征,从而有效缓解多任务学习中的负迁移问题。
链接: https://arxiv.org/abs/2506.18084
作者: Wenzhuo Liu,Yicheng Qiao,Zhen Wang,Qiannan Guo,Zilong Chen,Meihua Zhou,Xinran Li,Letian Wang,Zhiwei Li,Huaping Liu,Wenshuo Wang
机构: Beijing Institute of Technology, Zhuhai(北京理工大学珠海校区); Tsinghua University(清华大学); Yale University(耶鲁大学); University of Toronto(多伦多大学); Beijing University of Chemical Technology(北京化工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the Mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available on this https URL.
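下面是 MGMI 式"任务特定门控多模态融合"的一个最小草图:每个任务用 softmax 门为各模态特征加权后求和;维度与门结构均为本文假设,并非论文的具体实现。

```python
import torch
import torch.nn as nn

class TaskGate(nn.Module):
    """Sketch of one task-specific gate in an MGMI-like integrator: softly
    weight each modality's feature before fusing. Sizes are assumptions."""
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):                       # list of (B, dim), one per modality
        stacked = torch.stack(feats, dim=1)         # (B, M, dim)
        w = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)  # (B, M)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)             # fused (B, dim)
```

每个任务持有一个独立的 TaskGate,使不同任务各自强调最相关的模态,从而缓解摘要所述的负迁移问题。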
zh
[CV-108] MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
【速读】:该论文旨在解决Grounded Video Question Answering(Grounded VideoQA)中模型依赖语言先验和虚假相关性导致的预测结果缺乏视觉证据支撑的问题。其解决方案的关键在于提出MUPA(cooperative MUlti-Path Agentic approach),通过统一视频 grounding、问题回答、答案反思与聚合,构建三种不同的推理路径来处理 grounding 与 QA 代理之间的交互,并引入专门的反思代理对多路径结果进行评估与整合,从而显著提升 grounding 的准确性而不牺牲答案的正确性。
链接: https://arxiv.org/abs/2506.18071
作者: Jisheng Dang,Huilin Song,Junbin Xiao,Bimei Wang,Han Peng,Haoxuan Li,Xun Yang,Meng Wang,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at this https URL.
zh
[CV-109] Training-free Test-time Improvement for Explainable Medical Image Classification MICCAI2025
【速读】:该论文旨在解决概念瓶颈模型(Concept Bottleneck Models, CBMs)在新环境部署时面临的问题,尤其是由于成像协议和染色方法差异导致的概念层面分布偏移,以及仅使用图像级标签微调模型时可能降低概念预测准确性和可信度的问题。其解决方案的关键在于提出一种无需训练的混淆概念识别策略,通过利用少量仅含图像级标签的新数据(如每类4张图像),采用遮蔽误激活的混淆概念和增强欠激活的判别概念两个关键操作,提升模型在域外场景下的性能,同时不牺牲源域的准确性。
链接: https://arxiv.org/abs/2506.18070
作者: Hangzhou He,Jiachen Tang,Lei Zhu,Kaiwen Li,Yanye Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the initial version of our work accepted by MICCAI 2025. We’ll include a link to the version on SpringerLink after this becomes available
Abstract:Deep learning-based medical image classification techniques are rapidly advancing in medical image analysis, making it crucial to develop accurate and trustworthy models that can be efficiently deployed across diverse clinical scenarios. Concept Bottleneck Models (CBMs), which first predict a set of explainable concepts from images and then perform classification based on these concepts, are increasingly being adopted for explainable medical image classification. However, the inherent explainability of CBMs introduces new challenges when deploying trained models to new environments. Variations in imaging protocols and staining methods may induce concept-level shifts, such as alterations in color distribution and scale. Furthermore, since CBM training requires explicit concept annotations, fine-tuning models solely with image-level labels could compromise concept prediction accuracy and faithfulness - a critical limitation given the high cost of acquiring expert-annotated concept labels in medical domains. To address these challenges, we propose a training-free confusion concept identification strategy. By leveraging minimal new data (e.g., 4 images per class) with only image-level labels, our approach enhances out-of-domain performance without sacrificing source domain accuracy through two key operations: masking misactivated confounding concepts and amplifying under-activated discriminative concepts. The efficacy of our method is validated on both skin and white blood cell images. Our code is available at: this https URL.
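论文的两个关键操作(遮蔽误激活的混淆概念、放大欠激活的判别概念)可以直接作用在 CBM 的概念层输出上;下面是一个训练无关的最小示意,放大系数为假设超参,概念下标需由论文的识别策略事先给出。

```python
import torch

def adjust_concepts(concept_acts, confusing_idx, discriminative_idx, amp=1.5):
    """Training-free test-time edit of a CBM's concept layer, mirroring the
    paper's two operations; the amplification factor is an assumed
    hyperparameter, and the index sets come from the identification step."""
    a = concept_acts.clone()                 # (B, n_concepts)
    a[:, confusing_idx] = 0.0                # mask misactivated confounding concepts
    a[:, discriminative_idx] *= amp          # amplify under-activated discriminative ones
    return a                                 # feed to the frozen concept-to-label head
```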
zh
[CV-110] Unfolding the Past: A Comprehensive Deep Learning Approach to Analyzing Incunabula Pages
【速读】:该论文试图解决早期印刷书籍(incunabula)页面结构与内容的自动化分析问题,其核心挑战在于如何准确识别和分类页面中的不同元素,如文本、标题、图片、表格和手写部分。解决方案的关键在于构建一个定制的数据集,并结合YOLO11n模型进行对象检测,以实现高精度的页面元素分类;同时,利用Tesseract OCR对文本区域进行光学字符识别,并通过ResNet18模型对图片类别进行分类,进一步利用CLIP模型生成插图的语义描述,从而全面提升对早期印刷书籍内容的理解与分析能力。
链接: https://arxiv.org/abs/2506.18069
作者: Klaudia Ropel,Krzysztof Kutt,Luiz do Valle Miranda,Grzegorz J. Nalepa
机构: 未知
类目: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures; submitted to TPDL 2025
Abstract:We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations. The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.
zh
[CV-111] Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes
【速读】:该论文旨在解决从二维RGB图像中非破坏性地估计小麦穗三维形态特征(如体积)的问题,该问题由于深度信息丢失、投影失真和实际田间条件下的遮挡而具有挑战性。其解决方案的关键在于提出一种结合自监督视觉Transformer(DINOv2)与单向长短期记忆网络(LSTM)的迁移学习框架,并通过深度监督机制提升模型的中间表征鲁棒性和泛化能力,从而在不同评估序列中实现更准确的体积预测。
链接: https://arxiv.org/abs/2506.18060
作者: Olivia Zumsteg,Nico Graf,Aaron Haeusler,Norbert Kirchgessner,Nicola Storni,Lukas Roth,Andreas Hund
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 13 figures
Abstract:Estimating three-dimensional morphological traits from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes, using RGB image sequences and structured-light 3D scans as ground truth references. Due to the complex geometry of the spikes, we propose a neural network approach for volume estimation in 2D images, employing a transfer learning pipeline that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional Long Short-Term Memory (LSTM) network. By using deep supervision, the model is able to learn more robust intermediate representations, which enhances its generalisation ability across varying evaluation sequences. We benchmark our model against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Our deep supervised model achieves a mean absolute percentage error (MAPE) of 6.46% on six-view indoor images, outperforming the area (9.36%) and geometric (13.98%) baselines. Fine-tuning the model on field-based single-image data enables domain adaptation, yielding a MAPE of 10.82%. We demonstrate that object shape significantly impacts volume prediction accuracy, with irregular geometries such as wheat spikes posing greater challenges for geometric methods compared to our deep learning approach.
zh
[CV-112] CLGRPO: Reasoning Ability Enhancement for Small VLMs
【速读】:该论文旨在解决小型视觉语言模型(Small Vision Language Models, SVLMs)由于参数数量有限而导致的推理能力不足的问题。解决方案的关键在于提出一种称为增量训练策略(Incremental Training Strategy)的后训练优化范式,该策略通过四个阶段逐步提升模型的推理能力,包括注入领域知识、对齐COT数据格式、增强推理能力以及通过ClipLow GRPO(CLGRPO)限制训练过程的捕获空间,从而有效提升SVLMs的性能。
链接: https://arxiv.org/abs/2506.18048
作者: Fanyi Wang,Binzhi Dong,Haotian Hu,Jinjin Xu,Zhiwang Zhang
机构: Honor AI Center(荣耀人工智能中心); Zhejiang University(浙江大学); Bytedance(字节跳动); Ningbo Tech University(宁波工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures
Abstract:Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. Their low cost and power consumption characteristics confer high commercial value. However, their reasoning abilities are limited by the number of parameters. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Firstly, we constructed a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. Our proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) on the pretrained model with the COT data. Stage 2 aligns the COT data format by conducting a small amount of Group Relative Policy Optimization (GRPO) training constrained only by format rewards on the COT data. Stage 3 enhances reasoning ability by applying GRPO training on the COT data with constraints on both format and accuracy rewards. The resulting model shows significant improvement compared to the baseline. Stage 4 addresses the limited capacity of the SVLMs and the weak ability to capture complex patterns by proposing ClipLow GRPO (CLGRPO) to constrain the capture space of the training process. We conducted extensive comparative and ablation experiments on the abstract semantic recognition dataset EMOSet-118K. Experimental results demonstrate that our method significantly improves the reasoning ability of 1B SVLM. Compared to the baseline model fine-tuned on the original data, accuracy increased by 2.77 and recall by 0.69, achieving performance comparable to that of 8B models.
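ClipLow GRPO 的公开细节有限;下面以 PPO 式裁剪比率目标为基础,给出一个"收紧下裁剪界以约束捕获空间"的推测性草图,其中 (eps_low, eps_high) 的非对称取值纯属本文假设。

```python
import torch

def cliplow_grpo_loss(logp_new, logp_old, advantages, eps_low=0.1, eps_high=0.2):
    """Hedged sketch of a ClipLow-style GRPO objective: a PPO-type clipped
    ratio whose lower bound is tightened relative to the upper bound.
    `advantages` would be group-normalized rewards in GRPO."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```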
zh
[CV-113] CmFNet: Cross-modal Fusion Network for Weakly-supervised Segmentation of Medical Images
【速读】:该论文旨在解决医学图像分割中依赖高成本、耗时的密集标注的问题,提出了一种基于弱监督学习的3D跨模态医学图像分割方法CmFNet。其关键解决方案包括三个核心组件:模态特定特征学习网络、跨模态特征学习网络以及混合监督学习策略,通过整合多模态图像的互补信息并结合涂抹监督、模态内正则化和模态间一致性约束,有效提升了分割性能并缓解了过拟合问题。
链接: https://arxiv.org/abs/2506.18042
作者: Dongdong Meng,Sheng Li,Hao Wu,Suqing Tian,Wenjun Ma,Guoping Wang,Xueqing Yan
机构: Peking University(北京大学); School of Physics(物理学院); School of Computer Science(计算机科学学院); Department of Radiotherapy(放射治疗科); Peking University Cancer Hospital(北京大学肿瘤医院); Department of Radiation Oncology(放射肿瘤科); Peking University Third Hospital(北京大学第三医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
Abstract:Accurate automatic medical image segmentation relies on high-quality, dense annotations, which are costly and time-consuming. Weakly supervised learning provides a more efficient alternative by leveraging sparse and coarse annotations instead of dense, precise ones. However, segmentation performance degradation and overfitting caused by sparse annotations remain key challenges. To address these issues, we propose CmFNet, a novel 3D weakly supervised cross-modal medical image segmentation approach. CmFNet consists of three main components: a modality-specific feature learning network, a cross-modal feature learning network, and a hybrid-supervised learning strategy. Specifically, the modality-specific feature learning network and the cross-modal feature learning network effectively integrate complementary information from multi-modal images, enhancing shared features across modalities to improve segmentation performance. Additionally, the hybrid-supervised learning strategy guides segmentation through scribble supervision, intra-modal regularization, and inter-modal consistency, modeling spatial and contextual relationships while promoting feature alignment. Our approach effectively mitigates overfitting, delivering robust segmentation results. It excels in segmenting both challenging small tumor regions and common anatomical structures. Extensive experiments on a clinical cross-modal nasopharyngeal carcinoma (NPC) dataset (including CT and MR imaging) and the publicly available CT Whole Abdominal Organ dataset (WORD) show that our approach outperforms state-of-the-art weakly supervised methods. In addition, our approach also outperforms fully supervised methods when full annotation is used. Our approach can facilitate clinical therapy and benefit various specialists, including physicists, radiologists, pathologists, and oncologists.
zh
[CV-114] Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster MICCAI2025
【速读】:该论文试图解决医学图像分割任务中如何提升模型性能的同时控制参数量的问题。解决方案的关键在于将一个冻结的预训练大型语言模型(Large Language Model, LLM)层嵌入到卷积神经网络(CNN)的编码器-解码器结构中,形成一种简单的混合架构(LLM4Seg)。通过利用LLM的语义感知能力,该方法在不显著增加可训练参数的情况下,提升了不同模态医学图像的分割性能。
链接: https://arxiv.org/abs/2506.18034
作者: Fenghe Tang,Wenxin Ma,Zhiyang He,Xiaodong Tao,Zihang Jiang,S. Kevin Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by MICCAI 2025. Code: this https URL
Abstract:With the advancement of Large Language Model (LLM) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring LLM’s semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.
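LLM4Seg 的核心想法(把展平的 CNN 瓶颈特征送入一个冻结的预训练 LLM 层)可以用下面的草图表达;输入/输出投影是仅有的可训练部分,`llm_layer` 返回元组的约定参照 HF 解码器层,属本文假设。

```python
import torch
import torch.nn as nn

class FrozenLLMBlock(nn.Module):
    """Sketch of the LLM4Seg idea: route flattened CNN features through one
    frozen pre-trained transformer layer; only the in/out projections train."""
    def __init__(self, llm_layer, d_cnn, d_llm):
        super().__init__()
        self.inp = nn.Linear(d_cnn, d_llm)
        self.out = nn.Linear(d_llm, d_cnn)
        self.llm = llm_layer
        for p in self.llm.parameters():
            p.requires_grad = False          # keep the LLM layer frozen

    def forward(self, feat):                 # feat: (B, C, H, W) bottleneck features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)          # (B, HW, C) visual tokens
        tokens = self.out(self.llm(self.inp(tokens))[0])  # assume HF-style tuple output
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```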
zh
[CV-115] MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis MICCAI2025
【速读】:该论文旨在解决全切片图像(Whole Slide Image, WSI)在癌症诊断与预后分析中由于空间异质性带来的挑战,特别是传统多重实例学习(Multiple Instance Learning, MIL)方法难以有效建模分散的组织分布和捕捉跨区域的空间交互问题。其解决方案的关键在于提出一种基于上下文感知聚类的多重实例学习框架(MiCo),通过聚类实例以提炼具有判别性的形态学模式,并利用Cluster Route模块动态链接不同区域的相同组织类型实例,以及通过Cluster Reducer模块消除语义碎片并增强不同语义群体间的信息交换,从而提升跨区域的组织内相关性和组织间语义关联性。
链接: https://arxiv.org/abs/2506.18028
作者: Junjian Li,Hulin Kuang,Jin Liu,Hailin Yue,Mengshen He,Jianxin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025
Abstract:Multiple instance learning (MIL) has shown significant promise in histopathology whole slide image (WSI) analysis for cancer diagnosis and prognosis. However, the inherent spatial heterogeneity of WSIs presents critical challenges, as morphologically similar tissue types are often dispersed across distant anatomical regions. Conventional MIL methods struggle to model these scattered tissue distributions and capture cross-regional spatial interactions effectively. To address these limitations, we propose a novel Multiple instance learning framework with Context-Aware Clustering (MiCo), designed to enhance cross-regional intra-tissue correlations and strengthen inter-tissue semantic associations in WSIs. MiCo begins by clustering instances to distill discriminative morphological patterns, with cluster centroids serving as semantic anchors. To enhance cross-regional intra-tissue correlations, MiCo employs a Cluster Route module, which dynamically links instances of the same tissue type across distant regions via feature similarity. These semantic anchors act as contextual hubs, propagating semantic relationships to refine instance-level representations. To eliminate semantic fragmentation and strengthen inter-tissue semantic associations, MiCo integrates a Cluster Reducer module, which consolidates redundant anchors while enhancing information exchange between distinct semantic groups. Extensive experiments on two challenging tasks across nine large-scale public cancer datasets demonstrate the effectiveness of MiCo, showcasing its superiority over state-of-the-art methods. The code is available at this https URL.
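The anchor idea can be sketched with a toy k-means over instance features: centroids serve as semantic anchors, and instances are softly routed back through them by feature similarity. The cluster count, feature size, and soft-routing formula below are assumptions, not MiCo's actual Cluster Route or Cluster Reducer modules.

```python
import torch
import torch.nn.functional as F

def semantic_anchors(feats, k=8, iters=10):
    """Toy k-means over instance features; centroids act as semantic anchors."""
    centroids = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centroids).argmin(dim=1)  # nearest anchor
        for j in range(k):
            members = feats[assign == j]
            if members.numel() > 0:
                centroids[j] = members.mean(dim=0)
    return centroids

feats = torch.randn(1000, 128)            # instance features from one WSI
anchors = semantic_anchors(feats)
# "Cluster Route" in spirit: affinity to anchors links same-type instances
# across distant regions, and anchor context is propagated back.
sim = F.cosine_similarity(feats.unsqueeze(1), anchors.unsqueeze(0), dim=-1)
routed = feats + sim.softmax(dim=1) @ anchors   # anchor-refined representations
```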
zh
[CV-116] On the Robustness of Human-Object Interaction Detection against Distribution Shift
【速读】:该论文试图解决人类-物体交互(Human-Object Interaction, HOI)检测模型在面对分布偏移(distribution shifts)时鲁棒性不足的问题,这限制了其在实际场景中的应用。解决方案的关键在于提出两种简单且通用的改进方法:(1)结合mixup的跨域数据增强策略,以提升模型对不同分布数据的适应能力;(2)利用冻结视觉基础模型的特征融合策略,以增强特征表示的稳定性与泛化能力。这两种方法均可直接集成到多种HOI检测方法中,实验结果表明其显著提升了模型的鲁棒性,并在标准基准上也表现出优势。
链接: https://arxiv.org/abs/2506.18021
作者: Chi Xie,Shuang Liang,Jie Li,Feng Zhu,Rui Zhao,Yichen Wei,Shengjie Zhao
机构: Tongji University (同济大学); Sensetime Research (商汤科技); Shukun Technology (舒坤科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Human-Object Interaction (HOI) detection has seen substantial advances in recent years. However, existing works focus on the standard setting with ideal images and natural distribution, far from practical scenarios with inevitable distribution shifts. This hampers the practical applicability of HOI detection. In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts. We start by proposing a novel automated approach to create the first robustness evaluation benchmark for HOI detection. Subsequently, we evaluate more than 40 existing HOI detection models on this benchmark, showing their insufficiency, analyzing the features of different frameworks, and discussing how the robustness in HOI is different from other tasks. With the insights from such analyses, we propose to improve the robustness of HOI detection methods through: (1) a cross-domain data augmentation integrated with mixup, and (2) a feature fusion strategy with frozen vision foundation models. Both are simple, plug-and-play, and applicable to various methods. Our experimental results demonstrate that the proposed approach significantly increases the robustness of various methods, with benefits on standard benchmarks, too. The dataset and code will be released.
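A generic sketch of the first proposed improvement, cross-domain mixup: blend a source-distribution batch with a shifted-domain batch and interpolate the losses by the same coefficient. The Beta parameter and pairing strategy are assumptions, not the paper's exact recipe.

```python
import torch

def cross_domain_mixup(x_src, x_shifted, alpha=0.2):
    """Blend a source-domain image batch with a shifted-domain batch.
    The training loss is mixed the same way:
    lam * loss(pred, y_src) + (1 - lam) * loss(pred, y_shifted)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x_src + (1.0 - lam) * x_shifted, lam

x_mix, lam = cross_domain_mixup(torch.rand(2, 3, 224, 224),
                                torch.rand(2, 3, 224, 224))
```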
zh
[CV-117] Auto-Regressive Surface Cutting
【速读】:该论文旨在解决表面切割(surface cutting)任务中现有方法生成的纹理图集(atlas)过于碎片化且缺乏语义一致性的问题。其解决方案的关键在于将表面切割建模为一个下一步标记预测任务,通过在网格顶点和边上的采样点云进行编码,并利用类似GPT的Transformer模型依次预测具有量化3D坐标的切割缝(seam segments)。
链接: https://arxiv.org/abs/2506.18017
作者: Yang Li,Victor Cheung,Xinhai Liu,Yuguang Chen,Zhongjin Luo,Biwen Lei,Haohan Weng,Zibo Zhao,Jingwei Huang,Zhuo Chen,Chunchao Guo
机构: Tencent Hunyuan(腾讯混元); SYSU(中山大学); CUHKSZ(香港中文大学深圳校区); SCUT(华南理工大学); ShanghaiTech(上海科技大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Tech. report. this https URL
Abstract:Surface cutting is a fundamental task in computer graphics, with applications in UV parameterization, texture mapping, and mesh decomposition. However, existing methods often produce technically valid but overly fragmented atlases that lack semantic coherence. We introduce SeamGPT, an auto-regressive model that generates cutting seams by mimicking professional workflows. Our key technical innovation lies in formulating surface cutting as a next token prediction task: sample point clouds on mesh vertices and edges, encode them as shape conditions, and employ a GPT-style transformer to sequentially predict seam segments with quantized 3D coordinates. Our approach achieves exceptional performance on UV unwrapping benchmarks containing both manifold and non-manifold meshes, including artist-created and 3D-scanned models. In addition, it enhances existing 3D segmentation tools by providing clean boundaries for part decomposition.
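A small sketch of the tokenization step the abstract describes: continuous 3D seam coordinates are normalized and binned into discrete tokens that a GPT-style model can predict one at a time. The bin count and per-axis normalization are assumptions.

```python
import numpy as np

def quantize_coords(points, n_bins=1024):
    """Normalize seam points to [0, 1] and bin each coordinate into a
    discrete token, yielding one flat token stream per seam sequence."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    norm = (points - lo) / np.maximum(hi - lo, 1e-8)
    tokens = np.clip((norm * n_bins).astype(np.int64), 0, n_bins - 1)
    return tokens.reshape(-1), lo, hi

def dequantize_coords(tokens, lo, hi, n_bins=1024):
    """Invert the binning (up to quantization error) for decoding."""
    norm = (tokens.reshape(-1, 3).astype(np.float64) + 0.5) / n_bins
    return lo + norm * (hi - lo)

pts = np.random.rand(20, 3) * 2.0 - 1.0          # a toy seam polyline
toks, lo, hi = quantize_coords(pts)
recon = dequantize_coords(toks, lo, hi)          # close to pts
```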
zh
[CV-118] OSDMamba: Enhancing Oil Spill Detection from Remote Sensing Images Using Selective State Space Model
【速读】:该论文旨在解决遥感图像中油污检测(Oil Spill Detection, OSD)面临的标注样本有限、类别不平衡以及卷积神经网络(CNN)在检测小面积油污时因感受野受限和全局上下文信息捕捉能力不足而导致的检测精度下降问题。其解决方案的关键在于引入状态空间模型(State-Space Models, SSMs),特别是Mamba架构,利用其选择性扫描机制有效扩展模型的感受野并保留关键细节,同时设计了一个包含ConvSSM和深度监督的非对称解码器以增强多尺度特征融合,从而提升模型对少数类样本的敏感性。
链接: https://arxiv.org/abs/2506.18006
作者: Shuaiyu Chen,Fu Wang,Peng Ren,Chunbo Luo,Zeyu Fu
机构: University of Exeter(埃克塞特大学); China University of Petroleum (East China)(中国石油大学(华东))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation is commonly used for Oil Spill Detection (OSD) in remote sensing images. However, the limited availability of labelled oil spill samples and class imbalance present significant challenges that can reduce detection accuracy. Furthermore, most existing methods, which rely on convolutional neural networks (CNNs), struggle to detect small oil spill areas due to their limited receptive fields and inability to effectively capture global contextual information. This study explores the potential of State-Space Models (SSMs), particularly Mamba, to overcome these limitations, building on their recent success in vision applications. We propose OSDMamba, the first Mamba-based architecture specifically designed for oil spill detection. OSDMamba leverages Mamba’s selective scanning mechanism to effectively expand the model’s receptive field while preserving critical details. Moreover, we designed an asymmetric decoder incorporating ConvSSM and deep supervision to strengthen multi-scale feature fusion, thereby enhancing the model’s sensitivity to minority class samples. Experimental results show that the proposed OSDMamba achieves state-of-the-art performance, yielding improvements of 8.9% and 11.8% in OSD across two publicly available datasets.
zh
[CV-119] Fast Neural Inverse Kinematics on Human Body Motions
【速读】:该论文旨在解决无标记运动捕捉(markerless motion capture)在实时场景中因计算需求高和推理速度慢而导致的应用限制问题。其解决方案的关键在于提出一种快速且可靠的神经逆运动学框架,该框架能够从3D关键点实时捕捉人体运动,通过优化网络架构、训练方法和推理流程实现高效的运动重建。
链接: https://arxiv.org/abs/2506.17996
作者: David Tolpin,Sefy Kagarlitsky
机构: Yoom(优姆)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work in progress
Abstract:Markerless motion capture enables the tracking of human motion without requiring physical markers or suits, offering increased flexibility and reduced costs compared to traditional systems. However, these advantages often come at the expense of higher computational demands and slower inference, limiting their applicability in real-time scenarios. In this technical report, we present a fast and reliable neural inverse kinematics framework designed for real-time capture of human body motions from 3D keypoints. We describe the network architecture, training methodology, and inference procedure in detail. Our framework is evaluated both qualitatively and quantitatively, and we support key design decisions through ablation studies.
zh
[CV-120] Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models
【速读】:该论文试图解决合成数据在医学影像隐私保护数据共享中的法律合规性不足以及合成数据性能低于真实数据的问题。其解决方案的关键在于通过最大化图像多样性来提升合成数据的模式覆盖率,从而增强下游任务性能,同时将多样性视为一种防止自然人被单独识别的隐私保护机制,进而构建符合predicate singling-out (PSO) 安全要求的合成数据集。
链接: https://arxiv.org/abs/2506.17975
作者: Mischa Dombrowski,Bernhard Kainz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations: First and foremost, the legal aspect of sharing synthetic data is often neglected and data regulations, such as the GDPR, are largely ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximising image diversity instead of fidelity solely to improve the mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalisable framework for training diffusion models on personal data which leads to non-personal synthetic datasets achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at this https URL.
zh
[CV-121] BPCLIP: A Bottom-up Image Quality Assessment from Distortion to Semantics Based on CLIP ICME2025
【速读】:该论文旨在解决传统图像质量评估(Image Quality Assessment, IQA)方法在融合多尺度特征时依赖简单线性融合,难以充分捕捉失真对语义内容影响的问题。其解决方案的关键在于提出一种基于对比语言-图像预训练模型(CLIP)的自底向上的图像质量评估方法(BPCLIP),通过引入多尺度交叉注意力模块,逐步提取低级失真对高级语义的影响,并结合40个图像质量形容词增强图像质量感知与人类语言之间的联系,从而提升评估性能和鲁棒性。
链接: https://arxiv.org/abs/2506.17969
作者: Chenyue Song,Chen Hui,Wei Zhang,Haiqi Zhu,Shaohui Liu,Hong Huang,Feng Jiang
机构: Harbin Institute of Technology(哈尔滨工业大学); Nanjing University of Information Science and Technology(南京信息工程大学); Sichuan University of Science & Engineering(四川理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICME 2025
Abstract:Image Quality Assessment (IQA) aims to evaluate the perceptual quality of images based on human subjective perception. Existing methods generally combine multiscale features to achieve high performance, but most rely on straightforward linear fusion of these features, which may not adequately capture the impact of distortions on semantic content. To address this, we propose a bottom-up image quality assessment approach based on the Contrastive Language-Image Pre-training (CLIP, a recently proposed model that aligns images and text in a shared feature space), named BPCLIP, which progressively extracts the impact of low-level distortions on high-level semantics. Specifically, we utilize an encoder to extract multiscale features from the input image and introduce a bottom-up multiscale cross attention module designed to capture the relationships between shallow and deep features. In addition, by incorporating 40 image quality adjectives across six distinct dimensions, we enable the pre-trained CLIP text encoder to generate representations of the intrinsic quality of the image, thereby strengthening the connection between image quality perception and human language. Our method achieves superior results on most public Full-Reference (FR) and No-Reference (NR) IQA benchmarks, while demonstrating greater robustness.
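The adjective-prompt idea can be sketched with off-the-shelf CLIP: score an image against antonym quality prompts and softmax within each pair, in the spirit of CLIP-IQA-style prompting. The four prompts below stand in for the paper's 40 adjectives across six dimensions; this is not BPCLIP's full multiscale cross-attention pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Antonym prompt pairs stand in for the paper's 40 quality adjectives.
prompts = ["a sharp photo", "a blurry photo",
           "a clean photo", "a noisy photo"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("test.png").convert("RGB")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image.squeeze(0)   # (4,)

# Softmax within each antonym pair -> one quality score per dimension.
scores = logits.view(-1, 2).softmax(dim=-1)[:, 0]
print(dict(zip(["sharpness", "cleanliness"], scores.tolist())))
```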
zh
[CV-122] h-calibration: Rethinking Classifier Recalibration with Probabilistic Error-Bounded Objective
【速读】:该论文试图解决深度神经网络在概率输出上的校准问题(calibration),即模型预测的概率与实际类别分布不一致,导致结果不可靠。为了解决这一问题,研究提出了一种名为h-calibration的概率学习框架,其关键在于理论上构建了一个具有有界性的标准校准等价学习公式,并设计了一个简单而有效的后校准算法。该方法不仅克服了以往方法的十大常见局限,还在实验中表现出优于传统方法的性能。
链接: https://arxiv.org/abs/2506.17968
作者: Wenjian Huang,Guiping Cao,Jiahao Xia,Jingkun Chen,Hao Wang,Jianguo Zhang
机构: Southern University of Science and Technology (南方科技大学); University of Technology Sydney (悉尼科技大学); University of Oxford (牛津大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR); Machine Learning (stat.ML)
备注:
Abstract:Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trained models. In this study, we summarize and categorize previous works into three general strategies: intuitively designed methods, binning-based methods, and methods based on formulations of ideal calibration. Through theoretical and practical analysis, we highlight ten common limitations in previous approaches. To address these limitations, we propose a probabilistic learning framework for calibration called h-calibration, which theoretically constructs an equivalent learning formulation for canonical calibration with boundedness. On this basis, we design a simple yet effective post-hoc calibration algorithm. Our method not only overcomes the ten identified limitations but also achieves markedly better performance than traditional methods, as validated by extensive experiments. We further analyze, both theoretically and experimentally, the relationship and advantages of our learning objective compared to traditional proper scoring rule. In summary, our probabilistic framework derives an approximately equivalent differentiable objective for learning error-bounded calibrated probabilities, elucidating the correspondence and convergence properties of computational statistics with respect to theoretical bounds in canonical calibration. The theoretical effectiveness is verified on standard post-hoc calibration benchmarks by achieving state-of-the-art performance. This research offers valuable reference for learning reliable likelihood in related fields.
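For contrast with the proposed objective, here is the classic post-hoc recalibration baseline, temperature scaling, which fits a single scalar T on held-out logits by minimizing NLL; h-calibration itself optimizes a different, error-bounded objective.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, iters=200, lr=0.01):
    """Fit a single scalar T on held-out logits by minimizing NLL;
    calibrated probabilities are softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

val_logits = torch.randn(512, 10)                 # held-out model logits
val_labels = torch.randint(0, 10, (512,))
T = fit_temperature(val_logits, val_labels)
probs = F.softmax(val_logits / T, dim=-1)         # calibrated probabilities
```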
zh
[CV-123] Adapting Vision-Language Models for Evaluating World Models
【速读】:该论文旨在解决世界模型(world models)在模拟环境中的滚动(rollout)评估问题,特别是如何实现对动作对齐性和语义一致性的细粒度、时间敏感的评估。现有评估指标无法有效捕捉这些能力,而视觉-语言模型(VLMs)虽具备多模态推理能力,但在此类任务中的应用仍受限。论文提出的解决方案关键在于设计并实现UNIVERSE,一种在数据和计算约束下适应VLM进行滚动评估的方法,通过统一的评估协议和参数高效微调策略,实现了与任务特定基线相当的性能,并通过人类研究验证了其与人类判断的高度一致性。
链接: https://arxiv.org/abs/2506.17967
作者: Mariya Hendriksen,Tabish Rashid,David Bignell,Raluca Georgescu,Abdelhak Lemkhenter,Katja Hofmann,Sam Devlin,Sarah Parisot
机构: Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models – generative models that simulate environment dynamics conditioned on past observations and actions – are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency – capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks – action recognition and character recognition – each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a method for adapting VLMs to rollout evaluation under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint. Human studies confirm strong alignment with human judgments, establishing UNIVERSE as a scalable, semantics-aware evaluator for world models.
zh
[CV-124] LLM -Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation
【速读】:该论文试图解决跨领域序列推荐(Cross-Domain Sequential Recommendation, CDSR)中用户兴趣建模与多领域偏好捕捉的问题,旨在通过融合多模态数据提升推荐性能。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)增强的多模态融合方法(LLM-Enhanced Multimodal Fusion, LLM-EMF),该方法利用冻结的CLIP模型生成图像和文本嵌入,结合多注意力机制联合学习单领域和跨领域偏好,从而有效捕捉用户在多个领域中的复杂兴趣。
链接: https://arxiv.org/abs/2506.17966
作者: Wangyu Wu,Zhenhong Chen,Xianglin Qiu,Siqi Song,Xiaowei Huang,Fei Ma,Jimin Xiao
机构: Xi’an Jiaotong-Liverpool University (西安交通大学-利物浦大学); University of Liverpool (利物浦大学); Microsoft (微软)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2504.15085
Abstract:Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, focusing on modeling cross-domain preferences and capturing both intra- and inter-sequence item relationships. We propose LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation (LLM-EMF), a novel and advanced approach that enhances textual information with Large Language Models (LLM) knowledge and significantly improves recommendation performance through the fusion of visual and textual data. Using the frozen CLIP model, we generate image and text embeddings, thereby enriching item representations with multimodal data. A multiple attention mechanism jointly learns both single-domain and cross-domain preferences, effectively capturing and understanding complex user interests across diverse domains. Evaluations conducted on four e-commerce datasets demonstrate that LLM-EMF consistently outperforms existing methods in modeling cross-domain user preferences, thereby highlighting the effectiveness of multimodal data integration and its advantages in enhancing sequential recommendation systems. Our source code will be released.
zh
[CV-125] ELMAR: Enhancing LiDAR Detection with 4D Radar Motion Awareness and Cross-modal Uncertainty IROS2025
【速读】:该论文旨在解决多模态传感器(LiDAR与4D雷达)在融合过程中存在的模态间不对齐问题,以提升自主驾驶和机器人感知系统的性能。其解决方案的关键在于引入基于4D雷达运动状态的增强机制以及跨模态不确定性估计,通过动态运动感知编码模块提取4D雷达中的目标运动信息,并利用实例级边界框不确定性估计来缓解模态间对齐问题,从而优化最终的LiDAR检测结果。
链接: https://arxiv.org/abs/2506.17958
作者: Xiangyuan Peng,Miao Tang,Huawei Sun,Bierzynski Kay,Lorenzo Servadei,Robert Wille
机构: Infineon Technologies AG (英飞凌科技); Technical University of Munich (慕尼黑工业大学); China University of Geosciences (中国地质大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages. Accepted by IROS2025
Abstract:LiDAR and 4D radar are widely used in autonomous driving and robotics. While LiDAR provides rich spatial information, 4D radar offers velocity measurement and remains robust under adverse conditions. As a result, increasing studies have focused on the 4D radar-LiDAR fusion method to enhance the perception. However, the misalignment between different modalities is often overlooked. To address this challenge and leverage the strengths of both modalities, we propose a LiDAR detection framework enhanced by 4D radar motion status and cross-modal uncertainty. The object movement information from 4D radar is first captured using a Dynamic Motion-Aware Encoding module during feature extraction to enhance 4D radar predictions. Subsequently, the instance-wise uncertainties of bounding boxes are estimated to mitigate the cross-modal misalignment and refine the final LiDAR predictions. Extensive experiments on the View-of-Delft (VoD) dataset highlight the effectiveness of our method, achieving state-of-the-art performance with the mAP of 74.89% in the entire area and 88.70% within the driving corridor while maintaining a real-time inference speed of 30.02 FPS.
zh
[CV-126] Mobile Image Analysis Application for Mantoux Skin Test
【速读】:该论文试图解决传统结核菌素皮肤试验(TST)方法中存在的随访率低、患者不适以及人工主观判断导致的误诊和治疗延迟问题。其解决方案的关键在于开发一款移动应用,利用标尺贴纸作为参考物体进行硬结测量,而非依赖3D重建技术,并结合增强现实核心(ARCore)和深度学习算法(如DeepLabv3)实现图像分割与精准测量,同时采用边缘检测算法提升准确性,从而提高诊断的可靠性和效率。
链接: https://arxiv.org/abs/2506.17954
作者: Liong Gele,Tan Chye Cheah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a newly developed mobile application designed to diagnose Latent Tuberculosis Infection (LTBI) using the Mantoux Skin Test (TST). Traditional TST methods often suffer from low follow-up return rates, patient discomfort, and subjective manual interpretation, particularly with the ball-point pen method, leading to misdiagnosis and delayed treatment. Moreover, unlike previously developed mobile applications that used 3D reconstruction, this app utilizes scaling stickers as reference objects for induration measurement. This mobile application integrates advanced image processing technologies, including ARCore, and machine learning algorithms such as DeepLabv3 for robust image segmentation and precise measurement of skin indurations indicative of LTBI. The system employs an edge detection algorithm to enhance accuracy. The application was evaluated against standard clinical practices, demonstrating significant improvements in accuracy and reliability. This innovation is crucial for effective tuberculosis management, especially in resource-limited regions. By automating and standardizing TST evaluations, the application enhances the accessibility and efficiency of TB diagnostics. Future work will focus on refining machine learning models, optimizing measurement algorithms, expanding functionalities to include comprehensive patient data management, and enhancing ARCore's performance across various lighting conditions and operational settings.
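A sketch of the measurement step under stated assumptions: given a DeepLabv3-style binary induration mask and the pixel width of a scaling sticker of known physical size, the transverse diameter can be converted to millimetres. The 10 mm sticker width and the rotated-bounding-box measurement are illustrative choices, not the app's exact algorithm.

```python
import cv2
import numpy as np

def induration_diameter_mm(mask, sticker_px_width, sticker_mm_width=10.0):
    """Convert a binary induration mask to millimetres, using a scaling
    sticker of known physical width as the pixel-to-mm reference."""
    mm_per_px = sticker_mm_width / sticker_px_width
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    c = max(contours, key=cv2.contourArea)       # largest induration region
    (_, _), (w, h), _ = cv2.minAreaRect(c)       # rotated bounding box
    return max(w, h) * mm_per_px                 # transverse diameter in mm

mask = np.zeros((480, 640), dtype=np.uint8)
cv2.circle(mask, (320, 240), 40, 1, thickness=-1)   # fake 80 px induration
print(induration_diameter_mm(mask, sticker_px_width=100.0))  # ~8.0 mm
```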
zh
[CV-127] Classification of Tents in Street Bazaars Using CNN
【速读】:该论文旨在解决在街头集市中对帐篷进行自动分类的问题,这一任务对于市场组织具有重要意义,但传统的人工方法效率低下。研究提出了一种改进的深度学习模型,并通过对比自定义卷积神经网络(CNN)与EfficientNetB0模型的性能来验证其有效性。解决方案的关键在于利用迁移学习技术,通过在扩展的数据集上进行训练,该数据集包含126张原始照片并经过增强生成额外图像,从而显著提高了分类准确性和泛化能力。实验结果表明,EfficientNetB0在准确率上达到了98.4%,优于自定义CNN的92.8%。
链接: https://arxiv.org/abs/2506.17946
作者: Azamat Ibragimov,Ruslan Isaev,Remudin Reshid Mekuria,Gulnaz Gimaletdinova,Dim Shaiakhmetov
机构: Ala-Too International university (阿拉-托国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This research paper proposes an improved deep learning model for classifying tents in street bazaars, comparing a custom Convolutional Neural Network (CNN) with EfficientNetB0. Tent classification is a critical task for market organization, but manual methods have historically been inefficient. Street bazaars represent a vital economic hub in many regions, yet their unstructured nature poses significant challenges for the automated classification of market infrastructure, such as tents. In Kyrgyzstan, more than a quarter of the country's GDP is derived from bazaars. While CNNs have been widely applied to object recognition, their application to bazaar-specific tasks remains underexplored. Here, we build upon our original approach by training on an extended set of 126 original photographs that were augmented to generate additional images. This dataset is publicly available for download on Kaggle. A variety of performance metrics, such as accuracy, precision, recall, F1 score, and mean average precision (mAP), were used to assess the models comparatively, providing a more extensive analysis of classification performance. The results show that the custom CNN model achieved 92.8% accuracy and EfficientNetB0 achieved 98.4%, confirming the effectiveness of transfer learning in bazaar image classification. Analysis of the confusion matrices also reveals the strengths and weaknesses of each model. These findings suggest that using a pre-trained model such as EfficientNetB0 significantly improves classification accuracy and generalization.
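A minimal Keras sketch of the transfer-learning setup the comparison uses: a frozen ImageNet-pretrained EfficientNetB0 backbone with a small classification head. The number of tent categories is an assumed placeholder.

```python
import tensorflow as tf

num_classes = 4   # assumed number of tent categories in the dataset

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                        # freeze ImageNet features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```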
zh
[CV-128] SegChange-R1:Augmented Reasoning for Remote Sensing Change Detection via Large Language Models
【速读】:该论文旨在解决遥感变化检测中因多时相数据模态不对齐导致的检测精度不足问题,以及传统方法在处理复杂场景下变化区域分割效率较低的问题。其解决方案的关键在于提出一种基于大语言模型(LLM)增强的推理方法(SegChange-R1),通过融合文本描述信息引导模型更精准地分割变化区域,同时设计了一个基于线性注意力的空间变换模块(BEV),通过将不同时相特征统一到鸟瞰图(BEV)空间来解决模态对齐问题。
链接: https://arxiv.org/abs/2506.17944
作者: Fei Zhou
机构: Neusoft Institute Guangdong(东软学院广东); Airace Technology Co.,Ltd.(Airace科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing change detection is widely used in a variety of fields such as urban planning, terrain and geomorphology analysis, and environmental monitoring, mainly by analyzing the significant change differences of features (e.g., building changes) in the same spatial region at different time phases. In this paper, we propose a large language model (LLM) augmented inference approach (SegChange-R1), which enhances the detection capability by integrating textual descriptive information and aims at guiding the model to segment the change regions of greater interest, thus accelerating the convergence speed. Moreover, we design a spatial transformation module (BEV) based on linear attention, which solves the problem of modal misalignment in change detection by unifying features from different temporal perspectives onto the BEV space. In addition, we construct the first dataset for building change detection from UAV viewpoints (DVCD), and our experiments on four widely-used change detection datasets show a significant improvement over existing methods. The code and pre-trained models are available at this https URL.
zh
[CV-129] GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning
【速读】:该论文旨在解决医学视觉问答任务中模型生成答案的可靠性不足和可解释性差的问题,这些问题限制了临床医生和患者对模型输出的信任与理解。其解决方案的关键在于提出一个名为Thinking with Visual Grounding (ThinkVG)的数据集,该数据集将答案生成过程分解为明确关联医学图像中相关视觉区域的中间推理步骤,从而实现细粒度的可解释性,并引入一种可验证的奖励机制以增强强化学习中的推理过程与最终答案的一致性。
链接: https://arxiv.org/abs/2506.17939
作者: Bo Liu,Xiangyu Zhao,Along He,Yidi Chen,Huazhu Fu,Xiao-Ming Wu
机构: The Hong Kong Polytechnic University (香港理工大学); Shenzhen University (深圳大学); Sichuan University (四川大学); Agency for Science, Technology and Research (科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Work in Progress
Abstract:Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model’s reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at this https URL.
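One plausible shape for a verifiable reward of this kind, combining exact-match answer correctness with the IoU of the grounded region; the weights, IoU formulation, and string matching below are assumptions, not the paper's specification.

```python
def verifiable_reward(pred_answer, gold_answer, pred_box, gold_box,
                      w_ans=0.7, w_ground=0.3):
    """Reward = weighted sum of exact-match answer correctness and the
    IoU between predicted and reference grounded regions (xyxy boxes)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    ans = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return w_ans * ans + w_ground * iou(pred_box, gold_box)

print(verifiable_reward("pneumonia", "Pneumonia",
                        (40, 40, 120, 120), (50, 50, 130, 130)))
```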
zh
[CV-130] IDAL: Improved Domain Adaptive Learning for Natural Images Dataset ICPR’24
【速读】:该论文旨在解决自然图像的无监督域适应(Unsupervised Domain Adaptation, UDA)问题,特别是在存在输入空间域偏移的情况下,如何有效提升表示空间中的域对齐。其解决方案的关键在于两个方面:一是采用ResNet的深度结构与特征金字塔网络(Feature Pyramid Network, FPN)的有效尺度分离机制,以同时处理内容和风格特征;二是设计了一种结合新型损失函数与精心选择的现有损失函数的组合策略,以应对自然图像中固有的挑战,如尺度变化、噪声和风格偏移,从而提高模型在目标域上的准确性和鲁棒性,并加速训练收敛。
链接: https://arxiv.org/abs/2506.17931
作者: Ravi Kant Gupta,Shounak Das,Amit Sethi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted in ICPR’24 (International Conference on Pattern Recognition)
Abstract:We present a novel approach for unsupervised domain adaptation (UDA) for natural images. A commonly-used objective for UDA schemes is to enhance domain alignment in representation space even if there is a domain shift in the input space. Existing adversarial domain adaptation methods may not effectively align different domains of multimodal distributions associated with classification problems. Our approach has two main features. Firstly, its neural architecture uses the deep structure of ResNet and the effective separation of scales of feature pyramidal network (FPN) to work with both content and style features. Secondly, it uses a combination of a novel loss function and judiciously selected existing loss functions to train the network architecture. This tailored combination is designed to address challenges inherent to natural images, such as scale, noise, and style shifts, that occur on top of a multi-modal (multi-class) distribution. The combined loss function not only enhances model accuracy and robustness on the target domain but also speeds up training convergence. Our proposed UDA scheme generalizes better than the state of the art for CNN-based methods on the Office-Home, Office-31, and VisDA-2017 datasets, and performs comparably on the DomainNet dataset.
zh
[CV-131] PlanMoGPT : Flow-Enhanced Progressive Planning for Text to Motion Synthesis
【速读】:该论文旨在解决文本到动作生成任务中基于大语言模型(Large Language Models, LLMs)的方法与非LLM方法之间存在的性能差距问题。研究指出,运动分词的粒度是导致这一差距的关键瓶颈:细粒度分词引发局部依赖问题,使LLMs过度关注短期连贯性而牺牲全局语义对齐,而粗粒度分词则损失了动作细节。解决方案的关键在于提出PlanMoGPT框架,该框架结合了渐进式规划和增强流的细粒度运动分词机制,通过分层生成运动标记并逐步细化,以及提升分词器的下采样分辨率和代码本规模,从而有效减少离散化过程中的细节损失,并通过增强流解码器恢复动作细节,显著提升了生成效果。
链接: https://arxiv.org/abs/2506.17912
作者: Chuhao Jin,Haosen Li,Bingzi Zhang,Che Liu,Xiting Wang,Ruihua Song,Wenbing Huang,Ying Qin,Fuzheng Zhang,Di Zhang
机构: Renmin University of China (中国人民大学); Kuaishou (快手)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 14 pages, 7 figures
Abstract:Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs’ autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
zh
[CV-132] Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis
【速读】:该论文试图解决传统2D相机和现有3D相机在复杂大环境中的可靠性不足问题,这些设备难以有效支持交互系统的稳定运行。其解决方案的关键是提出一种基于3D立体视觉的处理流程(3D stereo vision based pipeline),通过鲁棒的场景理解能力,实现对普通和敏感应用的支持,并探索多3D相机融合以完成全场景重建,从而执行事件识别、目标跟踪和通知等任务。
链接: https://arxiv.org/abs/2506.17910
作者: Mohamed Benkedadra,Matei Mancas,Sidi Ahmed Mahmoudi
机构: University of Mons(蒙斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:2D cameras are often used in interactive systems. Other systems like gaming consoles provide more powerful 3D cameras for short-range depth sensing. Overall, these cameras are not reliable in large, complex environments. In this work, we propose a 3D stereo vision based pipeline for interactive systems that is able to handle both ordinary and sensitive applications through robust scene understanding. We explore the fusion of multiple 3D cameras to perform full scene reconstruction, which allows for performing a wide range of tasks, like event recognition, subject tracking, and notification. Using possible feedback approaches, the system can receive data from the subjects present in the environment, to learn to make better decisions, or to adapt to completely new environments. Throughout the paper, we introduce the pipeline and explain our preliminary experimentation and results. Finally, we draw the roadmap for the next steps that need to be taken in order to get this pipeline into production.
zh
[CV-133] Cause-Effect Driven Optimization for Robust Medical Visual Question Answering with Language Biases IJCAI2025
【速读】:该论文旨在解决医学视觉问答(Med-VQA)模型中存在的语言偏差问题,即问题类型与答案类别之间的虚假相关性。解决方案的关键在于提出一种名为CEDO的因果驱动优化框架,其核心是整合三种机制:模态驱动的异构优化(MHO)、梯度引导的模态协同(GMS)和分布适应的损失重缩放(DLR),从因果和效应两个层面全面缓解语言偏差。
链接: https://arxiv.org/abs/2506.17903
作者: Huanjia Zhu,Yishu Liu,Xiaozhao Fang,Guangming Lu,Bingzhi Chen
机构: Beijing Institute of Technology, Zhuhai (北京理工大学珠海校区); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI 2025
Abstract:Existing Medical Visual Question Answering (Med-VQA) models often suffer from language biases, where spurious correlations between question types and answer categories are inadvertently established. To address these issues, we propose a novel Cause-Effect Driven Optimization framework called CEDO, that incorporates three well-established mechanisms, i.e., Modality-driven Heterogeneous Optimization (MHO), Gradient-guided Modality Synergy (GMS), and Distribution-adapted Loss Rescaling (DLR), for comprehensively mitigating language biases from both causal and effectual perspectives. Specifically, MHO employs adaptive learning rates for specific modalities to achieve heterogeneous optimization, thus enhancing robust reasoning capabilities. Additionally, GMS leverages the Pareto optimization method to foster synergistic interactions between modalities and enforce gradient orthogonality to eliminate bias updates, thereby mitigating language biases from the effect side, i.e., shortcut bias. Furthermore, DLR is designed to assign adaptive weights to individual losses to ensure balanced learning across all answer categories, effectively alleviating language biases from the cause side, i.e., imbalance biases within datasets. Extensive experiments on multiple traditional and bias-sensitive benchmarks consistently demonstrate the robustness of CEDO over state-of-the-art competitors.
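MHO's core idea, heterogeneous optimization via modality-specific learning rates, maps naturally onto optimizer parameter groups. The toy model and the fixed rates below are assumptions; the paper adapts the rates dynamically.

```python
import torch
import torch.nn as nn

# Hypothetical Med-VQA model with separate modality branches.
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(512, 256),
    "text_encoder":   nn.Linear(300, 256),
    "fusion_head":    nn.Linear(512, 10),
})

# Heterogeneous, modality-specific learning rates (illustrative values;
# MHO adjusts these adaptively during training).
optimizer = torch.optim.AdamW([
    {"params": model["vision_encoder"].parameters(), "lr": 1e-5},
    {"params": model["text_encoder"].parameters(),   "lr": 5e-5},
    {"params": model["fusion_head"].parameters(),    "lr": 1e-4},
])
```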
zh
[CV-134] PostAlign: Multimodal Grounding as a Corrective Lens for MLLM s
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-语言任务中过度依赖虚假相关性的问题,这主要是由于语言先验导致模型未能充分利用实际的视觉信息。其解决方案的关键在于提出MMGrounded-PostAlign框架,该框架通过引入多模态接地模块实现视觉接地和文本接地,确保输出结果基于视觉和文本证据;同时,在视觉接地模块中引入负样本拒绝机制以区分受语言偏见影响的不存在对象,而在文本接地部分则采用选择性推理机制,根据查询复杂度调整模型的推理策略,从而提升视觉理解能力并抑制幻觉现象。
链接: https://arxiv.org/abs/2506.17901
作者: Yixuan Wu,Yang Zhang,Jian Wu,Philip Torr,Jindong Gu
机构: University of Oxford (牛津大学); Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering. However, they often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information. To address these issues, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs. Our framework incorporates a multimodal grounding module for both visual grounding, which identifies the referred object in the image, and textual grounding, which generates the rationale for the final answer, ensuring that outputs are anchored in both visual and textual evidence. To mitigate the hallucinations, we introduce a negative rejection mechanism in the visual grounding module to distinguish grounded entities from non-existent objects influenced by linguistic biases. On the textual grounding side, we propose a selective reasoning mechanism that adjusts the model’s reasoning strategy based on query complexity. Extensive evaluations are conducted on benchmarks such as POPE, HaloQuest, VQAv2, MME, and MMBench showing significant improvements in fine-grained visual understanding and hallucination suppression.
zh
[CV-135] EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations
【速读】:该论文旨在解决从第三人称视角(exocentric vision)到第一人称视角(egocentric vision)的转换问题,特别是在缺乏2D线索、同步多视角设置和现实假设的情况下,如何生成高质量的第一人称视觉内容。解决方案的关键在于提出EgoWorld框架,该框架通过结合丰富的第三人称观测数据(包括投影点云、3D手部姿态和文本描述),首先从估计的第三人称深度图中重建点云,再将其重新投影到第一人称视角,并利用基于扩散的修复技术生成密集且语义一致的第一人称图像。
链接: https://arxiv.org/abs/2506.17896
作者: Junho Park,Andrew Sangwoo Ye,Taein Kwon
机构: AI Lab, LG Electronics (AI 实验室,LG 电子); KAIST (韩国科学技术院); Visual Geometry Group, University of Oxford (视觉几何小组,牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as necessity of initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on the H2O and TACO datasets, EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, EgoWorld shows promising results even on unlabeled real-world examples.
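The geometric core of stage one can be sketched as back-projection plus reprojection: lift the estimated exocentric depth map to 3D, transform the points into the egocentric camera, and project to pixels. Here the relative pose and intrinsics are assumed given, whereas EgoWorld avoids relying on known relative camera poses at inference.

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def reproject(points, T_exo_to_ego, K_ego):
    """Transform exocentric points into the egocentric camera and
    project them to pixel coordinates."""
    p = (T_exo_to_ego[:3, :3] @ points.T + T_exo_to_ego[:3, 3:4]).T
    uv = (K_ego @ p.T).T
    return uv[:, :2] / uv[:, 2:3], p[:, 2]   # pixels + egocentric depth
```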
zh
[CV-136] BeltCrack: the First Sequential-image Industrial Conveyor Belt Crack Detection Dataset and Its Baseline with Triple-domain Feature Learning
【速读】:该论文试图解决工业传送带(conveyor belt)裂纹智能检测的问题,特别是针对现有数据集主要集中在道路场景或合成数据,缺乏真实工业环境下的传送带裂纹数据这一问题。解决方案的关键在于构建了首个基于真实工厂场景的序列图像传送带裂纹检测数据集(BeltCrack14ks和BeltCrack9kd),并提出了一种基于三域(时间-空间-频率)特征分层融合学习的基准方法,以验证数据集的有效性与实用性。
链接: https://arxiv.org/abs/2506.17892
作者: Jianghong Huang,Luping Ji,Xin Ma,Mao Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 32 pages, 10 figures
Abstract:Conveyor belts are a category of important equipment in modern industry, widely applied in production and manufacturing fields. Their health status is critical to operational efficiency and safety. Among the factors affecting belt health, cracks are often one of the most threatening risks. Consequently, how to intelligently detect belt cracks is attracting increasing attention. To implement intelligent detection with machine learning, real crack samples are believed to be necessary. However, existing crack datasets primarily focus on pavement scenarios or synthetic data; there are no real-world industrial belt crack datasets at all. To propel machine learning advancement in this field, this paper constructs the first sequential-image belt crack detection datasets (BeltCrack14ks and BeltCrack9kd), collected from real-world factory scenes. Furthermore, to validate usability and effectiveness, we propose a special baseline method with triple-domain (i.e., time-space-frequency) feature hierarchical fusion learning for the two brand-new datasets. Experimental results demonstrate the usability and effectiveness of our datasets. They also show that our baseline is clearly superior to other similar detection methods. Our datasets and source codes are available at this https URL.
zh
[CV-137] Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation CVPR2025
【速读】:该论文旨在解决点云实例分割中对场景特征内部关系以及查询特征之间关系建模不足的问题(3D instance segmentation)。现有基于Transformer的方法主要依赖于掩码注意力机制来建模场景特征与查询特征之间的外部关系,但缺乏对场景特征内部关系及查询特征间关系的有效建模。论文提出的解决方案关键在于引入自适应超点聚合模块和对比学习引导的超点精修模块,以更好地表征超点特征,并通过对比学习引导这些特征的更新;同时,关系感知的自注意力机制通过融合位置和几何关系,增强了查询间关系建模的能力。
链接: https://arxiv.org/abs/2506.17891
作者: Jiahao Lu,Jiacheng Deng
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025. Code: this https URL
Abstract:3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose \textbfRelation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation-aware self-attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self-attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.
zh
[CV-138] Cloud-Aware SAR Fusion for Enhanced Optical Sensing in Space Missions
【速读】:该论文旨在解决云层对光学卫星影像可用性的显著影响问题,这一问题在环境监测、灾害响应和土地利用分析等关键应用中尤为突出。其解决方案的关键在于提出一种基于SAR-光学特征融合与深度学习图像重建的云感知重建框架,通过注意力驱动的特征融合机制,将合成孔径雷达(Synthetic Aperture Radar, SAR)的结构信息与光学数据的光谱特性进行对齐,并采用云感知的模型更新策略,引入自适应损失加权以提升云遮挡区域的重建精度。
链接: https://arxiv.org/abs/2506.17885
作者: Trong-An Bui,Thanh-Thoai Le
机构: National Taipei University of Technology (国立台北科技大学); Ho Chi Minh City University of Education (胡志明市师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Cloud contamination significantly impairs the usability of optical satellite imagery, affecting critical applications such as environmental monitoring, disaster response, and land-use analysis. This research presents a Cloud-Attentive Reconstruction Framework that integrates SAR-optical feature fusion with deep learning-based image reconstruction to generate cloud-free optical imagery. The proposed framework employs an attention-driven feature fusion mechanism to align complementary structural information from Synthetic Aperture Radar (SAR) with spectral characteristics from optical data. Furthermore, a cloud-aware model update strategy introduces adaptive loss weighting to prioritize cloud-occluded regions, enhancing reconstruction accuracy. Experimental results demonstrate that the proposed method outperforms existing approaches, achieving a PSNR of 31.01 dB, SSIM of 0.918, and MAE of 0.017. These outcomes highlight the framework’s effectiveness in producing high-fidelity, spatially and spectrally consistent cloud-free optical images.
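A minimal sketch of the cloud-aware loss weighting described above: pixels under the cloud mask are up-weighted so the reconstruction prioritizes occluded regions. The weight value and the L1 choice are assumptions, not the paper's exact adaptive scheme.

```python
import torch

def cloud_aware_l1(pred, target, cloud_mask, w_cloud=2.0):
    """L1 reconstruction loss with cloud-occluded pixels up-weighted,
    so optimization prioritizes the regions that must be inpainted."""
    weight = torch.where(cloud_mask.bool(),
                         torch.full_like(pred, w_cloud),
                         torch.ones_like(pred))
    return (weight * (pred - target).abs()).mean()

pred = torch.rand(1, 3, 64, 64)              # reconstructed optical image
target = torch.rand(1, 3, 64, 64)            # cloud-free reference
mask = torch.zeros(1, 1, 64, 64); mask[..., 20:40, 20:40] = 1  # cloud region
loss = cloud_aware_l1(pred, target, mask)
```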
zh
[CV-139] StainPIDR: A Pathological Image Decouplingand Reconstruction Method for StainNormalization Based on Color VectorQuantization and Structure Restaining
【速读】:该论文试图解决病理图像颜色变异导致计算机辅助诊断系统性能下降的问题(color-variant pathological images)。其解决方案的关键在于提出一种称为StainPIDR的染色归一化方法,通过将图像解耦为结构特征和向量量化颜色特征,利用目标颜色特征对结构特征进行重染,并通过交叉注意力机制高效完成重染过程,同时设计了一种模板图像选择算法以提升归一化效果。
链接: https://arxiv.org/abs/2506.17879
作者: Zheng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The color appearance of a pathological image is highly related to the imaging protocols, the proportion of different dyes, and the scanning devices. Computer-aided diagnostic systems may deteriorate when facing these color-variant pathological images. In this work, we propose a stain normalization method called StainPIDR. We try to eliminate this color discrepancy by decoupling the image into structure features and vector-quantized color features, restaining the structure features with the target color features, and decoding the stained structure features to normalized pathological images. We assume that color features decoupled by different images with the same color should be exactly the same. Under this assumption, we train a fixed color vector codebook to which the decoupled color features will map. In the restaining part, we utilize the cross-attention mechanism to efficiently stain the structure features. As the target color (decoupled from a selected template image) will also affect the performance of stain normalization, we further design a template image selection algorithm to select a template from a given dataset. In our extensive experiments, we validate the effectiveness of StainPIDR and the template image selection algorithm. All the results show that our method can perform well in the stain normalization task. The code of StainPIDR will be publicly available later.
zh
[CV-140] SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
【速读】:该论文旨在解决现有视频大语言模型(Vid-LLMs)在精细粒度手术视频理解任务中的不足,这一问题对于分析手术过程中的具体步骤或细节至关重要。解决方案的关键在于提出SurgVidLM,这是首个针对全面和精细粒度手术视频理解设计的视频语言模型,其核心创新包括构建SVU-31K数据集以支持整体理解和细节分析,以及引入StageFocus机制实现多粒度、渐进式的视频理解,并通过Multi-frequency Fusion Attention有效融合低频和高频视觉特征,确保关键信息的保留。
链接: https://arxiv.org/abs/2506.17873
作者: Guankun Wang,Wenjin Mo,Junyi Wang,Long Bai,Kun Yuan,Ming Hu,Jinlin Wu,Junjun He,Yiming Huang,Nicolas Padoy,Zhen Lei,Hongbin Liu,Nassir Navab,Hongliang Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, facilitating users to understand surgical scenes and procedures. Beyond image-based methods, the exploration of Video Large Language Models (Vid-LLMs) has emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of Vid-LLMs specialized for fine-grained surgical video understanding tasks, which is crucial for analyzing specific processes or details within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K dataset which consists of over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Furthermore, we introduce the StageFocus mechanism which is a two-stage framework performing the multi-grained, progressive understanding of surgical videos. We also develop the Multi-frequency Fusion Attention to effectively integrate low and high-frequency visual tokens, ensuring the retention of critical information. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing complex procedural contexts.
zh
[CV-141] Decoding Federated Learning: The FedNAM Conformal Revolution
【速读】:该论文旨在解决联邦学习中缺乏综合性解决方案的问题,该方案需同时具备不确定性量化、可解释性和鲁棒性。其关键解决方案是提出FedNAM+,一个集成神经加法模型(Neural Additive Models, NAMs)与新型保真预测方法的联邦学习框架,以实现可解释且可靠的不确定性估计。该方法通过动态层级调整技术,利用基于梯度的敏感度图来识别影响预测的关键输入特征,从而提供像素级的不确定性估计,并相较于传统方法如LIME和SHAP,提供了更具置信度的可视化预测可靠性分析。
链接: https://arxiv.org/abs/2506.17872
作者: Sree Bhargavi Balija,Amitash Nanda,Debashis Sahoo
机构: UC San Diego, ECE (加州大学圣地亚哥分校,电子与计算机工程系); UC San Diego, CSE (加州大学圣地亚哥分校,计算机科学与工程系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Federated learning has significantly advanced distributed training of machine learning models across decentralized data sources. However, existing frameworks often lack comprehensive solutions that combine uncertainty quantification, interpretability, and robustness. To address this, we propose FedNAM+, a federated learning framework that integrates Neural Additive Models (NAMs) with a novel conformal prediction method to enable interpretable and reliable uncertainty estimation. Our method introduces a dynamic level adjustment technique that utilizes gradient-based sensitivity maps to identify key input features influencing predictions. This facilitates both interpretability and pixel-wise uncertainty estimates. Unlike traditional interpretability methods such as LIME and SHAP, which do not provide confidence intervals, FedNAM+ offers visual insights into prediction reliability. We validate our approach through experiments on CT scan, MNIST, and CIFAR datasets, demonstrating high prediction accuracy with minimal loss (e.g., only 0.1% on MNIST), along with transparent uncertainty measures. Visual analysis highlights variable uncertainty intervals, revealing low-confidence regions where model performance can be improved with additional data. Compared to Monte Carlo Dropout, FedNAM+ delivers efficient and global uncertainty estimates with reduced computational overhead, making it particularly suitable for federated learning scenarios. Overall, FedNAM+ provides a robust, interpretable, and computationally efficient framework that enhances trust and transparency in decentralized predictive modeling.
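For orientation, here is a standard split conformal prediction recipe, which FedNAM+'s dynamic level adjustment builds upon; the nonconformity score and coverage level are the textbook defaults, not the paper's adaptive variant.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with score 1 - p(true class):
    returns, for each test sample, the set of classes retained at
    (1 - alpha) marginal coverage."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    return test_probs >= 1.0 - q      # boolean (n_test, n_classes) mask
```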
zh
[CV-142] Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation
【速读】:该论文旨在解决在野外环境中,基于RGB和热成像数据的语义分割任务中,多源数据处理带来的计算开销过大问题,从而限制了资源受限系统的应用。其解决方案的关键在于提出了一种高效的RGB-热成像语义分割架构CM-SSM,该架构利用跨模态状态空间建模(cross-modal state space modeling, CM-SSM),通过两个核心组件实现高效计算:一是跨模态二维选择扫描模块(CM-SS2D),用于建立RGB与热成像模态间的状态空间模型;二是跨模态状态空间关联模块(CM-SSA),用于融合全局关联信息与局部空间特征。相较于基于Transformer的方法,CM-SSM在图像分辨率上具有线性计算复杂度,从而在保持高性能的同时显著降低了计算成本。
链接: https://arxiv.org/abs/2506.17869
作者: Xiaodong Guo,Zi’ang Lin,Luwen Hu,Zhihong Deng,Tong Liu,Wujie Zhou
机构: Beijing Institute of Technology (北京理工大学); Zhejiang University of Science and Technology (浙江科技大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduced CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduced a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we developed a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at this https URL.
zh
[CV-143] Fetuses Made Simple: Modeling and Tracking of Fetal Shape and Pose
【速读】:该论文旨在解决胎儿MRI分析中传统方法在处理胎儿身体运动和形态时的局限性,即基于解剖学关键点的方法可能忽略完整身体形状的重要细节,而基于体部分割的方法则因胎儿大范围非局部运动而复杂化时间序列分析。解决方案的关键是构建一个基于Skinned Multi-Person Linear Model (SMPL)的3D可动统计胎儿身体模型,通过迭代估计图像空间中的身体姿态和规范姿态空间中的身体形状,从而提高对MRI运动伪影和强度失真的鲁棒性,并减少因胎儿姿势困难导致的表面观测不完整的影响。
链接: https://arxiv.org/abs/2506.17858
作者: Yingcheng Liu,Peiqi Wang,Sebastian Diaz,Esra Abaci Turk,Benjamin Billot,Patricia Ellen Grant,Polina Golland
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Analyzing fetal body motion and shape is paramount in prenatal diagnostics and monitoring. Existing methods for fetal MRI analysis mainly rely on anatomical keypoints or volumetric body segmentations. Keypoints simplify body structure to facilitate motion analysis, but may ignore important details of full-body shape. Body segmentations capture complete shape information but complicate temporal analysis due to large non-local fetal movements. To address these limitations, we construct a 3D articulated statistical fetal body model based on the Skinned Multi-Person Linear Model (SMPL). Our algorithm iteratively estimates body pose in the image space and body shape in the canonical pose space. This approach improves robustness to MRI motion artifacts and intensity distortions, and reduces the impact of incomplete surface observations due to challenging fetal poses. We train our model on segmentations and keypoints derived from 19,816 MRI volumes across 53 subjects. Our model captures body shape and motion across time series and provides intuitive visualization. Furthermore, it enables automated anthropometric measurements traditionally difficult to obtain from segmentations and keypoints. When tested on unseen fetal body shapes, our method yields a surface alignment error of 3.2 mm for 3 mm MRI voxel size. To our knowledge, this represents the first 3D articulated statistical fetal body model, paving the way for enhanced fetal motion and shape analysis in prenatal diagnostics. The code is available at this https URL .
zh
[CV-144] Robust Foreground-Background Separation for Severely-Degraded Videos Using Convolutional Sparse Representation Modeling
【速读】:该论文试图解决在不利条件下获取的视频(如硬件、环境和供电限制)中,如何准确分离前景与背景成分的问题。现有方法存在两个局限性:仅能捕获数据特定特征或通用特征,且未包含针对多种噪声类型的显式模型以在分离过程中去除噪声。解决方案的关键在于提出一种基于卷积稀疏表示(CSR)的前景模型,该模型能够自适应地捕捉成像数据中分散的特定空间结构,并将前景-背景分离建模为一个结合CSR、通用特征捕捉函数以及多种噪声类型显式表征函数的约束多凸优化问题,从而在低帧率和多种噪声环境下实现更精确的分离。
链接: https://arxiv.org/abs/2506.17838
作者: Kazuki Naganuma,Shunsuke Ono
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Submitted to IEEE Transactions on Image Processing. The code is available at this https URL
Abstract:This paper proposes a foreground-background separation (FBS) method with a novel foreground model based on convolutional sparse representation (CSR). In order to analyze the dynamic and static components of videos acquired under undesirable conditions, such as hardware, environmental, and power limitations, it is essential to establish an FBS method that can handle videos with low frame rates and various types of noise. Existing FBS methods have two limitations that prevent us from accurately separating foreground and background components from such degraded videos. First, they only capture either data-specific or general features of the components. Second, they do not include explicit models for various types of noise to remove them in the FBS process. To this end, we propose a robust FBS method with a CSR-based foreground model. This model can adaptively capture specific spatial structures scattered in imaging data. Then, we formulate FBS as a constrained multiconvex optimization problem that incorporates CSR, functions that capture general features, and explicit noise characterization functions for multiple types of noise. Thanks to these functions, our method captures both data-specific and general features to accurately separate the components from various types of noise even under low frame rates. To obtain a solution of the optimization problem, we develop an algorithm that alternately solves its two convex subproblems by newly established algorithms. Experiments demonstrate the superiority of our method over existing methods using two types of degraded videos: infrared and microscope videos.
zh
[CV-145] Time-Contrastive Pretraining for In-Context Image and Video Segmentation
【速读】:该论文旨在解决传统基于网格的上下文学习(In-context Learning, ICL)方法在视觉应用中灵活性不足的问题,这些方法受限于上下文图像的数量和分辨率。其解决方案的关键在于引入Temporal,一个时间对比自监督目标,用于预训练提示检索器以支持视觉ICL,并将ICL重新定义为视频对象分割(Video Object Segmentation, VOS)任务。通过这种方式,该方法能够在保持上下文图像全分辨率的同时,支持可变数量的上下文图像。
链接: https://arxiv.org/abs/2506.17837
作者: Assefa Wahd,Jacob Jaremko,Abhilash Hareendranathan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In-context learning (ICL) enables generalization to new tasks with minimal labeled data. However, mainstream ICL approaches rely on a gridding strategy, which lacks the flexibility required for vision applications. We introduce Temporal, a time-contrastive self-supervised objective that pretrains a prompt retriever for visual ICL, and formulate ICL as a video object segmentation (VOS) task. Temporal addresses key limitations of grid-based methods that restrict the number and resolution of context images. By reframing ICL as a VOS problem, our approach supports a variable number of context images while preserving their full resolution. To address the challenge of selecting optimal context sets for queries, we pretrain a prompt retriever on videos via self-supervised learning, where adjacent frames serve as positives and distant frames as negatives. For image segmentation, the prompt retriever selects relevant sequences that, when combined with the query, form coherent videos for VOS processing. For video segmentation, it identifies keyframes, predicts their masks using our ICL pipeline, and propagates them throughout the sequence. When evaluated on MICCAI FLARE 2022, our method achieves substantial improvements over baselines: 90.95% Dice score for image segmentation (10.64% improvement) and 92.45% Dice for video segmentation (14.88% improvement).
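The time-contrastive objective described above is, at its core, an InfoNCE loss in which a temporally adjacent frame is the positive and distant frames act as negatives. The sketch below illustrates that generic recipe; the encoder, sampling scheme, and temperature are assumptions for illustration, not the authors' implementation.

```python
# Generic time-contrastive InfoNCE loss over frame embeddings.
import torch
import torch.nn.functional as F

def time_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """anchor/positive: (B, D); negatives: (B, K, D) from distant frames."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)  # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

B, K, D = 8, 16, 128
loss = time_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
print(loss.item())
```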
zh
[CV-146] Incorporating Rather Than Eliminating: Achieving Fairness for Skin Disease Diagnosis Through Group-Specific Expert
【速读】:该论文试图解决基于人工智能的皮肤疾病诊断系统在不同人口统计群体中存在偏差的问题,这种偏差导致了医疗结果的不平等和患者信任度下降。现有偏差缓解方法通常通过消除敏感属性与诊断预测之间的相关性来实现公平性,但这种方法常常因丢失临床相关的诊断线索而降低系统性能。论文提出的解决方案的关键在于引入FairMoE框架,该框架采用逐层专家混合模块作为群体特定的学习器,通过动态路由数据到最合适的专家,而非传统方法中基于群体标签的刚性分配,从而在保持公平性指标的同时显著提升诊断准确性。
链接: https://arxiv.org/abs/2506.17787
作者: Gelei Xu,Yuying Duan,Zheyuan Liu,Xueyang Li,Meng Jiang,Michael Lemmon,Wei Jin,Yiyu Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures
Abstract:AI-based systems have achieved high accuracy in skin disease diagnostics but often exhibit biases across demographic groups, leading to inequitable healthcare outcomes and diminished patient trust. Most existing bias mitigation methods attempt to eliminate the correlation between sensitive attributes and diagnostic prediction, but those methods often degrade performance due to the loss of clinically relevant diagnostic cues. In this work, we propose an alternative approach that incorporates sensitive attributes to achieve fairness. We introduce FairMoE, a framework that employs layer-wise mixture-of-experts modules to serve as group-specific learners. Unlike traditional methods that rigidly assign data based on group labels, FairMoE dynamically routes data to the most suitable expert, making it particularly effective for handling cases near group boundaries. Experimental results show that, unlike previous fairness approaches that reduce performance, FairMoE achieves substantial accuracy improvements while preserving comparable fairness metrics.
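The dynamic routing idea can be pictured with a generic soft mixture-of-experts layer, where a learned gate mixes expert outputs per sample instead of assigning samples by group label. Expert count and gate design below are illustrative assumptions, not the FairMoE code.

```python
# A generic soft mixture-of-experts layer with learned per-sample routing.
import torch
import torch.nn as nn

class SoftMoELayer(nn.Module):
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                       # x: (B, dim)
        weights = self.gate(x).softmax(dim=-1)  # (B, n_experts) soft routing
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)

layer = SoftMoELayer(dim=32)
print(layer(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```

Samples near group boundaries naturally receive mixed gate weights, which is the behavior the abstract highlights over hard label-based assignment.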
zh
[CV-147] Collaborative Texture Filtering
【速读】:该论文试图解决在使用纹理压缩技术时,由于无法利用GPU的纹理单元进行解压和过滤而导致的高计算成本问题,以及由此引发的放大时视觉效果不佳、噪声和闪烁等问题。其解决方案的关键在于利用GPU波通信(wave communication)原语,在执行着色器内部实现相邻像素之间的解码纹理值共享,从而避免在过滤前重复解压纹理元素,进而减少计算开销并提升图像质量。通过将任务分配到不同的波道(lane),可以在足够大的放大因子下实现每像素仅需一次纹理元素评估的零误差过滤,而对于其他情况则提出了新的过滤回退方法,进一步提升了整体性能与视觉质量。
链接: https://arxiv.org/abs/2506.17770
作者: Tomas Akenine-Möller,Pontus Ebelin,Matt Pharr,Bartlomiej Wronski
机构: NVIDIA(英伟达)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM/EG Symposium on High Performance Graphics (HPG), 2025
Abstract:Recent advances in texture compression provide major improvements in compression ratios, but cannot use the GPU's texture units for decompression and filtering. This has led to the development of stochastic texture filtering (STF) techniques to avoid the high cost of multiple texel evaluations with such formats. Unfortunately, those methods can give undesirable visual appearance changes under magnification and may contain visible noise and flicker despite the use of spatiotemporal denoisers. Recent work substantially improves the quality of magnification filtering with STF by sharing decoded texel values between nearby pixels (Wronski 2025). Using GPU wave communication intrinsics, this sharing can be performed inside actively executing shaders without memory traffic overhead. We take this idea further and present novel algorithms that use wave communication between lanes to avoid repeated texel decompression prior to filtering. By distributing unique work across lanes, we can achieve zero-error filtering using ≤1 texel evaluations per pixel given a sufficiently large magnification factor. For the remaining cases, we propose novel filtering fallback methods that also achieve higher quality than prior approaches.
zh
[CV-148] LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中因光谱数据高维性、波段间冗余显著以及标注样本有限所带来的挑战。其解决方案的关键在于提出一种轻量级的光谱视觉变压器模型——LoLA-SpecViT,该模型通过参数高效的架构设计,结合3D卷积光谱前端与基于局部窗口的自注意力机制,提升了光谱特征提取和空间一致性,同时降低了计算复杂度。此外,模型引入了低秩适应(LoRA)技术,使在标签稀缺条件下仍能实现高效微调,从而增强了模型的可扩展性和适应性。
链接: https://arxiv.org/abs/2506.17759
作者: Fadi Abdeladhim Zidi,Djamel Eddine Boukhari,Abdellah Zakaria Sellam,Abdelkrim Ouafi,Cosimo Distante,Salah Eddine Bekhouche,Abdelmalik Taleb-Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalisation. Extensive experiments on three benchmark datasets (WHU-Hi LongKou, WHU-Hi HongHu, and Salinas) demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available in the following GitHub repository: this https URL.
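The LoRA component follows the standard low-rank adaptation recipe: a frozen base weight plus a trainable low-rank update. A minimal PyTorch sketch of that generic recipe is shown below; the rank and scaling are illustrative defaults rather than the paper's settings.

```python
# Standard LoRA wrapper around a frozen linear layer: y = Wx + (alpha/r) * B A x.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

lora = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(trainable)  # only the two low-rank factors are trainable
```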
zh
[CV-149] PhysID: Physics-based Interactive Dynamics from a Single-view Image ICASSP
【速读】:该论文试图解决将静态图像转化为交互式体验的问题,这一任务在计算机视觉中仍具挑战性,尤其在移动用户交互和AR/VR应用中具有重要价值。其解决方案的关键在于提出PhysID,该方法通过利用大规模生成模型进行3D网格生成和物理属性预测,从单视角图像中简化物理驱动的交互动态创建过程,从而显著降低对工程密集型任务如3D建模和内在属性校准的专业要求,并实现高效的设备端内存消耗与实时物理真实性渲染。
链接: https://arxiv.org/abs/2506.17746
作者: Sourabh Vasant Gothe,Ayon Chattopadhyay,Gunturi Venkata Sai Phani Kiran,Pratik,Vibhav Agarwal,Jayesh Rajkumar Vachhani,Sourav Ghosh,Parameswaranath VM,Barath Raj KR
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Project page: this https URL
Abstract:Transforming static images into interactive experiences remains a challenging task in computer vision. Tackling this challenge holds the potential to elevate mobile user experiences, notably through interactive and AR/VR applications. Current approaches aim to achieve this either by using pre-recorded video responses or by requiring multi-view images as input. In this paper, we present PhysID, which streamlines the creation of physics-based interactive dynamics from a single-view image by leveraging large generative models for 3D mesh generation and physical property prediction. This significantly reduces the expertise required for engineering-intensive tasks like 3D modeling and intrinsic property calibration, enabling the process to be scaled with minimal manual intervention. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions. PhysID represents a leap forward in mobile-based interactive dynamics, offering real-time, non-deterministic interactions and user-personalization with efficient on-device memory consumption. Experiments evaluate the zero-shot capabilities of various Multimodal Large Language Models (MLLMs) on diverse tasks and the performance of 3D reconstruction models. These results demonstrate the cohesive functioning of all modules within the end-to-end framework, contributing to its effectiveness.
zh
[CV-150] YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception
【速读】:该论文旨在解决YOLO系列模型在复杂场景下检测性能受限的问题,具体表现为传统卷积结构和基于区域的自注意力机制仅能捕获局部信息聚合和成对相关性,缺乏对全局多对多高阶相关性的建模能力。解决方案的关键在于提出一种基于超图的自适应相关性增强机制(HyperACE),该机制通过超图计算自适应地利用潜在的高阶相关性,实现高效的全局跨位置和跨尺度特征融合与增强,并进一步提出全流水线聚合与分发范式(FullPAD),以实现网络内细粒度的信息流与表征协同。
链接: https://arxiv.org/abs/2506.17733
作者: Mengqi Lei,Siqi Li,Yihong Wu,Han Hu,You Zhou,Xinhu Zheng,Guiguang Ding,Shaoyi Du,Zongze Wu,Yue Gao
机构: Tsinghua University (清华大学); Taiyuan University of Technology (太原理工大学); Beijing Institute of Technology (北京理工大学); Shenzhen University (深圳大学); The Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N. The code and models of our YOLOv13 model are available at: this https URL.
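The high-order correlation modeling at the heart of HyperACE can be pictured with a textbook hypergraph convolution step: node features are gathered into hyperedges and scattered back, so every node exchanges information with all nodes sharing an edge. The incidence matrix and degree normalization below are generic choices, not YOLOv13's actual module.

```python
# A toy hypergraph message-passing step for many-to-many correlations.
import torch

def hypergraph_conv(X, H):
    """X: (N, D) node features; H: (N, E) incidence matrix (1 if node i is in edge e)."""
    De = H.sum(dim=0).clamp(min=1)             # hyperedge degrees
    Dv = H.sum(dim=1).clamp(min=1)             # node degrees
    edge_feat = (H.T @ X) / De.unsqueeze(-1)   # gather member nodes into each hyperedge
    return (H @ edge_feat) / Dv.unsqueeze(-1)  # scatter edge messages back to nodes

X = torch.randn(6, 8)
H = torch.tensor([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]], dtype=torch.float)
print(hypergraph_conv(X, H).shape)  # torch.Size([6, 8])
```

Unlike pairwise attention, a single hyperedge here couples an arbitrary group of nodes at once, which is the "multi-to-multi high-order correlation" the abstract contrasts with pairwise modeling.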
zh
[CV-151] PDC-Net: Pattern Divide-and-Conquer Network for Pelvic Radiation Injury Segmentation MICCAI2025
【速读】:该论文旨在解决从磁共振成像(MRI)中准确分割盆腔放射性损伤(Pelvic Radiation Injury, PRI)的问题,以实现更精确的预后评估和个性化治疗方案的制定。现有方法在自动化分割中面临挑战,主要由于器官形态复杂以及上下文信息混淆等因素。其解决方案的关键在于提出一种新型的模式分治网络(Pattern Divide-and-Conquer Network, PDC-Net),通过不同网络模块对局部和全局模式进行“分割”,并在解码阶段通过灵活的特征选择“征服”感兴趣区域(Regions of Interest, ROI)。该网络包含多方向聚合(Multi-Direction Aggregation, MDA)模块和记忆引导上下文(Memory-Guided Context, MGC)模块,分别用于增强形状拟合能力和区分正负类全局模式,同时采用自适应融合解码器(Adaptive Fusion Decoder, AFD)动态选择特征以生成最终分割结果。
链接: https://arxiv.org/abs/2506.17712
作者: Xinyu Xiong,Wuteng Cao,Zihuang Wu,Lei Zhang,Chong Gao,Guanbin Li,Qiyuan Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025
Abstract:Accurate segmentation of Pelvic Radiation Injury (PRI) from Magnetic Resonance Images (MRI) is crucial for more precise prognosis assessment and the development of personalized treatment plans. However, automated segmentation remains challenging due to factors such as complex organ morphologies and confusing context. To address these challenges, we propose a novel Pattern Divide-and-Conquer Network (PDC-Net) for PRI segmentation. The core idea is to use different network modules to “divide” various local and global patterns and, through flexible feature selection, to “conquer” the Regions of Interest (ROI) during the decoding phase. Specifically, considering that our ROI often manifests as strip-like or circular-like structures in MR slices, we introduce a Multi-Direction Aggregation (MDA) module. This module enhances the model’s ability to fit the shape of the organ by applying strip convolutions in four distinct directions. Additionally, to mitigate the challenge of confusing context, we propose a Memory-Guided Context (MGC) module. This module explicitly maintains a memory parameter to track cross-image patterns at the dataset level, thereby enhancing the distinction between global patterns associated with the positive and negative classes. Finally, we design an Adaptive Fusion Decoder (AFD) that dynamically selects features from different patterns based on the Mixture-of-Experts (MoE) framework, ultimately generating the final segmentation results. We evaluate our method on the first large-scale pelvic radiation injury dataset, and the results demonstrate the superiority of our PDC-Net over existing approaches.
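The strip-convolution idea in the MDA module can be sketched with long, thin kernels along different directions. The toy block below shows horizontal and vertical strips fused with a residual sum, whereas the paper applies strips in four distinct directions with a more elaborate fusion.

```python
# Toy two-direction strip-convolution block for strip-like anatomy.
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    def __init__(self, ch, k=11):
        super().__init__()
        pad = k // 2
        self.horiz = nn.Conv2d(ch, ch, kernel_size=(1, k), padding=(0, pad))  # 1 x k strip
        self.vert = nn.Conv2d(ch, ch, kernel_size=(k, 1), padding=(pad, 0))   # k x 1 strip
    def forward(self, x):                          # x: (B, C, H, W)
        return x + self.horiz(x) + self.vert(x)   # residual fusion of directions

block = StripConvBlock(ch=16)
print(block(torch.randn(2, 16, 64, 64)).shape)  # torch.Size([2, 16, 64, 64])
```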
zh
[CV-152] Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models
【速读】:该论文试图解决如何根据自然语言指令交互式生成和编辑三维房间网格(3D room mesh)的问题。其解决方案的关键在于将复杂的任务分解为多个简化步骤,并引入可视化编程(Visual Programming, VP)以统一支持这些分解后的任务。VP通过大型语言模型(Large Language Model, LLM)生成类似Python的程序,用于执行从创建合理的三维坐标、生成全景纹理图像、整合坐标与纹理生成三维网格,到家具布置等操作。其中,纹理生成模块利用预训练的大规模扩散模型,结合文本和视觉提示(如布局、深度图和语义图)生成全景图像,并通过双向长短期记忆网络(Bidirectional LSTM)的1D表示优化训练目标,从而提升生成质量。
链接: https://arxiv.org/abs/2506.17707
作者: Jihyun Kim,Junho Park,Kyeongbo Kong,Suk-Ju Kang
机构: Sogang University (首尔大学); Pusan National University (釜庆国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by IEEE Transactions on Multimedia
Abstract:We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program which is an ordered list of necessary modules for the various tasks given in natural language. We develop most of the modules. In particular, for the texture-generating module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from bidirectional LSTM. We demonstrate Programmable-Room's flexibility in generating and editing 3D room meshes, and prove our framework's superiority to an existing model quantitatively and qualitatively. Project page is available at this https URL.
zh
[CV-153] DreamJourney: Perpetual View Generation with Video Diffusion Models
【速读】:该论文旨在解决传统方法在生成长期视频时缺乏3D感知能力导致的图像失真以及无法捕捉动态4D世界中物体运动的问题。其解决方案的关键在于提出DreamJourney,一个两阶段框架,利用视频扩散模型的世界模拟能力,实现包含相机运动和物体动态的持续场景视图生成。第一阶段通过将输入图像提升至3D点云并生成部分图像序列,结合视频扩散模型完成缺失区域并增强视觉一致性;第二阶段则通过多模态大语言模型生成描述物体运动的文本提示,并使用视频扩散模型对当前视图进行动画处理,从而实现持续的动态场景视图生成。
链接: https://arxiv.org/abs/2506.17705
作者: Bo Pan,Yang Chen,Yingwei Pan,Ting Yao,Wei Chen,Tao Mei
机构: Zhejiang University (浙江大学); HiDream.ai (HiDream.ai)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content of previously unseen regions along camera movement. However, the underlying 2D diffusion model lacks 3D awareness and results in distorted artifacts. Moreover, they are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video that adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in the current view, and uses a video diffusion model to animate the current view with object movements. Stages I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively. Our project page: this https URL.
zh
[CV-154] SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
【速读】:该论文试图解决传统音视频说话人验证方法依赖大量标注数据和独立模态专用架构所带来的计算成本高、可扩展性差的问题。其解决方案的关键在于提出一种基于对比学习的自监督学习框架,结合非对称掩码和掩码数据建模,以获得鲁棒的音视频特征表示,并采用统一的框架使用单一共享的视觉Transformer主干网络处理音频和视觉输入,从而在训练和测试阶段实现计算效率与对缺失模态的鲁棒性。
链接: https://arxiv.org/abs/2506.17694
作者: Gnana Praveen Rajasekhar,Jahangir Alam
机构: Computer Research Institute of Montreal (CRIM)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which makes them computationally expensive and limits their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.
zh
[CV-155] Domain Generalization using Action Sequences for Egocentric Action Recognition
【速读】:该论文旨在解决第一人称视角(egocentric vision)动作识别模型在未见过的环境中的性能下降问题。其关键解决方案是通过引入一种领域泛化方法,即SeqDG,该方法利用动作序列中一致的用户意图来提升模型在未知环境中的泛化能力。SeqDG的核心在于提出了一种视觉-文本序列重构目标(SeqRec),通过结合文本和视觉上下文线索来重建序列中的核心动作,并通过混合不同领域动作序列进行训练以增强模型鲁棒性。
链接: https://arxiv.org/abs/2506.17685
作者: Amirshayan Nasirimajd,Chiara Plizzari,Simone Alberto Peirone,Marco Ciccone,Giuseppe Averta,Barbara Caputo
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Pattern Recognition Letters. 9 pages including references. Code and Data: this https URL
Abstract:Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model's robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG leads to +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieved +0.6% Top-1 accuracy over SOTA in intra-domain action recognition.
zh
[CV-156] CSDN: A Context-Gated Self-Adaptive Detection Network for Real-Time Object Detection
【速读】:该论文旨在解决卷积神经网络(CNN)在目标检测中因感受野有限而难以捕捉全局上下文信息的问题,以及DETR启发的检测头网络中自注意力机制可能存在冗余信息的问题。其解决方案的关键在于引入基于Transformer的Context-Gated Scale-Adaptive Detection Network (CSDN),通过一种新颖的门控机制替代传统的堆叠自注意力和交叉注意力层,使每个感兴趣区域(ROI)能够自适应地选择和融合多尺度特征信息,从而提升全局上下文建模能力并增强对不同尺寸和结构目标的适应性。
链接: https://arxiv.org/abs/2506.17679
作者: Wei Haolin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures
Abstract:Convolutional neural networks (CNNs) have long been the cornerstone of object detection, but they are often constrained by limited receptive fields, which hinders their ability to capture global contextual information. We argue that the effective utilization of extracted features is as important as the feature extraction process itself. We critically re-evaluated the DETR-inspired head network architecture, questioning the indispensable nature of its self-attention mechanism, and discovered significant information redundancies. To solve these problems, we introduce the Context-Gated Scale-Adaptive Detection Network (CSDN), a Transformer-based detection head inspired by natural language processing architectures and human visual perception. CSDN aims to efficiently utilize the features of the CNN backbone by replacing the traditional stacked self-attention and cross-attention layers with a novel gating mechanism. This mechanism enables each region of interest (ROI) to adaptively select and combine feature dimensions and scale information from multiple attention patterns. CSDN provides more powerful global context modeling capabilities and can better adapt to objects of different sizes and structures. Our proposed detection head can directly replace the native heads of various CNN-based detectors, and only a few rounds of fine-tuning on the pre-trained weights significantly improve detection accuracy, avoiding the extensive re-training of the various layer modules that would otherwise be needed for small gains.
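A hypothetical minimal form of such a gate is shown below: each ROI feature produces softmax weights over pooled multi-scale features and mixes them, in place of stacked attention layers. The gate form is an assumption used to convey the idea, not CSDN's exact head.

```python
# A per-ROI gate that selects and mixes pooled multi-scale features.
import torch
import torch.nn as nn

class ScaleGate(nn.Module):
    def __init__(self, dim, n_scales=3):
        super().__init__()
        self.gate = nn.Linear(dim, n_scales)

    def forward(self, roi, scale_feats):
        # roi: (B, dim); scale_feats: (B, n_scales, dim) pooled per scale
        w = self.gate(roi).softmax(dim=-1)            # per-ROI scale weights
        return (w.unsqueeze(-1) * scale_feats).sum(dim=1)

gate = ScaleGate(dim=64)
print(gate(torch.randn(4, 64), torch.randn(4, 3, 64)).shape)  # torch.Size([4, 64])
```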
zh
[CV-157] MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation
【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中常见的幻觉问题,该问题通常源于模型在解码过程中对图像标记(image tokens)的敏感性,表现为生成真实和幻觉实体时注意力峰值的出现。解决方案的关键在于提出一种无需训练的新型方法——记忆驱动的稀疏注意力矩阵(Memory-Driven Sparse Attention Matrix, MDSAM),该方法通过动态捕捉并优化每一层对图像标记的注意力分配,利用对齐机制记忆注意力模式并触发更新,从而增强对相关图像标记的关注,有效减少幻觉现象。
链接: https://arxiv.org/abs/2506.17664
作者: Shuaiye Lu,Linjiang Zhou,Xiaochuan Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hallucinations in large vision-language models (LVLMs) often stem from the model’s sensitivity to image tokens during decoding, as evidenced by attention peaks observed when generating both real and hallucinated entities. To address this, we propose Memory-Driven Sparse Attention Matrix (MDSAM) , a novel training-free approach that dynamically captures and refines the attention allocated to image tokens at each layer. MDSAM memorizes attention patterns and activates updates through alignment during decoding, enhancing focus on relevant image tokens while effectively reducing hallucinations. We evaluate MDSAM on multiple benchmarks for tasks such as image captioning and visual question answering, demonstrating its ability to consistently reduce hallucinations and improve reliability. Compatible with various LVLM architectures, MDSAM highlights its adaptability and effectiveness in mitigating hallucinations without requiring additional training or external tools.
zh
[CV-158] Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning
【速读】:该论文试图解决从组织病理学图像自动生成医学报告的问题,这一任务需要有效的视觉表示和领域专业知识。解决方案的关键在于提出一种称为PathGenIC的上下文学习框架,该框架结合了来自训练集的上下文信息与多模态上下文学习(ICL)机制,通过动态检索语义相似的全切片图像(WSI)-报告对,并引入自适应反馈以增强上下文相关性和生成质量。
链接: https://arxiv.org/abs/2506.17645
作者: Shih-Wen Liu,Hsuan-Yu Fan,Wei-Ta Chu,Fu-En Yang,Yu-Chiang Frank Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MIDL 2025
Abstract:Automating medical report generation from histopathology images is a critical challenge requiring effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, we propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning (ICL) mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Evaluated on the HistGen benchmark, the framework achieves state-of-the-art results, with significant improvements across BLEU, METEOR, and ROUGE-L metrics, and demonstrates robustness across diverse report lengths and disease categories. By maximizing training data utility and bridging vision and language with ICL, our work offers a solution for AI-driven histopathology reporting, setting a strong foundation for future advancements in multimodal clinical applications.
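The retrieval step that underpins this kind of multimodal in-context learning is typically a nearest-neighbor search in an embedding space. The sketch below shows that generic recipe with cosine similarity; the embedding model and memory bank are left abstract and are not PathGenIC's actual pipeline.

```python
# Generic cosine-similarity retrieval of in-context WSI-report examples.
import torch
import torch.nn.functional as F

def retrieve_icl_examples(query_emb, bank_embs, reports, k=3):
    """query_emb: (D,); bank_embs: (N, D); reports: list of N report strings."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), bank_embs, dim=-1)  # (N,)
    top = sims.topk(k).indices.tolist()
    return [(reports[i], sims[i].item()) for i in top]

bank = F.normalize(torch.randn(100, 256), dim=-1)   # stand-in WSI embeddings
reports = [f"report_{i}" for i in range(100)]
print(retrieve_icl_examples(torch.randn(256), bank, reports))
```

The retrieved pairs would then be placed in the prompt alongside the query image, which is where the adaptive feedback described in the abstract comes in.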
zh
[CV-159] 3D Gaussian Splatting for Fine-Detailed Surface Reconstruction in Large-Scale Scene IROS2025
【速读】:该论文旨在解决大规模场景下基于3D Gaussian Splatting的表面重建问题,特别是在计算资源需求高和户外环境动态外观复杂的情况下。其关键解决方案包括:采用从粗到细的策略高效构建粗略模型,结合自适应场景划分与子场景精修;引入解耦外观模型以捕捉全局外观变化,并通过瞬态掩码模型减少移动物体的干扰;最后扩展多视角约束并引入单视角正则化以处理无纹理区域。这些方法共同提升了大规模场景下表面重建的精度与细节表现。
链接: https://arxiv.org/abs/2506.17636
作者: Shihan Chen,Zhaojin Li,Zeyu Chen,Qingsong Yan,Gaoyang Shen,Ran Duan
机构: The Hong Kong Polytechnic University (香港理工大学); Wuhan University (武汉大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: IROS 2025
Abstract:Recent developments in 3D Gaussian Splatting have made significant advances in surface reconstruction. However, scaling these methods to large-scale scenes remains challenging due to high computational demands and the complex dynamic appearances typical of outdoor environments. These challenges hinder the application in aerial surveying and autonomous driving. This paper proposes a novel solution to reconstruct large-scale surfaces with fine details, supervised by full-sized images. Firstly, we introduce a coarse-to-fine strategy to reconstruct a coarse model efficiently, followed by adaptive scene partitioning and sub-scene refining from image segments. Additionally, we integrate a decoupling appearance model to capture global appearance variations and a transient mask model to mitigate interference from moving objects. Finally, we expand the multi-view constraint and introduce a single-view regularization for texture-less areas. Our experiments were conducted on the publicly available dataset GauU-Scene V2, which was captured using unmanned aerial vehicles. To the best of our knowledge, our method outperforms existing NeRF-based and Gaussian-based methods, achieving high-fidelity visual results and accurate surface from full-size image optimization. Open-source code will be available on GitHub.
zh
[CV-160] Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection ICML2025
【速读】:该论文试图解决少样本(few-shot)分布外(out-of-distribution, OOD)检测的问题,即在仅有少量标记的分布内(in-distribution, ID)样本的情况下,如何有效区分分布外样本。传统OOD检测方法依赖大量独立同分布(IID)样本进行训练,而少样本场景下这一限制显著制约了方法的实用性。该论文的关键解决方案是提出一种新型网络——自适应多提示对比网络(Adaptive Multi-prompt Contrastive Network, AMCN),通过学习类间与类内分布来适应ID-OOD分离边界,并利用CLIP模型连接文本与图像,工程化可学习的ID和OOD文本提示,以弥补OOD样本不足和ID样本稀缺的问题。
链接: https://arxiv.org/abs/2506.17633
作者: Xiang Fang,Arvind Easwaran,Blaise Genest
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML 2025
Abstract:Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.
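To make the prompt-based decision rule concrete, the sketch below scores an image embedding against ID and OOD text-prompt embeddings and applies a per-class threshold. The random embeddings stand in for CLIP features, and the margin logic is a simplification of the paper's module, not its implementation.

```python
# Schematic ID/OOD decision from prompt similarities with class-wise thresholds.
import torch
import torch.nn.functional as F

def ood_decision(img, id_prompts, ood_prompts, class_thresholds):
    """img: (D,); id_prompts: (C, D); ood_prompts: (M, D); class_thresholds: (C,)."""
    id_sims = F.cosine_similarity(img.unsqueeze(0), id_prompts, dim=-1)        # (C,)
    ood_sim = F.cosine_similarity(img.unsqueeze(0), ood_prompts, dim=-1).max() # best OOD match
    c = id_sims.argmax().item()
    is_id = id_sims[c] - ood_sim > class_thresholds[c]  # per-class margin test
    return ("ID", c) if is_id else ("OOD", None)

D, C, M = 128, 5, 4
print(ood_decision(torch.randn(D), torch.randn(C, D), torch.randn(M, D),
                   torch.zeros(C)))
```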
zh
[CV-161] Optimization-Free Patch Attack on Stereo Depth Estimation
【速读】:该论文旨在解决立体深度估计(Stereo Depth Estimation, SDE)模型在现实场景中面临的物理可实现、场景自适应且具有迁移性的对抗攻击问题。现有研究中的对抗攻击多局限于不现实的设置,例如静态场景中对独立立体视图的数字扰动,限制了其实际应用。论文的关键解决方案是提出一种无需优化的对抗性补丁攻击方法——PatchHunter,该方法通过强化学习驱动的搜索,在精心设计的视觉模式结构空间中生成补丁,以破坏SDE的假设,从而实现更有效的攻击效果和更高的黑盒迁移性。
链接: https://arxiv.org/abs/2506.17632
作者: Hangcheng Liu,Xu Kuang,Xingshuo Han,Xingwan Wu,Haoran Ou,Shangwei Guo,Xingyi Huang,Tao Xiang,Tianwei Zhang
机构: Nanyang Technological University (南洋理工大学); Chongqing University (重庆大学); Wuhan University (武汉大学); Jinan University (济南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Stereo Depth Estimation (SDE) is essential for scene understanding in vision-based systems like autonomous driving. However, recent studies show that SDE models are vulnerable to adversarial attacks, which are often limited to unrealistic settings, e.g., digital perturbations on separate stereo views in static scenes, restricting their real-world applicability. This raises a critical question: how can we design physically realizable, scene-adaptive, and transferable attacks against SDE under realistic constraints? To answer this, we make two key contributions. First, we propose a unified attack framework that extends optimization-based techniques to four core stages of stereo matching: feature extraction, cost-volume construction, cost aggregation, and disparity regression. A comprehensive stage-wise evaluation across 9 mainstream SDE models, under constraints like photometric consistency, reveals that optimization-based patches suffer from poor transferability. Interestingly, partially transferable patches suggest that patterns, rather than pixel-level perturbations, may be key to generalizable attacks. Motivated by this, we present PatchHunter, the first optimization-free adversarial patch attack against SDE. PatchHunter formulates patch generation as a reinforcement learning-driven search over a structured space of visual patterns crafted to disrupt SDE assumptions. We validate PatchHunter across three levels: the KITTI dataset, the CARLA simulator, and real-world vehicle deployment. PatchHunter not only surpasses optimization-based methods in effectiveness but also achieves significantly better black-box transferability. Even under challenging physical conditions like low light, PatchHunter maintains high attack success (e.g., D1-all > 0.4), whereas optimization-based methods fail.
zh
[CV-162] Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning?
【速读】:该论文试图解决文本数据与多模态模型之间存在的“模态差距”问题,即如何利用文本到图像(Text-to-Image, T2I)模型实时生成的图像作为文本中心任务的补充模态。解决方案的关键在于通过系统性评估框架分析T2I模型质量、提示工程策略以及多模态融合架构等关键变量,验证生成的“合成感知”在提升文本分类任务性能方面的有效性。研究发现,该方法的有效性高度依赖于文本与生成图像之间的语义对齐程度、任务的内在“视觉可 grounding 性”以及T2I模型的生成保真度。
链接: https://arxiv.org/abs/2506.17623
作者: Yuesheng Huang,Peng Zhang,Riliang Liu,Jiaqi Liang
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 figures,7 tables
Abstract:A significant "modality gap" exists between the abundance of text-only data and the increasing power of multimodal models. This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as a valuable complementary modality for text-centric tasks. Through a comprehensive evaluation framework on text classification, we analyze the impact of critical variables, including T2I model quality, prompt engineering strategies, and multimodal fusion architectures. Our findings demonstrate that this "synthetic perception" can yield significant performance gains, even when augmenting strong large language model baselines. However, we find the effectiveness of this approach is highly conditional, depending critically on the semantic alignment between text and the generated image, the inherent "visual groundability" of the task, and the generative fidelity of the T2I model. Our work establishes the first rigorous benchmark for this paradigm, providing a clear analysis of its potential and current limitations, and demonstrating its viability as a pathway to enrich language understanding in traditionally unimodal scenarios.
zh
[CV-163] JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
【速读】:该论文旨在解决传统专业图像修图工具(如Adobe Lightroom)需要大量专业知识和手动操作,而现有基于AI的解决方案在可调整性和泛化能力方面存在不足的问题。其关键解决方案是提出JarvisArt,一个由多模态大语言模型(MLLM)驱动的智能代理,能够理解用户意图、模拟专业艺术家的推理过程,并智能协调Lightroom中的200多个修图工具。该模型通过两阶段训练流程——初始的思维链监督微调和基于群体相对策略优化的修图(GRPO-R)——提升其决策能力和工具使用熟练度,同时引入Agent-to-Lightroom协议实现与Lightroom的无缝集成。
链接: https://arxiv.org/abs/2506.17612
作者: Yunlong Lin,Zixu Lin,Kunjie Lin,Jinbin Bai,Panwang Pan,Chenxin Li,Haoyu Chen,Zhongdao Wang,Xinghao Ding,Wenbo Li,Shuicheng Yan
机构: Xiamen University (厦门大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Chinese University of Hong Kong (香港中文大学); Bytedance (字节跳动); National University of Singapore (新加坡国立大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 26 figures
Abstract:Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: this https URL.
zh
[CV-164] HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs CVPR2025
【速读】:该论文旨在解决现代多模态大语言模型中高分辨率图像特征提取带来的计算成本过高的问题,这一问题主要源于对如ViT等大型图像编码器的多次调用。论文提出的解决方案的关键在于通过浅层特征增强器实现特征上采样,从而在保持高性能的同时显著降低训练和推理时间及计算成本,实验表明该方法在FLOPs方面最多可节省1.5倍。
链接: https://arxiv.org/abs/2506.17608
作者: Nikitha SR,Aradhya Neeraj Mathur,Tarun Ram Menta,Rishabh Jain,Mausoom Sarkar
机构: Media and Data Science Research Lab, Adobe
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2025 Workshop on What’s Next in Multimodal Foundational Models
Abstract:The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x savings in FLOPs.
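The core trade-off, one encoder pass plus a shallow enricher instead of repeated high-resolution encoder calls, can be sketched as below. The two-layer upsampling head is an assumed minimal example, not the paper's architecture.

```python
# A shallow convolutional head that upsamples and enriches one-pass ViT features.
import torch
import torch.nn as nn

class ShallowEnricher(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, feats):          # feats: (B, dim, h, w) from a single encoder pass
        return self.net(feats)         # (B, dim, 2h, 2w) enriched "high-res" features

enricher = ShallowEnricher(dim=256)
print(enricher(torch.randn(1, 256, 16, 16)).shape)  # torch.Size([1, 256, 32, 32])
```

The FLOPs saving comes from the head being a few convolutions rather than another full transformer forward pass at higher resolution.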
zh
[CV-165] OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor
【速读】:该论文试图解决脑部磁共振成像(MRI)扫描中年龄预测模型的可解释性和对人口统计学及技术差异的鲁棒性问题。解决方案的关键在于提出一种基于Transformer的架构,该架构利用大规模数据集进行自监督预训练,并通过引入茎结构将传统Transformer模型的二次复杂度降低到线性复杂度,从而实现对高维MRI数据的可扩展处理。此外,该模型融合了来自三个解剖视角的伪3D T1加权MRI扫描和大脑体积信息,以提高预测精度和泛化能力。
链接: https://arxiv.org/abs/2506.17597
作者: Pengyu Kan,Craig Jones,Kenichi Oishi
机构: Johns Hopkins University(约翰霍普金斯大学); The Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: To develop an age prediction model which is interpretable and robust to demographic and technological variances in brain MRI scans. Materials and Methods: We propose a transformer-based architecture that leverages self-supervised pre-training on large-scale datasets. Our model processes pseudo-3D T1-weighted MRI scans from three anatomical views and incorporates brain volumetric information. By introducing a stem architecture, we reduce the conventional quadratic complexity of transformer models to linear complexity, enabling scalability for high-dimensional MRI data. We trained our model on the ADNI2&3 (N=1348) and OASIS3 (N=716) datasets (age range: 42 - 95) from North America, with an 8:1:1 split for train, validation and test. Then, we validated it on the AIBL dataset (N=768, age range: 60 - 92) from Australia. Results: We achieved an MAE of 3.65 years on the ADNI2&3 and OASIS3 test set and a high generalizability with an MAE of 3.54 years on AIBL. There was a notable increase in brain age gap (BAG) across cognitive groups, with a mean of 0.15 years (95% CI: [-0.22, 0.51]) in CN, 2.55 years ([2.40, 2.70]) in MCI, and 6.12 years ([5.82, 6.43]) in AD. Additionally, a significant negative correlation between BAG and cognitive scores was observed, with correlation coefficients of -0.185 (p < 0.001) for MoCA and -0.231 (p < 0.001) for MMSE. Gradient-based feature attribution highlighted ventricles and white matter structures as key regions influenced by brain aging. Conclusion: Our model effectively fused information from different views and volumetric information to achieve state-of-the-art brain age prediction accuracy, improved generalizability and interpretability with association to neurodegenerative disorders.
zh
[CV-166] A Multimodal In Vitro Diagnostic Method for Parkinsons Disease Combining Facial Expressions and Behavioral Gait Data
【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)早期诊断中存在的挑战,包括面部表情诊断的训练数据有限、步态诊断依赖专业设备和采集环境导致泛化能力差,以及单一模态诊断存在误诊或漏诊的风险。其解决方案的关键在于提出一种基于面部表情和行为步态的多模态体外诊断方法,采用轻量级深度学习模型进行特征提取与融合,以提高诊断准确性并支持在移动设备上的部署。
链接: https://arxiv.org/abs/2506.17596
作者: Wei Huang,Yinxuan Xu,Yintao Zhou,Zhengyu Li,Jing Huang,Meng Pang
机构: Nanchang University(南昌大学); Yichun University(宜春大学); Nanchang University Second Affiliated Hospital(南昌大学第二附属医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures, accepted by CogSci 2025
Abstract:Parkinson’s disease (PD), characterized by its incurable nature, rapid progression, and severe disability, poses significant challenges to the lives of patients and their families. Given the aging population, the need for early detection of PD is increasing. In vitro diagnosis has garnered attention due to its non-invasive nature and low cost. However, existing methods present several challenges: 1) limited training data for facial expression diagnosis; 2) specialized equipment and acquisition environments required for gait diagnosis, resulting in poor generalizability; 3) the risk of misdiagnosis or missed diagnosis when relying on a single modality. To address these issues, we propose a novel multimodal in vitro diagnostic method for PD, leveraging facial expressions and behavioral gait. Our method employs a lightweight deep learning model for feature extraction and fusion, aimed at improving diagnostic accuracy and facilitating deployment on mobile devices. Furthermore, we have established the largest multimodal PD dataset in collaboration with a hospital and conducted extensive experiments to validate the effectiveness of our proposed method.
zh
[CV-167] SELFI: Selective Fusion of Identity for Generalizable Deepfake Detection
【速读】:该论文旨在解决深度伪造检测中关于面部身份特征(face identity)的矛盾问题,即是否应抑制或依赖身份线索以提高检测性能。其关键解决方案是提出一种名为SELFI(SELective FUSion of Identity)的通用检测框架,该框架通过动态调节身份特征的使用,实现对身份特征的显式建模与自适应控制。SELFI包含两个核心组件: Forgery-Aware Identity Adapter(FAIA)用于将身份嵌入投影到伪造相关空间,Identity-Aware Fusion Module(IAFM)则通过相关性引导的融合机制选择性地整合身份和视觉特征,从而提升跨不同篡改方法的泛化能力。
链接: https://arxiv.org/abs/2506.17592
作者: Younghun Kim,Minsuk Jang,Myung-Joon Kwon,Wonjun Lee,Changick Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face identity provides a powerful signal for deepfake detection. Prior studies show that even when not explicitly modeled, classifiers often learn identity features implicitly. This has led to conflicting views: some suppress identity cues to reduce bias, while others rely on them as forensic evidence. To reconcile these views, we analyze two hypotheses: (1) whether face identity alone is discriminative for detecting deepfakes, and (2) whether such identity features generalize poorly across manipulation methods. Our experiments confirm that identity is informative but context-dependent. While some manipulations preserve identity-consistent artifacts, others distort identity cues and harm generalization. We argue that identity features should neither be blindly suppressed nor relied upon, but instead be explicitly modeled and adaptively controlled based on per-sample relevance. We propose SELFI (SELective Fusion of Identity), a generalizable detection framework that dynamically modulates identity usage. SELFI consists of: (1) a Forgery-Aware Identity Adapter (FAIA) that extracts identity embeddings from a frozen face recognition model and projects them into a forgery-relevant space via auxiliary supervision; and (2) an Identity-Aware Fusion Module (IAFM) that selectively integrates identity and visual features using a relevance-guided fusion mechanism. Experiments on four benchmarks show that SELFI improves cross-manipulation generalization, outperforming prior methods by an average of 3.1% AUC. On the challenging DFDC dataset, SELFI exceeds the previous best by 6%. Code will be released upon paper acceptance.
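A minimal form of relevance-gated fusion is sketched below: a learned per-sample gate decides how much identity evidence to mix into the visual feature. The layer sizes and gate design are illustrative assumptions rather than the IAFM's actual equations.

```python
# A per-sample relevance gate that mixes identity evidence into visual features.
import torch
import torch.nn as nn

class RelevanceFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.relevance = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.proj_id = nn.Linear(dim, dim)

    def forward(self, visual, identity):          # both (B, dim)
        r = self.relevance(torch.cat([visual, identity], dim=-1))  # (B, 1) in [0, 1]
        return visual + r * self.proj_id(identity)  # identity used only when relevant

fuse = RelevanceFusion(dim=128)
print(fuse(torch.randn(4, 128), torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```

When the gate saturates toward zero, the detector effectively ignores identity, matching the paper's position that identity should be neither blindly suppressed nor blindly trusted.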
zh
[CV-168] DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving
【速读】:该论文旨在解决在复杂城市场景中对易受伤害道路使用者(Vulnerable Road Users, VRUs)的短期运动预测问题,特别是在安全关键情境下的多类别意图预测缺乏系统性评估的问题。为填补这一空白,研究者提出了DRAMA-X,一个基于DRAMA数据集并通过自动化标注流程构建的细粒度基准,包含丰富的标注信息以支持对象检测、意图预测、风险评估和动作建议等任务。其解决方案的关键在于引入SGG-Intent框架,该框架利用视觉语言模型(Vision-Language Models, VLMs)生成场景图,并通过组合推理阶段进行意图推断、风险评估和动作建议,从而提升自主决策的准确性与安全性。
链接: https://arxiv.org/abs/2506.17590
作者: Mihir Godbole,Xiangbo Gao,Zhengzhong Tu
机构: Texas A&M University (得克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 19 pages, 5 figures, Preprint under review. Code available at: this https URL
Abstract:Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
zh
[CV-169] HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models
【速读】:该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在生成过程中容易出现的幻觉问题,即生成的文本在语法上合理但与视觉内容不符。解决方案的关键在于提出一种架构层面的改进方法——HalluRNN,其核心是引入了一个共享跨层且可循环精炼隐藏状态的Dual-Gated Depth Propagation Unit (DG-DPU)模块,从而实现信息的自适应传播、层间一致性强化,并缓解由表征漂移引起的幻觉现象。
链接: https://arxiv.org/abs/2506.17587
作者: Le Yu,Kaishen Wang,Jianlong Xiong,Yue Cao,Tao He
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 figures, 9 tables
Abstract:Though Large Vision-Language Models (LVLMs) have achieved remarkable performance across various tasks, they are still prone to hallucinations-generating outputs that are textually plausible but visually ungrounded. While prior approaches generally address this issue through data-centric fine-tuning or innovative decoding strategies, these methods often require substantial resources or task-specific configurations. In this work, we introduce an architecture-level solution, HalluRNN, which enhances model stability through recurrent cross-layer reasoning. Specifically, we propose a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. This allows for the adaptive propagation of information throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.
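Since the abstract does not spell out the DG-DPU equations, the sketch below uses a generic GRU-style dual-gated update, shared across layers and applied recurrently, to convey the shape of the idea; it is a stand-in, not the module itself.

```python
# A GRU-style dual-gated unit shared across layers, refining a carried state.
import torch
import torch.nn as nn

class DualGatedUnit(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update_gate = nn.Linear(2 * dim, dim)
        self.reset_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, h_layer, h_state):
        """h_layer: current layer's hidden states; h_state: carried refined state."""
        z = torch.sigmoid(self.update_gate(torch.cat([h_layer, h_state], -1)))
        r = torch.sigmoid(self.reset_gate(torch.cat([h_layer, h_state], -1)))
        n = torch.tanh(self.candidate(torch.cat([h_layer, r * h_state], -1)))
        return (1 - z) * h_state + z * n   # refined state passed to the next layer

unit = DualGatedUnit(dim=64)               # one shared unit reused at every layer
state = torch.zeros(2, 64)
for _ in range(4):                          # recurrent refinement across 4 layers
    state = unit(torch.randn(2, 64), state)
print(state.shape)  # torch.Size([2, 64])
```

Sharing one gated unit across layers is what keeps the fine-tuned parameter count small while still enforcing cross-layer consistency.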
zh
[CV-170] VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
【速读】:该论文旨在解决当前Vision-Language-Action (VLA)模型在不同网络架构、规划范式、表示方式和训练数据来源之间的性能提升原因难以明确的问题。其解决方案的关键在于引入VLA-OS,一个统一的VLA架构系列,能够支持多种任务规划范式,并通过设计全面的控制实验,在不同物体类别、视觉模态、环境和末端执行器条件下系统性地评估不同规划范式和表示方式的影响。
链接: https://arxiv.org/abs/2506.17561
作者: Chongkai Gao,Zixuan Liu,Zhenghao Chi,Junshan Huang,Xin Fei,Yiwen Hou,Yuxuan Zhang,Yudi Lin,Zhirui Fang,Zeyu Jiang,Lin Shao
机构: National University of Singapore (新加坡国立大学); Fudan University (复旦大学); Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and components to be further improved. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves superior or comparable performance relative to other paradigms on task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.
zh
[CV-171] SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference ICML2025
【速读】:该论文试图解决如何有效评估生成式AI(Generative AI)模型是否真正学习到部分-整体层次结构的问题,这一问题在胶囊网络(Capsule Networks)的训练中尤为突出。现有方法在监督任务如物体分类中训练胶囊网络,难以验证其是否实际学习了部分-整体层次结构。解决方案的关键在于提出一个用于胶囊测试与评估的合成数据集(SYNthetic DAtaset for CApsule Testing and Evaluation, SynDaCaTE),并通过该数据集揭示了现有胶囊模型的精确瓶颈,并验证了排列等变自注意力机制在部分到整体推理中的高效性,从而为设计有效的归纳偏置提供了新的方向。
链接: https://arxiv.org/abs/2506.17558
作者: Jake Levi,Mark van der Wilk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at Methods and Opportunities at Small Scale (MOSS), ICML 2025, Vancouver, Canada
Abstract:Learning to infer object representations, and in particular part-whole hierarchies, has been the focus of extensive research in computer vision, in pursuit of improving data efficiency, systematic generalisation, and robustness. Models which are designed to infer part-whole hierarchies, often referred to as capsule networks, are typically trained end-to-end on supervised tasks such as object classification, in which case it is difficult to evaluate whether such a model actually learns to infer part-whole hierarchies, as claimed. To address this difficulty, we present a SYNthetic DAtaset for CApsule Testing and Evaluation, abbreviated as SynDaCaTE, and establish its utility by (1) demonstrating the precise bottleneck in a prominent existing capsule model, and (2) demonstrating that permutation-equivariant self-attention is highly effective for parts-to-wholes inference, which motivates future directions for designing effective inductive biases for computer vision.
zh
[CV-172] DRIMV_TSK: An Interpretable Surgical Evaluation Model for Incomplete Multi-View Rectal Cancer Data
【速读】:该论文旨在解决直肠癌手术难度评估中数据不完整和多模态数据融合的问题,以提高治疗成功率。其解决方案的关键在于提出一种可解释的不完全多视图手术评估模型,该模型通过双表示的多视图学习方法提取各视图间的共有信息和特定信息,并将缺失视图补全集成到表征学习中,同时引入二阶相似性约束以增强两部分之间的协同学习。此外,基于补全后的多视图数据和学习到的双表示,结合TSK模糊系统构建多视图手术评估模型,并引入合作学习机制和香农熵来适应不同视图的权重。
链接: https://arxiv.org/abs/2506.17552
作者: Wei Zhang,Zi Wang,Hanwen Zhou,Zhaohong Deng,Weiping Ding,Yuxi Ge,Te Zhang,Yuanpeng Zhang,Kup-Sze Choi,Shitong Wang,Shudong Hu
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A reliable evaluation of surgical difficulty can improve the success of the treatment for rectal cancer and the current evaluation method is based on clinical data. However, more data about rectal cancer can be collected with the development of technology. Meanwhile, with the development of artificial intelligence, its application in rectal cancer treatment is becoming possible. In this paper, a multi-view rectal cancer dataset is first constructed to give a more comprehensive view of patients, including the high-resolution MRI image view, pressed-fat MRI image view, and clinical data view. Then, an interpretable incomplete multi-view surgical evaluation model is proposed, considering that it is hard to obtain extensive and complete patient data in real application scenarios. Specifically, a dual representation incomplete multi-view learning model is first proposed to extract the common information between views and specific information in each view. In this model, the missing view imputation is integrated into representation learning, and second-order similarity constraint is also introduced to improve the cooperative learning between these two parts. Then, based on the imputed multi-view data and the learned dual representation, a multi-view surgical evaluation model with the TSK fuzzy system is proposed. In the proposed model, a cooperative learning mechanism is constructed to explore the consistent information between views, and Shannon entropy is also introduced to adapt the view weight. On the MVRC dataset, we compared it with several advanced algorithms and DRIMV_TSK obtained the best results.
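The Shannon-entropy view weighting can be illustrated with the common inverse-entropy scheme: views whose predictions are less uncertain (lower entropy) receive larger weights. The exact weighting rule in the paper may differ; this sketch shows only the generic idea.

```python
# Inverse-entropy view weighting: more confident views get larger weights.
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of class-probability rows p: (B, C)."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def view_weights(view_probs):
    """view_probs: list of (B, C) class-probability arrays, one per view."""
    ents = np.array([entropy(p).mean() for p in view_probs])
    inv = 1.0 / (ents + 1e-12)
    return inv / inv.sum()          # normalized weights over views

rng = np.random.default_rng(0)
# Smaller Dirichlet concentration -> peakier (more confident) predictions.
views = [rng.dirichlet(np.ones(3) * a, size=16) for a in (0.5, 2.0, 8.0)]
print(view_weights(views))  # the most confident view gets the largest weight
```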
zh
[CV-173] Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
【Quick Read】: This paper tackles the fact that existing 3D-aware LLMs behave as black boxes: they do not reveal their decision process and depend on pre-trained 3D detectors for object proposals. The key to the solution is Scene-R1, a framework that reasons about 3D scenes without point-level 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. A temporal grounding stage analyzes the video and selects the snippets most relevant to an open-ended query; an image grounding stage then predicts 2D bounding boxes, after which SAM2 tracks the object to produce pixel-level masks that are projected back into 3D. This removes the reliance on 3D detector proposals while capturing fine geometry and material cues.
Link: https://arxiv.org/abs/2506.17545
Authors: Zhihao Yuan,Shuyi Jiang,Chun-Mei Feng,Yaolun Zhang,Shuguang Cui,Zhen Li,Na Zhao
Affiliation: FNii-Shenzhen, CUHKSZ; SSE, CUHKSZ; IHPC, A*STAR, Singapore; Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Currently, utilizing large language models to understand the 3D world is becoming popular. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. After that, we track the object using SAM2 to produce pixel-accurate masks in RGB frames, and project them back into 3D, thereby eliminating the need for 3D detector-based proposals while capturing fine geometry and material cues. Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video. Our training pipeline only needs task-level 2D boxes or textual labels without dense 3D point-wise labels. Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets, while delivering transparent, step-by-step rationales. These results show that reinforcement-learning-based reasoning combined with RGB-D video alone offers a practical, annotation-efficient route to trustworthy 3D scene understanding.
zh
[CV-174] EASE: Embodied Active Event Perception via Self-Supervised Energy Minimization
【Quick Read】: This paper addresses the adaptability and scalability of event perception for embodied tasks in dynamic, real-world settings (human-AI collaboration, assistive robotics, autonomous navigation), where existing methods depend on predefined action spaces, annotated datasets, and extrinsic rewards. The key to the solution is EASE, a framework that unifies spatiotemporal representation learning and embodied control through free-energy minimization, using prediction error and entropy as intrinsic signals for event segmentation, observation summarization, and salient-actor tracking, without explicit annotations or external rewards; this yields emergent behaviors such as implicit memory, target continuity, and adaptability to novel environments.
Link: https://arxiv.org/abs/2506.17516
Authors: Zhou Chen,Sanjoy Kundu,Harsimran S. Baweja,Sathyanarayanan N. Aakur
Affiliation: Auburn University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Robotics and Automation Letters, 2025
Abstract:Active event perception, the ability to dynamically detect, track, and summarize events in real time, is essential for embodied intelligence in tasks such as human-AI collaboration, assistive robotics, and autonomous navigation. However, existing approaches often depend on predefined action spaces, annotated datasets, and extrinsic rewards, limiting their adaptability and scalability in dynamic, real-world scenarios. Inspired by cognitive theories of event perception and predictive coding, we propose EASE, a self-supervised framework that unifies spatiotemporal representation learning and embodied control through free energy minimization. EASE leverages prediction errors and entropy as intrinsic signals to segment events, summarize observations, and actively track salient actors, operating without explicit annotations or external rewards. By coupling a generative perception model with an action-driven control policy, EASE dynamically aligns predictions with observations, enabling emergent behaviors such as implicit memory, target continuity, and adaptability to novel environments. Extensive evaluations in simulation and real-world settings demonstrate EASE’s ability to achieve privacy-preserving and scalable event perception, providing a robust foundation for embodied systems in unscripted, dynamic tasks.
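A minimal sketch of prediction error and entropy as intrinsic signals, in the spirit of the abstract; the free-energy proxy and the event threshold below are editorial assumptions, not the authors' specification:

```python
import torch
import torch.nn.functional as F

def intrinsic_signals(pred_next, obs_next, event_threshold=1.0):
    """Annotation-free signals: per-sample prediction error plus entropy.

    pred_next: (B, D) predicted features (logits) for the next observation
    obs_next:  (B, D) features actually observed
    An error spike above `event_threshold` marks a candidate event boundary,
    loosely following predictive-coding accounts of event perception.
    """
    error = F.mse_loss(pred_next, obs_next, reduction="none").mean(dim=1)
    probs = F.softmax(pred_next, dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    free_energy = error + entropy  # editorial proxy for the minimized quantity
    return free_energy, error > event_threshold
```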
zh
[CV-175] Learning golf swing signatures from a single wrist-worn inertial sensor
【Quick Read】: This paper addresses the limitations of golf swing analysis: isolated metrics, underrepresentation of professional athletes, and the lack of rich, interpretable movement representations. The key to the solution is a comprehensive data-driven framework built on a single wrist-worn sensor: professional swings are collected from publicly available videos, full-body 3D kinematics are reconstructed with biologically accurate human mesh recovery, and synthetic inertial data is generated to train neural networks that infer motion and segment swing phases from wrist-based input. The method also learns a compositional, discrete vocabulary of motion primitives that supports detecting and visualizing technical flaws and is expressive enough to predict player identity, club type, sex, and age.
Link: https://arxiv.org/abs/2506.17505
Authors: Jessy Lauer
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures
Abstract:Despite its importance for performance and injury prevention, golf swing analysis is limited by isolated metrics, underrepresentation of professional athletes, and a lack of rich, interpretable movement representations. We address these gaps with a holistic, data-driven framework for personalized golf swing analysis from a single wrist-worn sensor. We build a large dataset of professional swings from publicly available videos, reconstruct full-body 3D kinematics using biologically accurate human mesh recovery, and generate synthetic inertial data to train neural networks that infer motion and segment swing phases from wrist-based input. We learn a compositional, discrete vocabulary of motion primitives that facilitates the detection and visualization of technical flaws, and is expressive enough to predict player identity, club type, sex, and age. Our system accurately estimates full-body kinematics and swing events from wrist data, delivering lab-grade motion analysis on-course and supporting early detection of anomalous movement patterns. Explainability methods reveal subtle, individualized movement signatures, reinforcing the view that variability is a hallmark of skilled performance. Longitudinal tracking demonstrates practical value: as one player’s handicap improved from 50 to 2.2 over 1.5 years, our system captured measurable technical progress and provided targeted, actionable feedback. Our findings challenge common assumptions, such as swing consistency across clubs and the existence of a single “ideal” swing, and uncover latent biomarkers shaped by both intrinsic traits and task-specific constraints. This work bridges lab and field-based biomechanics, offering scalable, accessible, high-fidelity motion analysis for research, coaching, and injury prevention, while opening new directions in movement-based phenotyping, personalized equipment design, and motor skill development.
zh
[CV-176] Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction MICCAI2025
【Quick Read】: This paper addresses the lack of reliability guarantees when transferring medical vision-language models (VLMs) with only a small labeled calibration set. The key is a new transfer-learning pipeline, transductive split conformal adaptation (SCA-T), which performs unsupervised transductive adaptation jointly on calibration and test data. This avoids the pitfall of conventional transfer learning, where adapting the model on the available calibration data breaks the strict exchangeability assumption underlying split conformal prediction (SCP), and thereby improves efficiency and conditional coverage across tasks.
Link: https://arxiv.org/abs/2506.17503
Authors: Julio Silva-Rodríguez,Ismail Ben Ayed,Jose Dolz
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025. Code: this https URL
Abstract:Medical vision-language models (VLMs) have demonstrated unprecedented transfer capabilities and are being increasingly adopted for data-efficient image classification. Despite its growing popularity, its reliability aspect remains largely unexplored. This work explores the split conformal prediction (SCP) framework to provide trustworthiness guarantees when transferring such models based on a small labeled calibration set. Despite its potential, the generalist nature of the VLMs’ pre-training could negatively affect the properties of the predicted conformal sets for specific tasks. While common practice in transfer learning for discriminative purposes involves an adaptation stage, we observe that deploying such a solution for conformal purposes is suboptimal since adapting the model using the available calibration data breaks the rigid exchangeability assumptions for test data in SCP. To address this issue, we propose transductive split conformal adaptation (SCA-T), a novel pipeline for transfer learning on conformal scenarios, which performs an unsupervised transductive adaptation jointly on calibration and test data. We present comprehensive experiments utilizing medical VLMs across various image modalities, transfer tasks, and non-conformity scores. Our framework offers consistent gains in efficiency and conditional coverage compared to SCP, maintaining the same empirical guarantees.
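For context, the vanilla split conformal prediction (SCP) step that SCA-T builds on, and whose exchangeability assumption naive adaptation breaks, can be sketched as follows (the nonconformity score is one common choice, not necessarily the paper's):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Vanilla split conformal prediction.

    Nonconformity score: 1 - probability of the true class. The adjusted
    (1 - alpha) quantile of calibration scores yields prediction sets that
    cover the true label with probability >= 1 - alpha, assuming the
    calibration and test points are exchangeable.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    return test_probs >= 1.0 - qhat  # boolean mask: class is in the set
```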
zh
[CV-177] Few-Shot Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation MICCAI2025
【Quick Read】: This paper challenges the strong assumptions that current few-shot adaptation methods for vision-language models (VLMs) make about the adaptation data in medical image analysis, assumptions that are unrealistic in practice: access to a balanced support set and to an extra validation set for tuning critical hyper-parameters. The key contribution is a realistic, imbalanced, validation-free adaptation setting, together with a training-free linear probe that adaptively blends visual and textual supervision, enabling robust adaptation in these challenging scenarios.
Link: https://arxiv.org/abs/2506.17500
Authors: Julio Silva-Rodríguez,Fereshteh Shakeri,Houda Bahig,Jose Dolz,Ismail Ben Ayed
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025. Code: this https URL
Abstract:Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.
zh
[CV-178] Photogranulometry – Dataset of soil images with corresponding particle size distributions
【Quick Read】: This paper addresses the time, labor, and maintenance costs of traditional particle size distribution (PSD) analysis in geotechnical laboratories. The key is to integrate optical grain size analysis into the routine geotechnical workflow: 12,714 high-resolution images of 321 soil samples were collected together with their PSD analyses, providing a reliable dataset for training convolutional neural networks (CNNs). Samples were photographed in a standardized top-view position at 45 MP resolution, in both moist and dry states, ensuring the accuracy and applicability of the data.
Link: https://arxiv.org/abs/2506.17469
Authors: Thomas Plante St-Cyr,François Duhaime,Jean-Sébastien Dubé,Simon Grenier
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures, conference
Abstract:Traditional particle size distribution (PSD) analyses create significant downtime and are expensive in labor and maintenance. These drawbacks could be alleviated using optical grain size analysis integrated into routine geotechnical laboratory workflow. This paper presents a high-resolution dataset of 12,714 images of 321 different soil samples collected in the Montreal, Quebec region, alongside their PSD analysis. It is designed to provide a robust starting point for training convolutional neural networks (CNN) in geotechnical applications. Soil samples were photographed in a standardized top-view position with a resolution of 45 MP and a minimum scale of 39.4 micrometers per pixel, both in their moist and dry states. A custom test bench employing 13x9 inch white aluminum trays, on which the samples are spread in a thin layer, was used. For samples exceeding a size limit, a coning and quartering method was employed for mass reduction.
zh
[CV-179] General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting
【Quick Read】: This paper tackles the core challenge of developing general-purpose navigation policies for unknown environments; existing systems rely on task-specific neural networks and fixed data flows, which limits generalization. The key is the Agentic Robotic Navigation Architecture (ARNA), which equips an LVLM-based agent with a library of perception, reasoning, and navigation tools available within modern robotic stacks. At runtime the agent autonomously defines and executes task-specific workflows, reasoning over multimodal inputs and selecting navigation actions, improving navigation and reasoning in previously unmapped environments.
Link: https://arxiv.org/abs/2506.17462
Authors: Bernard Lange,Anil Yildiz,Mansur Arief,Shehryar Khattak,Mykel Kochenderfer,Georgios Georgakis
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed data flows, limiting generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge suitable for reasoning and planning. Yet, prior LVLM-robot integrations typically depend on pre-mapped spaces, hard-coded representations, and myopic exploration. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools available within modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query the robotic modules, reason over multimodal inputs, and select appropriate navigation actions. This approach enables robust navigation and reasoning in previously unmapped environments, providing a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA achieves state-of-the-art performance, demonstrating effective exploration, navigation, and embodied question answering without relying on handcrafted plans, fixed input representations, or pre-existing maps.
zh
[CV-180] When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network ICML2025
【Quick Read】: This paper addresses the balance between response time and detection accuracy in anomaly detection for autonomous driving, where existing methods emphasize accuracy but neglect latency in time-sensitive scenarios. The key is a novel multimodal asynchronous hybrid network that combines event streams from event cameras with RGB images: an asynchronous graph neural network exploits the high temporal resolution of the event camera and is fused with spatial features extracted from RGB images by a CNN, capturing both the temporal dynamics and the spatial detail of the driving environment for fast, precise anomaly detection.
Link: https://arxiv.org/abs/2506.17457
Authors: Dong Xiao,Guangyao Chen,Peixi Peng,Yangru Huang,Yifan Zhao,Yongxing Dai,Yonghong Tian
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 Spotlight
Abstract:Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.
zh
[CV-181] AQUA20: A Benchmark Dataset for Underwater Species Classification under Challenging Conditions
【Quick Read】: This paper addresses robust visual recognition in underwater environments, where turbidity, low illumination, and occlusion severely degrade standard vision systems. The key is AQUA20, a comprehensive benchmark of 8,171 underwater images spanning 20 marine species under real-world challenges (illumination, turbidity, occlusion), together with an evaluation of thirteen state-of-the-art deep models, from lightweight CNNs to transformer architectures, under these challenging conditions; ConvNeXt performs best, showing a good balance between parameter count and accuracy.
Link: https://arxiv.org/abs/2506.17455
Authors: Taufikur Rahman Fuad,Sabbir Ahmed,Shahriar Ivan
Affiliation: IUT-Dhaka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to AJSE Springer
Abstract:Robust visual recognition in underwater environments remains a significant challenge due to complex distortions such as turbidity, low illumination, and occlusion, which severely degrade the performance of standard vision systems. This paper introduces AQUA20, a comprehensive benchmark dataset comprising 8,171 underwater images across 20 marine species reflecting real-world environmental challenges such as illumination, turbidity, occlusions, etc., providing a valuable resource for underwater visual understanding. Thirteen state-of-the-art deep learning models, including lightweight CNNs (SqueezeNet, MobileNetV2) and transformer-based architectures (ViT, ConvNeXt), were evaluated to benchmark their performance in classifying marine species under challenging conditions. Our experimental results show ConvNeXt achieving the best performance, with a Top-3 accuracy of 98.82% and a Top-1 accuracy of 90.69%, as well as the highest overall F1-score of 88.92% with moderately large parameter size. The results obtained from our other benchmark models also demonstrate trade-offs between complexity and performance. We also provide an extensive explainability analysis using GRAD-CAM and LIME for interpreting the strengths and pitfalls of the models. Our results reveal substantial room for improvement in underwater species recognition and demonstrate the value of AQUA20 as a foundation for future research in this domain. The dataset is publicly available at: this https URL.
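For reference, the Top-1/Top-3 accuracies reported above follow the standard Top-k computation, sketched here:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    """Fraction of samples whose true label is among the k highest-probability
    classes. probs: (n, n_classes); labels: (n,) integer class ids."""
    topk = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(labels, topk)]))

probs = np.array([[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]])
print(top_k_accuracy(probs, np.array([2, 0]), k=2))  # 1.0
```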
zh
[CV-182] BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
【Quick Read】: This paper addresses generating high-quality, consistent results for complex compositional scene editing, i.e., recomposing objects, camera, and background into new scenes. The key is BlenderFusion, a layering-editing-compositing pipeline: visual inputs are segmented and converted into editable 3D entities (layering), edited in Blender under 3D-grounded control (editing), and fused into a coherent scene by a generative compositor (compositing). The compositor extends a pre-trained diffusion model to process the source and edited target scenes in parallel, fine-tuned with two key training strategies, source masking and simulated object jittering, enabling flexible edits such as background replacement and disentangled control over objects and camera.
Link: https://arxiv.org/abs/2506.17450
Authors: Jiacheng Chen,Ramin Mehran,Xuhui Jia,Saining Xie,Sanghyun Woo
Affiliation: Google DeepMind; Simon Fraser University; New York University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.
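The "simulated object jittering" strategy can be pictured with a toy sketch that simply translates one object's pixels; this is an editor's approximation (it assumes the shifted object stays inside the frame), not the authors' implementation:

```python
import numpy as np

def jitter_object(image, obj_mask, max_shift=8, seed=0):
    """Translate one object's pixels by a random offset, leaving a hole at
    the original location, so a compositor must learn object pose separately
    from the camera. Assumes the object does not wrap around the border."""
    rng = np.random.default_rng(seed)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = image.copy()
    out[obj_mask] = 0                                # clear original location
    new_mask = np.roll(obj_mask, (dy, dx), axis=(0, 1))
    out[new_mask] = image[obj_mask]                  # paste at the new offset
    return out, new_mask
```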
zh
[CV-183] Enhancing Wireless Device Identification through RF Fingerprinting: Leveraging Transient Energy Spectrum Analysis
【Quick Read】: This paper aims at accurately identifying and classifying emitting devices in complex electromagnetic environments. The key is to extract RF device features via transient energy spectrum analysis based on the General Linear Chirplet Transform, and to introduce a hybrid deep learning model, a CNN with a bidirectional gated recurrent unit (CNN-Bi-GRU), to improve identification accuracy and efficiency. The approach reaches 99.33% precision, 99.53% recall, a 99.43% F1-score, and 99.17% classification accuracy under 10-fold cross-validation, demonstrating its potential for device identification and classification in complex wireless environments.
Link: https://arxiv.org/abs/2506.17439
Authors: Nisar Ahmed,Gulshan Saleem,Hafiz Muhammad Shahzad Asif,Muhammad Usman Younus,Kalsoom Safdar
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Wireless Personal Communications
Abstract:In recent years, the rapid growth of the Internet of Things technologies and the widespread adoption of 5G wireless networks have led to an exponential increase in the number of radiation devices operating in complex electromagnetic environments. A key challenge in managing and securing these devices is accurate identification and classification. To address this challenge, specific emitter identification techniques have emerged as a promising solution that aims to provide reliable and efficient means of identifying individual radiation devices in a unified and standardized manner. This research proposes an approach that leverages transient energy spectrum analysis using the General Linear Chirplet Transform to extract features from RF devices. A dataset comprising nine RF devices is utilized, with each sample containing 900 attributes and a total of 1080 equally distributed samples across the devices. These features are then used in a classification modeling framework. To overcome the limitations of conventional machine learning methods, we introduce a hybrid deep learning model called the CNN-Bi-GRU for learning the identification of RF devices based on their transient characteristics. The proposed approach provided a 10-fold cross-validation performance with a precision of 99.33%, recall of 99.53%, F1-score of 99.43%, and classification accuracy of 99.17%. The results demonstrate the promising classification performance of the CNN-Bi-GRU approach, indicating its suitability for accurately identifying RF devices based on their transient characteristics and its potential for enhancing device identification and classification in complex wireless environments.
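A hedged PyTorch sketch of a CNN-Bi-GRU classifier of the kind described, sized for the paper's 900-attribute samples and nine devices; kernel widths and hidden sizes are editorial assumptions:

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """1-D CNN front end followed by a bidirectional GRU over the reduced
    sequence of transient-energy features."""
    def __init__(self, n_classes=9):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.gru = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):             # x: (B, 900) transient features
        h = self.cnn(x.unsqueeze(1))  # (B, 64, 225)
        h, _ = self.gru(h.transpose(1, 2))
        return self.fc(h[:, -1])      # last time step -> device logits

print(CNNBiGRU()(torch.randn(4, 900)).shape)  # torch.Size([4, 9])
```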
zh
[CV-184] Trans2-CBCT: A Dual-Transformer Framework for Sparse-View CBCT Reconstruction
【Quick Read】: This paper targets the heavy artifacts and poor spatial coverage caused by severe under-sampling in sparse-view cone-beam computed tomography (CBCT). The key is a unified framework: first, the conventional encoder is replaced with TransUNet, a hybrid CNN-Transformer in which convolutional layers capture local detail and self-attention layers strengthen global context, combined with multi-scale features, per-3D-point view-specific feature querying, and a lightweight attenuation-prediction head; second, a neighbor-aware Point Transformer module with 3D positional encoding and k-nearest-neighbor attention enforces volumetric coherence, yielding consistent gains on the LUNA16 and ToothFairy datasets.
Link: https://arxiv.org/abs/2506.17425
Authors: Minmin Yang,Huantao Ren,Senem Velipasalar
Affiliation: Syracuse University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cone-beam computed tomography (CBCT) using only a few X-ray projection views enables faster scans with lower radiation dose, but the resulting severe under-sampling causes strong artifacts and poor spatial coverage. We address these challenges in a unified framework. First, we replace conventional UNet/ResNet encoders with TransUNet, a hybrid CNN-Transformer model. Convolutional layers capture local details, while self-attention layers enhance global context. We adapt TransUNet to CBCT by combining multi-scale features, querying view-specific features per 3D point, and adding a lightweight attenuation-prediction head. This yields Trans-CBCT, which surpasses prior baselines by 1.17 dB PSNR and 0.0163 SSIM on the LUNA16 dataset with six views. Second, we introduce a neighbor-aware Point Transformer to enforce volumetric coherence. This module uses 3D positional encoding and attention over k-nearest neighbors to improve spatial consistency. The resulting model, Trans^2-CBCT, provides an additional gain of 0.63 dB PSNR and 0.0117 SSIM. Experiments on LUNA16 and ToothFairy show consistent gains from six to ten views, validating the effectiveness of combining CNN-Transformer features with point-based geometry reasoning for sparse-view CBCT reconstruction.
zh
[CV-185] VMRA-MaR: An Asymmetry-Aware Temporal Framework for Longitudinal Breast Cancer Risk Prediction MICCAI2025
【Quick Read】: This paper addresses the shortcomings of traditional screening in dynamically assessing high-risk groups for early breast cancer detection, in particular how to exploit the temporal dynamics of longitudinal imaging to improve cancer-onset prediction. The key is the Vision Mamba RNN (VMRNN), combining a state-space model (SSM) with LSTM-like memory mechanisms to capture subtle trends in breast tissue evolution, plus an asymmetry module with a Spatial Asymmetry Detector (SAD) and a Longitudinal Asymmetry Tracker (LAT) to identify clinically relevant bilateral differences, improving prediction especially for challenging high-density breast cases and at extended time points.
Link: https://arxiv.org/abs/2506.17412
Authors: Zijun Sun,Solveig Thrun,Michael Kampffmeyer
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025, Provisional Accept
Abstract:Breast cancer remains a leading cause of mortality worldwide and is typically detected via screening programs where healthy people are invited in regular intervals. Automated risk prediction approaches have the potential to improve this process by facilitating dynamically screening of high-risk groups. While most models focus solely on the most recent screening, there is growing interest in exploiting temporal information to capture evolving trends in breast tissue, as inspired by clinical practice. Early methods typically relied on two time steps, and although recent efforts have extended this to multiple time steps using Transformer architectures, challenges remain in fully harnessing the rich temporal dynamics inherent in longitudinal imaging data. In this work, we propose to instead leverage Vision Mamba RNN (VMRNN) with a state-space model (SSM) and LSTM-like memory mechanisms to effectively capture nuanced trends in breast tissue evolution. To further enhance our approach, we incorporate an asymmetry module that utilizes a Spatial Asymmetry Detector (SAD) and Longitudinal Asymmetry Tracker (LAT) to identify clinically relevant bilateral differences. This integrated framework demonstrates notable improvements in predicting cancer onset, especially for the more challenging high-density breast cases and achieves superior performance at extended time points (years four and five), highlighting its potential to advance early breast cancer recognition and enable more personalized screening strategies. Our code is available at this https URL.
zh
[CV-186] Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos
【Quick Read】: This paper addresses automating embryo viability prediction for in vitro fertilization (IVF), which is hard because labeled pregnancy outcomes are scarce. Existing self-supervised learning (SSL) methods do not transfer to embryo development videos for two reasons: time-lapse videos contain hundreds of frames, so conventional SSL needs large amounts of GPU memory; and videos vary in length with many outlier frames, so conventional frame-by-frame alignment suffers from semantic misalignment. The proposed Spatial-Temporal Pre-Training (STPT) trains a spatial and a temporal encoder in separate stages, freezing one encoder in each stage to cut memory demands, and avoids frame-by-frame alignment across videos, handling long videos and temporal variability effectively.
Link: https://arxiv.org/abs/2506.17403
Authors: Zhiyi Shi,Junsik Kim,Helen Y. Yang,Yonghyun Song,Hyun-Jic Oh,Dalit Ben-Yosef,Daniel Needleman,Hanspeter Pfister
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint submitted to Medical Image Analysis
Abstract:Automating embryo viability prediction for in vitro fertilization (IVF) is important but challenging due to the limited availability of labeled pregnancy outcome data, as only a small fraction of embryos are labeled after transfer. Self-supervised learning (SSL) can leverage both labeled and unlabeled data to improve prediction. However, existing SSL methods for videos are not directly applicable to embryo development videos due to two challenges: (1) embryo time-lapse videos contain hundreds of frames, requiring significant GPU memory for conventional SSL; (2) the dataset contains videos with varying lengths and many outlier frames, causing traditional video alignment methods to struggle with semantic misalignment. We propose Spatial-Temporal Pre-Training (STPT) to address these challenges. STPT includes two stages: spatial and temporal. In each stage, only one encoder is trained while the other is frozen, reducing memory demands. To handle temporal misalignment, STPT avoids frame-by-frame alignment across videos. The spatial stage learns from alignments within each video and its temporally consistent augmentations. The temporal stage then models relationships between video embeddings. Our method efficiently handles long videos and temporal variability. On 23,027 time-lapse videos (3,286 labeled), STPT achieves the highest AUC of 0.635 (95% CI: 0.632-0.638) compared to baselines, with limited computational resources.
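The memory-saving idea of training one encoder per stage reduces, in code, to toggling `requires_grad`; below is a minimal sketch with placeholder encoders (STPT's actual architectures are not specified here):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

# Placeholder encoders; no gradients are stored for the frozen half, which
# is what keeps memory manageable for videos with hundreds of frames.
spatial_encoder = nn.Conv2d(3, 64, kernel_size=3, padding=1)
temporal_encoder = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

# Stage 1 (spatial): only the spatial encoder accumulates gradients.
set_trainable(spatial_encoder, True)
set_trainable(temporal_encoder, False)
# ... spatial-stage training loop ...

# Stage 2 (temporal): freeze spatial, train temporal.
set_trainable(spatial_encoder, False)
set_trainable(temporal_encoder, True)
# ... temporal-stage training loop ...
```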
zh
[CV-187] A workflow for generating synthetic LiDAR datasets in simulation environments
【Quick Read】: This paper addresses the need for high-quality synthetic LiDAR datasets for autonomous-vehicle perception, robotics research, and sensor security analysis, where diverse, reproducible, realistic LiDAR data for evaluation and testing is lacking. The key is an automated simulation workflow built on the CoppeliaSim environment and its Python API, integrating time-of-flight LiDAR, image sensors, and 2D scanners on a simulated vehicle in an urban scene, and producing synchronized multimodal datasets including point clouds (PCD, PLY) and structured data (CSV) with ground-truth pose information, thereby supporting the analysis of security vulnerabilities and the evaluation of defense strategies.
Link: https://arxiv.org/abs/2506.17378
Authors: Abhishek Phadke,Shakib Mahmud Dipto,Pratip Rana
Affiliation: Christopher Newport University; Old Dominion University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents a simulation workflow for generating synthetic LiDAR datasets to support autonomous vehicle perception, robotics research, and sensor security analysis. Leveraging the CoppeliaSim simulation environment and its Python API, we integrate time-of-flight LiDAR, image sensors, and two dimensional scanners onto a simulated vehicle platform operating within an urban scenario. The workflow automates data capture, storage, and annotation across multiple formats (PCD, PLY, CSV), producing synchronized multimodal datasets with ground truth pose information. We validate the pipeline by generating large-scale point clouds and corresponding RGB and depth imagery. The study examines potential security vulnerabilities in LiDAR data, such as adversarial point injection and spoofing attacks, and demonstrates how synthetic datasets can facilitate the evaluation of defense strategies. Finally, limitations related to environmental realism, sensor noise modeling, and computational scalability are discussed, and future research directions, such as incorporating weather effects, real-world terrain models, and advanced scanner configurations, are proposed. The workflow provides a versatile, reproducible framework for generating high-fidelity synthetic LiDAR datasets to advance perception research and strengthen sensor security in autonomous systems. Documentation and examples accompany this framework; samples of animated cloud returns and image sensor data can be found at this Link.
zh
[CV-188] From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge
【Quick Read】: This paper addresses efficient, accurate extraction of key information from 2D engineering drawings, including geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail on complex layouts, engineering symbols, and rotated text, producing incomplete, unreliable output. The proposed hybrid vision-language framework couples a rotation-aware object detector (YOLOv11-obb) with a transformer-based vision-language parser: YOLOv11-OBB localizes annotations and extracts oriented bounding box (OBB) patches, which a fine-tuned lightweight vision-language model (VLM) parses into structured outputs, improving extraction accuracy and reliability.
Link: https://arxiv.org/abs/2506.17374
Authors: Muhammad Tayyab Khan,Lequn Chen,Zane Yong,Jun Ming Tan,Wenhe Feng,Seung Ki Moon
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Preprint submitted to Elsevier
Abstract:Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-obb) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a fine-tuned, lightweight vision-language model (VLM). We curate a dataset of 1,367 2D mechanical drawings annotated across nine key categories. YOLOv11-OBB is trained on this dataset to detect OBBs and extract annotation patches. These are parsed using two open-source VLMs: Donut and Florence-2. Both models are lightweight and well-suited for specialized industrial tasks under limited computational overhead. Following fine-tuning of both models on the curated dataset of image patches paired with structured annotation labels, a comparative experiment is conducted to evaluate parsing performance across four key metrics. Donut outperforms Florence-2, achieving 88.5% precision, 99.2% recall, and a 93.5% F1-score, with a hallucination rate of 11.5%. Finally, a case study demonstrates how the extracted structured information supports downstream manufacturing tasks such as process and tool selection, showcasing the practical utility of the proposed framework in modernizing 2D drawing interpretation.
zh
[CV-189] Multimodal Political Bias Identification and Neutralization
【Quick Read】: This paper addresses subjective bias and emotionally charged language in political articles driven by political echo chambers, targeting bias in both text and images; prior work has focused on text only, even though images are just as powerful a medium for spreading information. The key is a four-step model combining text and image debiasing: image-text alignment (semantic alignment with CLIP models), image bias scoring (a ViT classifier assigns a bias score to each image), text de-biasing (BERT models detect and neutralize biased wording), and a final debiasing step that replaces text and images with neutralized or bias-reduced counterparts by comparing bias scores. Preliminary experiments are promising, though more time and resources are needed to improve performance further.
Link: https://arxiv.org/abs/2506.17372
Authors: Cedric Bernard,Xavier Pleimling,Amun Kharel,Chase Vickery
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Due to the presence of political echo chambers, it becomes imperative to detect and remove subjective bias and emotionally charged language from both the text and images of political articles. However, prior work has focused solely on the text portion of the bias rather than both the text and image portions. This is a problem because images are just as powerful a medium for communicating information as text. To that end, we present a model that leverages both text and image bias and consists of four different steps. Image Text Alignment focuses on semantically aligning images based on their bias through CLIP models. Image Bias Scoring determines the appropriate bias score of images via a ViT classifier. Text De-Biasing focuses on detecting biased words and phrases and neutralizing them through BERT models. These three steps all culminate in the final debiasing step, which replaces the text and the image with neutralized or reduced counterparts; for images, this is done by comparing the bias scores. The results so far indicate that this approach is promising, with the text debiasing strategy being able to identify many potentially biased words and phrases, and the ViT model showcasing effective training. The semantic alignment model is also efficient. However, more time, particularly in training, and more resources are needed to obtain better results. A human evaluation portion was also proposed to ensure semantic consistency of the newly generated text and images.
zh
[CV-190] AI-based Multimodal Biometrics for Detecting Smartphone Distractions: Application to Online Learning
【Quick Read】: This paper addresses detecting distractions caused by smartphone use during tasks that require sustained attention, such as computer-based online learning. The key is multimodal biometrics: physiological signals and head-pose data are combined to detect phone use. Single biometric signals give limited accuracy, whereas a multimodal model combining all signals reaches 91% accuracy, underscoring the benefit of multimodal data fusion.
Link: https://arxiv.org/abs/2506.17364
Authors: Alvaro Becerra,Roberto Daza,Ruth Cobos,Aythami Morales,Mutlu Cukurova,Julian Fierrez
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Accepted in EC-TEL25: 20th European Conference on Technology Enhanced Learning, Newcastle and Durham, UK, 15-19 September 2025
Abstract:This work investigates the use of multimodal biometrics to detect distractions caused by smartphone use during tasks that require sustained attention, with a focus on computer-based online learning. Although the methods are applicable to various domains, such as autonomous driving, we concentrate on the challenges learners face in maintaining engagement amid internal (e.g., motivation), system-related (e.g., course design) and contextual (e.g., smartphone use) factors. Traditional learning platforms often lack detailed behavioral data, but Multimodal Learning Analytics (MMLA) and biosensors provide new insights into learner attention. We propose an AI-based approach that leverages physiological signals and head pose data to detect phone use. Our results show that single biometric signals, such as brain waves or heart rate, offer limited accuracy, while head pose alone achieves 87%. A multimodal model combining all signals reaches 91% accuracy, highlighting the benefits of integration. We conclude by discussing the implications and limitations of deploying these models for real-time support in online learning environments.
zh
[CV-191] Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution
【Quick Read】: This paper addresses the limited performance of single hyperspectral image super-resolution (SHSR) methods that, lacking auxiliary images, fail to fully exploit inter-band coherence and spatial-spectral information. The key is a group-based SHSR method, the efficient feedback gate network, which uses various feedback mechanisms and gate operations involving large-kernel convolutions and spectral interactions to learn rich band information and hierarchical spatial-spectral information. A spatial-spectral reinforcement gate module (SSRGM), built from a wide-bound perception gate block and a spectrum enhancement gate block, obtains highly representative spatial-spectral features efficiently, and a three-dimensional SSRGM further enhances the holistic information and coherence of the hyperspectral data.
Link: https://arxiv.org/abs/2506.17361
Authors: Xufei Wang,Mingjian Zhang,Fei Ge,Jinchen Zhu,Wen Sha,Jifen Ren,Zhimeng Hou,Shouguo Zheng,ling Zheng,Shizhuang Weng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 17 figures
Abstract:Even without auxiliary images, single hyperspectral image super-resolution (SHSR) methods can be designed to improve the spatial resolution of hyperspectral images. However, failing to thoroughly explore coherence along bands and spatial-spectral information limits the performance of SHSR. In this study, we propose a novel group-based SHSR method termed the efficient feedback gate network, which uses various feedbacks and gate operations involving large kernel convolutions and spectral interactions. In particular, by providing different guidance for neighboring groups, we can learn rich band information and hierarchical hyperspectral spatial information using channel shuffling and dilated convolution in the shuffled and progressive dilated fusion module (SPDFM). Moreover, we develop a wide-bound perception gate block and a spectrum enhancement gate block to construct the spatial-spectral reinforcement gate module (SSRGM) and obtain highly representative spatial-spectral features efficiently. Additionally, we apply a three-dimensional SSRGM to enhance holistic information and coherence for hyperspectral data. The experimental results on three hyperspectral datasets demonstrate the superior performance of the proposed network over the state-of-the-art methods in terms of spectral fidelity and spatial content reconstruction.
zh
[CV-192] A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving
【Quick Read】: This paper addresses the data quality (DQ) problems next-generation autonomous vehicles (AVs) face when processing multisource, heterogeneous data: current research and practice concentrate on models/algorithms while undervaluing the impact of data quality on system functionality, efficiency, and trustworthiness. The key is a task-centric, DQ-grounded framework with five layers (data, DQ, task, application, and goal) that maps data quality to task requirements and performance goals. A case study verifies that partially removing redundancy in multisource image data improves YOLOv8 object detection on nuScenes, and analysis of multimodal image and LiDAR data further exposes existing redundancy-related DQ issues.
Link: https://arxiv.org/abs/2506.17346
Authors: Yuhan Zhou,Haihua Chen,Kewei Sha
Affiliation: University of North Texas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The next-generation autonomous vehicles (AVs), embedded with frequent real-time decision-making, will rely heavily on a large volume of multisource and multimodal data. In real-world settings, the data quality (DQ) of different sources and modalities usually varies due to unexpected environmental factors or sensor issues. However, both researchers and practitioners in the AV field overwhelmingly concentrate on models/algorithms while undervaluing the DQ. To fulfill the needs of the next-generation AVs with guarantees of functionality, efficiency, and trustworthiness, this paper proposes a novel task-centric and data quality-based framework which consists of five layers: data layer, DQ layer, task layer, application layer, and goal layer. The proposed framework aims to map DQ with task requirements and performance goals. To illustrate, a case study investigating redundancy on the nuScenes dataset proves that partially removing redundancy on multisource image data could improve YOLOv8 object detection task performance. Analysis on multimodal data of image and LiDAR further presents existing redundancy DQ issues. This paper opens up a range of critical but unexplored challenges at the intersection of DQ, task orchestration, and performance-oriented system development in AVs. It is expected to guide the AV community toward building more adaptive, explainable, and resilient AVs that respond intelligently to dynamic environments and heterogeneous data streams. Code, data, and implementation details are publicly available at: this https URL.
zh
[CV-193] P2MFDS: A Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments
【Quick Read】: This paper addresses accurate fall detection for elderly people in complex environments such as bathrooms, under non-intrusive, privacy-preserving constraints, where existing unimodal systems (WiFi-, infrared-, or mmWave-based) lose accuracy due to environmental interference and system bias. The key is a privacy-preserving multimodal fall detection system that fuses millimeter-wave radar with 3D vibration sensing, builds a large-scale privacy-preserving multimodal dataset, and introduces P2MFDS, a dual-stream network combining a CNN-BiLSTM-Attention branch with a multi-scale CNN-SEBlock-Self-Attention branch to unite macro- and micro-scale features, improving detection accuracy and recall.
Link: https://arxiv.org/abs/2506.17332
Authors: Haitian Wang,Yiren Wang,Xinyu Wang,Yumeng Miao,Yuliang Zhang,Yu Zhang,Atif Mansoor
Affiliation: Northwestern Polytechnical University; The University of Western Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to appear in the 2025 IEEE International Workshop on AIoT and Smart Systems (AIoTSys'25). Nominated for Best Paper Award and Best IoT System Implementation Award. Code and pretrained models available at: this https URL
Abstract:By 2050, people aged 65 and over are projected to make up 16 percent of the global population. Aging is closely associated with increased fall risk, particularly in wet and confined environments such as bathrooms, where over 80 percent of falls occur. Although recent research has increasingly focused on non-intrusive, privacy-preserving approaches that do not rely on wearable devices or video-based monitoring, these efforts have not fully overcome the limitations of existing unimodal systems (e.g., WiFi-, infrared-, or mmWave-based), which are prone to reduced accuracy in complex environments. These limitations stem from fundamental constraints in unimodal sensing, including system bias and environmental interference, such as multipath fading in WiFi-based systems and drastic temperature changes in infrared-based methods. To address these challenges, we propose a Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments. First, we develop a sensor evaluation framework to select and fuse millimeter-wave radar with 3D vibration sensing, and use it to construct and preprocess a large-scale, privacy-preserving multimodal dataset in real bathroom settings, which will be released upon publication. Second, we introduce P2MFDS, a dual-stream network combining a CNN-BiLSTM-Attention branch for radar motion dynamics with a multi-scale CNN-SEBlock-Self-Attention branch for vibration impact detection. By uniting macro- and micro-scale features, P2MFDS delivers significant gains in accuracy and recall over state-of-the-art approaches. Code and pretrained models will be made available at: this https URL
zh
[CV-194] RadarSeq: A Temporal Vision Framework for User Churn Prediction via Radar Chart Sequences
【Quick Read】: This paper addresses churn prediction in non-subscription gig platforms, where disengagement is implicit: there are no explicit labels and user behavior is dynamic, so traditional methods miss the temporal cues needed for early detection. The key is a temporally aware computer-vision framework that models each user's behavior as a sequence of radar-chart images, one per day of behavioral features, and couples a pretrained CNN encoder with a bidirectional LSTM to capture both the spatial and the temporal patterns underlying churn, improving both accuracy and interpretability.
Link: https://arxiv.org/abs/2506.17325
Authors: Sina Najafi,M. Hadi Sepanj,Fahimeh Jafari
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Predicting user churn in non-subscription gig platforms, where disengagement is implicit, poses unique challenges due to the absence of explicit labels and the dynamic nature of user behavior. Existing methods often rely on aggregated snapshots or static visual representations, which obscure temporal cues critical for early detection. In this work, we propose a temporally-aware computer vision framework that models user behavioral patterns as a sequence of radar chart images, each encoding day-level behavioral features. By integrating a pretrained CNN encoder with a bidirectional LSTM, our architecture captures both spatial and temporal patterns underlying churn behavior. Extensive experiments on a large real-world dataset demonstrate that our method outperforms classical models and ViT-based radar chart baselines, yielding gains of 17.7 in F1 score, 29.4 in precision, and 16.1 in AUC, along with improved interpretability. The framework’s modular design, explainability tools, and efficient deployment characteristics make it suitable for large-scale churn modeling in dynamic gig-economy platforms.
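Rendering a day of behavioral features as a radar-chart image, the per-frame input the CNN encoder consumes, can be sketched as follows; the feature names are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

def radar_chart(day_features, labels, path):
    """Render one day of (normalized) behavioral features as a radar chart."""
    n = len(day_features)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False).tolist()
    values = list(day_features)
    angles, values = angles + angles[:1], values + values[:1]  # close polygon
    fig, ax = plt.subplots(subplot_kw={"polar": True})
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.3)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)
    fig.savefig(path)
    plt.close(fig)

radar_chart([0.7, 0.2, 0.9, 0.4, 0.5],
            ["logins", "rides", "earnings", "messages", "idle"], "day_001.png")
```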
zh
[CV-195] Origins of Creativity in Attention-Based Diffusion Models
【Quick Read】: This paper asks where 'creativity' in diffusion models comes from, i.e., how generated images can differ markedly from training samples while remaining plausible. The key is an analysis, under the score-matching view, of how a CNN score network and its self-attention mechanism shape the global consistency of generated samples: when the score is parametrized by a CNN with a final self-attention layer, the theory predicts that self-attention induces a globally image-consistent arrangement of local features, going beyond the patch-wise 'mosaics' produced by plain CNNs.
Link: https://arxiv.org/abs/2506.17324
Authors: Emma Finn,T. Anderson Keller,Manos Theodosis,Demba E. Ba
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As diffusion models have become the tool of choice for image generation and as the quality of the images continues to improve, the question of how 'creativity' originates in diffusion has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain plausible while differing significantly from their training images. In particular, as explained in (Kamb & Ganguli, 2024) and others, e.g., (Ambrogioni, 2023), theory suggests that if our score matching were optimal, we would only be able to recover training samples through our diffusion process. However, as shown by Kamb & Ganguli (2024), in diffusion models where the score is parametrized by a simple CNN, the inductive biases of the CNN itself (translation equivariance and locality) allow the model to generate samples that globally do not match any training samples, but are rather patch-wise 'mosaics'. Notably, however, this theory does not extend to describe the role of self-attention in this process. In this work, we take a preliminary step in this direction to extend this theory to the case of diffusion models whose score is parametrized by a CNN with a final self-attention layer. We show that our theory suggests that self-attention will induce a globally image-consistent arrangement of local features beyond the patch-level in generated samples, and we verify this behavior empirically on a carefully crafted dataset.
zh
[CV-196] MAARTA: Multi-Agentic Adaptive Radiology Teaching Assistant MICCAI2025
【Quick Read】: This paper addresses the visual-search and diagnostic-interpretation errors radiology students make when expert mentorship time is limited; existing AI systems target diagnostic accuracy but do not explain how and why these errors occur. The key is MAARTA (Multi-Agentic Adaptive Radiology Teaching Assistant), a multi-agent framework that analyzes gaze patterns and radiology reports to provide personalized feedback. MAARTA selects agents dynamically based on error complexity for adaptive, efficient reasoning, compares expert and student gaze behavior through structured graphs to identify missed findings, assigns Perceptual Error Teacher agents to analyze the discrepancies, and uses step-by-step prompting to help students understand their errors and improve diagnostic reasoning.
Link: https://arxiv.org/abs/2506.17320
Authors: Akash Awasthi,Brandon V. Chang,Anh M. Vu,Ngan Le,Rishi Agrawal,Zhigang Deng,Carol Wu,Hien Van Nguyen
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to MICCAI 2025 (Main Conference)
Abstract:Radiology students often struggle to develop perceptual expertise due to limited expert mentorship time, leading to errors in visual search and diagnostic interpretation. These perceptual errors, such as missed fixations, short dwell times, or misinterpretations, are not adequately addressed by current AI systems, which focus on diagnostic accuracy but fail to explain how and why errors occur. To address this gap, we introduce MAARTA (Multi-Agentic Adaptive Radiology Teaching Assistant), a multi-agent framework that analyzes gaze patterns and radiology reports to provide personalized feedback. Unlike single-agent models, MAARTA dynamically selects agents based on error complexity, enabling adaptive and efficient reasoning. By comparing expert and student gaze behavior through structured graphs, the system identifies missed findings and assigns Perceptual Error Teacher agents to analyze discrepancies. MAARTA then uses step-by-step prompting to help students understand their errors and improve diagnostic reasoning, advancing AI-driven radiology education.
zh
[CV-197] Learning to Adapt Frozen CLIP for Few-Shot Test-Time Domain Adaptation ICLR2025
【Quick Read】: This paper addresses few-shot test-time domain adaptation: adapting a model to a specific domain at test time from only a few unlabeled examples under domain shift. Existing methods lean on CLIP's strong generalization by generating domain-specific prompts to steer its frozen features, but they are capped by CLIP's prior knowledge and adapt poorly to downstream datasets. The key is to learn directly in the input space to complement frozen CLIP with dataset-specific knowledge: an independent side branch attached in parallel with CLIP learns exclusive knowledge via revert attention, the inter-dispersion among text features is enhanced through greedy text ensemble and refinement, and a generated domain prompt progressively fuses text and visual features in a domain-aware manner, improving performance on real-world benchmarks.
Link: https://arxiv.org/abs/2506.17307
Authors: Zhixiang Chi,Li Gu,Huan Liu,Ziqiang Wang,Yanan Wu,Yang Wang,Konstantinos N Plataniotis
Affiliation: University of Toronto; Concordia University; McMaster University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR2025, this https URL
Abstract:Few-shot Test-Time Domain Adaptation focuses on adapting a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP's strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, solely depending on the feature space knowledge is constrained by CLIP's prior knowledge. Notably, when using a less robust backbone like ViT-B/16, performance significantly drops on challenging real-world benchmarks. Departing from the state-of-the-art of inheriting the intrinsic OOD capability of CLIP, this work introduces learning directly on the input space to complement the dataset-specific knowledge for frozen CLIP. Specifically, an independent side branch is attached in parallel with CLIP and enforced to learn exclusive knowledge via revert attention. To better capture the dataset-specific label semantics for downstream adaptation, we propose to enhance the inter-dispersion among text features via greedy text ensemble and refinement. The text and visual features are then progressively fused in a domain-aware manner by a generated domain prompt to adapt toward a specific domain. Extensive experiments show our method's superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of +5.1 in F1 for iWildCam and +3.1% in WC Acc for FMoW.
zh
[CV-198] Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning
【Quick Read】: This paper addresses fine-scale soil mapping in Alaska, where accelerating permafrost thaw under climate change makes traditional fieldwork- and locally simulated mapping inadequate. The key is MISO, a vision-based machine learning model that combines a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness, producing statewide fine-scale maps of near-surface permafrost and soil taxonomy.
Link: https://arxiv.org/abs/2506.17302
Authors: Yijun Lin,Theresa Chen,Colby Brungard,Grunwald Sabine,Sue Ives,Matt Macander,Timm Nawrocki,Yao-Yi Chiang,Nic Jelinski
Affiliation: University of Minnesota; New Mexico State University; University of Florida; ABR, Inc.; University of Alaska-Anchorage
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, Submitted to SIGSPATIAL 2025
Abstract:Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region’s ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at this https URL.
zh
[CV-199] SRKD: Towards Efficient 3D Point Cloud Segmentation via Structure- and Relation-aware Knowledge Distillation
【Quick Read】: This paper addresses the practical challenges of 3D point cloud segmentation posed by the computational complexity and deployment constraints of large transformer-based models. The key is SRKD, a structure- and relation-aware knowledge distillation framework that transfers rich geometric and semantic knowledge from a large frozen teacher (100M) to a lightweight student (15M), preserving performance at much lower model complexity. Its core components are an affinity-matrix-based relation alignment module, a cross-sample mini-batch construction strategy so the student perceives stable, generalized geometric structure, and joint optimization with KL divergence and ground-truth supervision to improve the student's learning of contextual interactions and general geometry.
Link: https://arxiv.org/abs/2506.17290
Authors: Yuqi Li,Junhao Dong,Zeyu Dong,Chuanguang Yang,Zhulin An,Yongjun Xu
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages
Abstract:3D point cloud segmentation faces practical challenges due to the computational complexity and deployment limitations of large-scale transformer-based models. To address this, we propose a novel Structure- and Relation-aware Knowledge Distillation framework, named SRKD, that transfers rich geometric and semantic knowledge from a large frozen teacher model (100M) to a lightweight student model (15M). Specifically, we propose an affinity matrix-based relation alignment module, which distills structural dependencies from the teacher to the student through point-wise similarity matching, enhancing the student's capability to learn contextual interactions. Meanwhile, we introduce a cross-sample mini-batch construction strategy that enables the student to perceive stable and generalized geometric structure that is aligned across diverse point cloud instances of the teacher, rather than within a single sample. Additionally, KL divergence is applied to align semantic distributions, and ground-truth supervision further reinforces accurate segmentation. Our method achieves state-of-the-art performance with significantly reduced model complexity, demonstrating its effectiveness and efficiency in real-world deployment scenarios. Our code is available at this https URL.
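A hedged sketch of the two distillation terms the abstract names, affinity-matrix relation alignment and temperature-scaled KL divergence; the normalization and the distance used for the relation term are editorial assumptions:

```python
import torch
import torch.nn.functional as F

def srkd_losses(f_s, f_t, logit_s, logit_t, tau=4.0):
    """f_s: (N, Ds) student point features; f_t: (N, Dt) teacher features;
    logit_s, logit_t: (N, C) per-point segmentation logits."""
    a_s = F.normalize(f_s, dim=1) @ F.normalize(f_s, dim=1).T  # (N, N) affinities
    a_t = F.normalize(f_t, dim=1) @ F.normalize(f_t, dim=1).T
    relation = F.mse_loss(a_s, a_t)           # match structural dependencies
    kl = F.kl_div(F.log_softmax(logit_s / tau, dim=1),
                  F.softmax(logit_t / tau, dim=1),
                  reduction="batchmean") * tau ** 2  # softened semantics
    return relation, kl

rel, kl = srkd_losses(torch.randn(128, 32), torch.randn(128, 256),
                      torch.randn(128, 13), torch.randn(128, 13))
```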
zh
[CV-200] Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation
【Quick Read】: This paper seeks a quantitative understanding of how generative diffusion models process different data distributions (synthetic versus natural images) during image generation, and of the computational pathways and mechanistic principles revealed by circuit-level analysis. The key is systematic intervention experiments that identify circuits of measurably higher computational complexity for real face processing, with distinct attention specialization across denoising steps, providing a quantitative, algorithmic basis for understanding and controlling generative model behavior.
Link: https://arxiv.org/abs/2506.17237
Authors: Dip Roy
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present a quantitative circuit-level analysis of diffusion models, establishing computational pathways and mechanistic principles underlying image generation processes. Through systematic intervention experiments across 2,000 synthetic and 2,000 CelebA facial images, we discover fundamental algorithmic differences in how diffusion architectures process synthetic versus naturalistic data distributions. Our investigation reveals that real-world face processing requires circuits with measurably higher computational complexity (complexity ratio = 1.084 ± 0.008, p < 0.001), exhibiting distinct attention specialization patterns with entropy divergence ranging from 0.015 to 0.166 across denoising timesteps. We identify eight functionally distinct attention mechanisms showing specialized computational roles: edge detection (entropy = 3.18 ± 0.12), texture analysis (entropy = 4.16 ± 0.08), and semantic understanding (entropy = 2.67 ± 0.15). Intervention analysis demonstrates critical computational bottlenecks where targeted ablations produce 25.6% to 128.3% performance degradation, providing causal evidence for identified circuit functions. These findings establish quantitative foundations for algorithmic understanding and control of generative model behavior through mechanistic intervention strategies.
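The head-specialization statistic used throughout, the mean Shannon entropy of attention maps, can be computed as in this short sketch:

```python
import torch

def attention_entropy(attn):
    """Mean Shannon entropy (nats) over the rows of attention maps.
    attn: (heads, queries, keys), rows already softmax-normalized."""
    p = attn.clamp_min(1e-12)
    return (-(p * p.log()).sum(dim=-1)).mean().item()

attn = torch.softmax(torch.randn(8, 64, 64), dim=-1)
print(attention_entropy(attn))  # ln(64) ~ 4.16 is the uniform upper bound
```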
zh
[CV-201] PCaM: A Progressive Focus Attention-Based Information Fusion Method for Improving Vision Transformer Domain Adaptation
【Quick Read】: This paper addresses foreground object mismatch in unsupervised domain adaptation (UDA): discrepancies in foreground object size and spatial distribution across domains weaken attention consistency and hamper alignment. The key is the Progressive Focus Cross-Attention Mechanism (PCaM), which progressively filters out background information during cross-attention so the model focuses on and fuses discriminative foreground semantics across domains, together with an attentional guidance loss that explicitly strengthens cross-domain attention consistency.
Link: https://arxiv.org/abs/2506.17232
Authors: Zelin Zang,Fei Wang,Liangyu Li,Jinlin Wu,Chunshui Zhao,Zhen Lei,Baigui Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent UDA methods based on Vision Transformers (ViTs) have achieved strong performance through attention-based feature alignment. However, we identify a key limitation: foreground object mismatch, where the discrepancy in foreground object size and spatial distribution across domains weakens attention consistency and hampers effective domain alignment. To address this issue, we propose the Progressive Focus Cross-Attention Mechanism (PCaM), which progressively filters out background information during cross-attention, allowing the model to focus on and fuse discriminative foreground semantics across domains. We further introduce an attentional guidance loss that explicitly directs attention toward task-relevant regions, enhancing cross-domain attention consistency. PCaM is lightweight, architecture-agnostic, and easy to integrate into existing ViT-based UDA pipelines. Extensive experiments on Office-Home, DomainNet, VisDA-2017, and remote sensing datasets demonstrate that PCaM significantly improves adaptation performance and achieves new state-of-the-art results, validating the effectiveness of attention-guided foreground fusion for domain adaptation.
zh
[CV-202] Temporal Neural Cellular Automata: Application to modeling of contrast enhancement in breast MRI MICCAI2025
【速读】:该论文旨在解决乳腺磁共振成像(MRI)中因长时间采集和高成本导致的广泛应用受限问题,通过合成对比增强技术实现快速图像获取并避免静脉注射对比剂。其解决方案的关键在于引入TeNCA(Temporal Neural Cellular Automata),该方法扩展并优化了神经细胞自动机(NCA),以有效建模时间稀疏且非均匀采样的影像数据,通过自适应损失计算和模拟物理时间演进的迭代机制,使模型能够学习具有生理合理性的对比增强演化过程。
链接: https://arxiv.org/abs/2506.18720
作者: Daniel M. Lang,Richard Osuala,Veronika Spieker,Karim Lekadir,Rickmer Braren,Julia A. Schnabel
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025
Abstract:Synthetic contrast enhancement offers fast image acquisition and eliminates the need for intravenous injection of contrast agent. This is particularly beneficial for breast imaging, where long acquisition times and high cost are significantly limiting the applicability of magnetic resonance imaging (MRI) as a widespread screening modality. Recent studies have demonstrated the feasibility of synthetic contrast generation. However, current state-of-the-art (SOTA) methods lack sufficient measures for consistent temporal evolution. Neural cellular automata (NCA) offer a robust and lightweight architecture to model evolving patterns between neighboring cells or pixels. In this work we introduce TeNCA (Temporal Neural Cellular Automata), which extends and further refines NCAs to effectively model temporally sparse, non-uniformly sampled imaging data. To achieve this, we advance the training strategy by enabling adaptive loss computation and define the iterative nature of the method to resemble a physical progression in time. This conditions the model to learn a physiologically plausible evolution of contrast enhancement. We rigorously train and test TeNCA on a diverse breast MRI dataset and demonstrate its effectiveness, surpassing the performance of existing methods in generation of images that align with ground truth post-contrast sequences.
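神经细胞自动机的核心是"感知卷积 + 局部更新 + 随机细胞掩码"的迭代规则,TeNCA 在其上增加了自适应损失与时间条件机制。以下是经典 NCA 单步更新的最小 PyTorch 草图,通道数、隐藏维度等均为假设值,仅用于说明这一迭代规则如何近似一段"物理时间"上的演化。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCAStep(nn.Module):
    """经典神经细胞自动机的单步更新(示意,非 TeNCA 官方实现)。"""
    def __init__(self, channels: int = 16, hidden: int = 128):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8
        ident = torch.zeros(3, 3)
        ident[1, 1] = 1.0
        # 感知核:恒等 + Sobel_x + Sobel_y,对每个通道做深度卷积
        kernels = torch.stack([ident, sobel_x, sobel_x.t()])        # (3, 3, 3)
        kernels = kernels.repeat(channels, 1, 1).unsqueeze(1)       # (3C, 1, 3, 3)
        self.register_buffer("kernels", kernels)
        self.update = nn.Sequential(
            nn.Conv2d(3 * channels, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1),
        )
        self.channels = channels

    def forward(self, state: torch.Tensor, fire_rate: float = 0.5):
        # state: (B, C, H, W)
        perception = F.conv2d(state, self.kernels, padding=1,
                              groups=self.channels)                 # (B, 3C, H, W)
        delta = self.update(perception)
        # 随机细胞更新掩码,模拟异步的局部演化
        mask = (torch.rand_like(state[:, :1]) < fire_rate).float()
        return state + delta * mask

step = NCAStep()
x = torch.zeros(1, 16, 64, 64)
for _ in range(8):          # 迭代若干步,对应对比增强随时间的逐步演化
    x = step(x)
```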
zh
[CV-203] A Deep Convolutional Neural Network-Based Novel Class Balancing for Imbalance Data Segmentation
【速读】:该论文旨在解决视网膜眼底图像中血管分割的挑战性问题,特别是由于数据分布不平衡和血管厚度变化带来的困难。其解决方案的关键在于提出了一种基于深度学习的双级别类别平衡方案(BLCB-CNN),通过Level-I进行血管/非血管类别的平衡,以及通过Level-II实现厚血管与薄血管类别的平衡,从而提升分割的准确性。此外,采用全局对比度归一化(GCN)、限制性自适应直方图均衡化(CLAHE)和伽马校正等预处理方法,以增强图像的强度一致性及血管与背景之间的对比度,最终实现了在标准数据集上的优越性能。
链接: https://arxiv.org/abs/2506.18474
作者: Atifa Kalsoom,M.A. Iftikhar,Amjad Ali,Zubair Shah,Shidin Balakrishnan,Hazrat Ali
机构: COMSATS University Islamabad, Lahore Campus, Pakistan; Hamad Medical Corporation, Qatar; University of Stirling, UK
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This is preprint of the paper submitted to Scientific Reports journal
Abstract:Retinal fundus images provide valuable insights into the human eye’s interior structure and crucial features, such as blood vessels, optic disk, macula, and fovea. However, accurate segmentation of retinal blood vessels can be challenging due to imbalanced data distribution and varying vessel thickness. In this paper, we propose BLCB-CNN, a novel pipeline based on deep learning and bi-level class balancing scheme to achieve vessel segmentation in retinal fundus images. The BLCB-CNN scheme uses a Convolutional Neural Network (CNN) architecture and an empirical approach to balance the distribution of pixels across vessel and non-vessel classes and within thin and thick vessels. Level-I is used for vessel/non-vessel balancing and Level-II is used for thick/thin vessel balancing. Additionally, pre-processing of the input retinal fundus image is performed by Global Contrast Normalization (GCN), Contrast Limited Adaptive Histogram Equalization (CLAHE), and gamma corrections to increase intensity uniformity as well as to enhance the contrast between vessels and background pixels. The resulting balanced dataset is used for classification-based segmentation of the retinal vascular tree. We evaluate the proposed scheme on standard retinal fundus images and achieve superior performance measures, including an area under the ROC curve of 98.23%, Accuracy of 96.22%, Sensitivity of 81.57%, and Specificity of 97.65%. We also demonstrate the method’s efficacy through external cross-validation on STARE images, confirming its generalization ability.
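摘要提到的三种预处理(GCN、CLAHE、伽马校正)可以用 OpenCV/NumPy 简洁复现。下面是一个示意性草图,clipLimit、tileGridSize、gamma 等参数为常见经验值而非论文给定设置,文件名亦为假设。

```python
import cv2
import numpy as np

def preprocess_fundus(gray: np.ndarray, gamma: float = 1.2) -> np.ndarray:
    """对单通道视网膜眼底灰度图依次做 GCN、CLAHE 与伽马校正(示意)。"""
    # 1) 全局对比度归一化:减均值、除以标准差
    x = gray.astype(np.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)
    # 拉回 [0, 255] 以便后续基于直方图的操作
    x = cv2.normalize(x, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # 2) 限制对比度自适应直方图均衡化
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    x = clahe.apply(x)
    # 3) 伽马校正:增强血管与背景像素的对比
    table = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    return cv2.LUT(x, table)

# 用法示例:取绿色通道(血管对比度通常最高),文件名为假设
img = cv2.imread("fundus.png")
processed = preprocess_fundus(img[:, :, 1])
```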
zh
[CV-204] Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review
【速读】:该论文试图解决将通用视觉-语言模型(VLM)适配到医学图像分析任务中的问题,其核心挑战包括领域差异大、病理变化复杂以及不同任务的多样性和独特性。解决方案的关键在于系统总结和分析适用于医学领域的VLM学习策略,如预训练、微调和提示学习,并探讨五种主要的VLM适配策略在十一项医学影像任务中的实际应用,以推动其在临床实践中的创新、稳健和安全应用。
链接: https://arxiv.org/abs/2506.18378
作者: Haoneng Lin,Cheng Xu,Jing Qin
机构: Hong Kong Polytechnic University (香港理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages
Abstract:Modern Vision-Language Models (VLMs) exhibit unprecedented capabilities in cross-modal semantic understanding between visual and textual modalities. Given the intrinsic need for multi-modal integration in clinical applications, VLMs have emerged as a promising solution for a wide range of medical image analysis tasks. However, adapting general-purpose VLMs to medical domain poses numerous challenges, such as large domain gaps, complicated pathological variations, and diversity and uniqueness of different tasks. The central purpose of this review is to systematically summarize recent advances in adapting VLMs for medical image analysis, analyzing current challenges, and recommending promising yet urgent directions for further investigations. We begin by introducing core learning strategies for medical VLMs, including pretraining, fine-tuning, and prompt learning. We then categorize five major VLM adaptation strategies for medical image analysis. These strategies are further analyzed across eleven medical imaging tasks to illustrate their current practical implementations. Furthermore, we analyze key challenges that impede the effective adaptation of VLMs to clinical applications and discuss potential directions for future research. We also provide an open-access repository of related literature to facilitate further research, available at this https URL. It is anticipated that this article can help researchers who are interested in harnessing VLMs in medical image analysis tasks have a better understanding on their capabilities and limitations, as well as current technical barriers, to promote their innovative, robust, and safe application in clinical practice.
zh
[CV-205] Transforming H&E images into IHC: A Variance-Penalized GAN for Precision Oncology
【速读】:该论文旨在解决HER2阳性乳腺癌精准诊断中传统免疫组化(IHC)方法成本高、劳动强度大且依赖抗体选择的问题,同时克服常规苏木精-伊红(HE)染色缺乏HER2特异性的局限。其解决方案的关键在于提出一种基于深度学习的图像翻译框架,通过改进金字塔pix2pix模型的损失函数以缓解生成对抗网络(GAN)中的模式崩溃问题,并引入基于方差的惩罚项以增强生成图像的结构多样性,从而实现从HE图像到高保真IHC图像的准确转换,特别是在HER2阳性(IHC 3+)图像的翻译上表现出显著优势。
链接: https://arxiv.org/abs/2506.18371
作者: Sara Rehmat,Hafeez Ur Rehman
机构: National University of Computer and Emerging Sciences(国家计算机与新兴科学大学); Oryx Universal College with Liverpool John Moores University(奥瑞克斯综合大学与利物浦约翰摩尔斯大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The overexpression of the human epidermal growth factor receptor 2 (HER2) in breast cells is a key driver of HER2-positive breast cancer, a highly aggressive subtype requiring precise diagnosis and targeted therapy. Immunohistochemistry (IHC) is the standard technique for HER2 assessment but is costly, labor-intensive, and highly dependent on antibody selection. In contrast, hematoxylin and eosin (H&E) staining, a routine histopathological procedure, offers broader accessibility but lacks HER2 specificity. This study proposes an advanced deep learning-based image translation framework to generate high-fidelity IHC images from H&E-stained tissue samples, enabling cost-effective and scalable HER2 assessment. By modifying the loss function of pyramid pix2pix, we mitigate mode collapse, a fundamental limitation in generative adversarial networks (GANs), and introduce a novel variance-based penalty that enforces structural diversity in generated images. Our model particularly excels in translating HER2-positive (IHC 3+) images, which have remained challenging for existing methods due to their complex morphological variations. Extensive evaluations on the BCI histopathological dataset demonstrate that our model surpasses state-of-the-art methods in terms of peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and Fréchet Inception Distance (FID), particularly in accurately translating HER2-positive (IHC 3+) images. Beyond medical imaging, our model exhibits superior performance in general image-to-image translation tasks, showcasing its potential across multiple domains. This work marks a significant step toward AI-driven precision oncology, offering a reliable and efficient alternative to traditional HER2 diagnostics.
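论文在金字塔 pix2pix 的损失上引入了基于方差的惩罚项以缓解模式崩溃。论文未给出公式细节,下面按这一思路给出一个 PyTorch 示意草图:以批内生成图像的逐像素方差作为多样性度量,方差越小惩罚越大;各损失权重(l1_weight、lambda_var)均为假设值。

```python
import torch

def variance_penalty(fake: torch.Tensor) -> torch.Tensor:
    """批内方差惩罚(示意):鼓励同一批生成图像保持结构多样性。

    fake: 形状 (B, C, H, W) 的生成图像。
    返回: 标量,跨样本逐像素方差的负均值,方差越小损失越大。
    """
    per_pixel_var = fake.var(dim=0)       # 跨样本维度的方差 (C, H, W)
    return -per_pixel_var.mean()

def generator_loss(fake, real, disc_out, l1_weight=100.0, lambda_var=1.0):
    """pix2pix 风格的生成器损失 + 方差惩罚(权重均为假设)。"""
    adv = torch.nn.functional.binary_cross_entropy_with_logits(
        disc_out, torch.ones_like(disc_out))      # 对抗项
    l1 = torch.nn.functional.l1_loss(fake, real)  # 重建项
    return adv + l1_weight * l1 + lambda_var * variance_penalty(fake)
```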
zh
[CV-206] Multimodal Medical Image Binding via Shared Text Embeddings
【速读】:该论文旨在解决医学图像分析中多模态数据对齐的问题,即如何在无需显式配对数据的情况下实现多种医学影像模态(如X射线、CT、视网膜图像、心电图和病理图像)之间的特征对齐,以提升诊断和治疗规划的准确性。解决方案的关键在于提出一种名为M³Bind的预训练框架,该框架通过共享文本表示空间实现多模态图像的无缝对齐,其核心是首先微调预训练的类似CLIP的图像-文本模型以对齐模态特定的文本嵌入空间,随后将这些模态特定的文本编码器知识蒸馏到统一模型中,从而构建一个共享的文本嵌入空间。
链接: https://arxiv.org/abs/2506.18072
作者: Yunhao Liu,Suyang Xi,Shiqi Liu,Hong Ding,Chicheng Jin,Chenxi Yang,Junjun He,Yiqing Shen
机构: The Hong Kong Polytechnic University (香港理工大学); Emory University (埃默里大学); The University of Hong Kong (香港大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of Science and Technology of China (中国科学技术大学); University of Electronic Science and Technology of China (电子科技大学); Shanghai AI Laboratory (上海人工智能实验室); Johns Hopkins University (约翰霍普金斯大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures
Abstract:Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variant have enabled image-text alignments, they require explicitly paired data between arbitrary two modalities, which is difficult to acquire in medical contexts. To address the gap, we present Multimodal Medical Image Binding with Text (M³Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M³Bind first fine-tunes pre-trained CLIP-like image-text models to align their modality-specific text embedding space while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Experiments on X-ray, CT, retina, ECG, and pathological images on multiple downstream tasks demonstrate that M³Bind achieves state-of-the-art performance in zero-shot, few-shot classification and cross-modal retrieval tasks compared to its CLIP-like counterparts. These results validate M³Bind’s effectiveness in achieving cross-image-modal alignment for medical analysis.
zh
[CV-207] LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical Images MICCAI2025
【速读】:该论文旨在解决现有无损医学图像压缩方法中因图像分割导致的潜在变量信息分布不均、后验崩溃以及潜在变量利用效率低的问题。其解决方案的关键在于提出一种基于预测的端到端无损医学图像压缩方法LVPNet,通过引入全局潜在变量来预测像素值,并编码预测概率以实现高效压缩;同时,设计了全局多尺度感知模块(Global Multi-scale Sensing Module, GMSM)以提取紧凑且具有信息量的潜在表示,以及量化补偿模块(Quantization Compensation Module, QCM)以减少量化过程中的信息损失。
链接: https://arxiv.org/abs/2506.17983
作者: Chenyue Song,Chen Hui,Qing Lin,Wei Zhang,Siqiao Li,Shengping Zhang,Haiqi Zhu,Zhixuan Li,Shaohui Liu,Feng Jiang,Xiang Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025
Abstract:Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at this https URL.
zh
[CV-208] DRO-Augment Framework: Robustness by Synergizing Wasserstein Distributionally Robust Optimization and Data Augmentation
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在面对数据损坏和对抗攻击时的鲁棒性不足问题,尤其是在图像分类任务中。其解决方案的关键在于提出一种名为DRO-Augment的新框架,该框架将Wasserstein分布鲁棒优化(W-DRO)与多种数据增强策略相结合,从而在广泛的数据损坏场景下显著提升模型的鲁棒性,同时保持在干净数据集上的准确率。
链接: https://arxiv.org/abs/2506.17874
作者: Jiaming Hu,Debarghya Mukherjee,Ioannis Ch. Paschalidis
机构: Boston University (波士顿大学)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 26 pages,3 figures
Abstract:In many real-world applications, ensuring the robustness and stability of deep neural networks (DNNs) is crucial, particularly for image classification tasks that encounter various input perturbations. While data augmentation techniques have been widely adopted to enhance the resilience of a trained model against such perturbations, there remains significant room for improvement in robustness against corrupted data and adversarial attacks simultaneously. To address this challenge, we introduce DRO-Augment, a novel framework that integrates Wasserstein Distributionally Robust Optimization (W-DRO) with various data augmentation strategies to improve the robustness of the models significantly across a broad spectrum of corruptions. Our method outperforms existing augmentation methods under severe data perturbations and adversarial attack scenarios while maintaining the accuracy on the clean datasets on a range of benchmark datasets, including but not limited to CIFAR-10-C, CIFAR-100-C, MNIST, and Fashion-MNIST. On the theoretical side, we establish novel generalization error bounds for neural networks trained using a computationally efficient, variation-regularized loss function closely related to the W-DRO problem.
zh
[CV-209] Pix2Geomodel: A Next-Generation Reservoir Geomodeling with Property-to-Property Translation
【速读】:该论文旨在解决传统地质建模方法在处理复杂地下非均质性及对观测数据的条件约束方面存在的不足。其关键解决方案是提出了一种基于Pix2Pix的条件生成对抗网络(cGAN)框架Pix2Geomodel,该框架能够从Rotliegend气田的地质数据中预测储层属性(岩相、孔隙度、渗透率和含水饱和度),通过使用760万单元的数据集进行数据预处理、增强以及利用U-Net生成器和PatchGAN判别器进行训练,实现了高精度的属性预测与属性间转换。
链接: https://arxiv.org/abs/2506.17747
作者: Abdulrahman Al-Fakih,Ardiansyah Koeshidayatullah,Nabil A. Saraih,Tapan Mukerji,Rayan Kanfar,Abdulmohsen Alali,SanLinn I. Kaka
机构: King Fahd University of Petroleum & Minerals (法赫德国王石油与矿产大学); Stanford University (斯坦福大学); Saudi Aramco (沙特阿美); EXPEC Advanced Research Center (EXPEC高级研究中心)
类目: Geophysics (physics.geo-ph); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 34 pages, 13 figures
Abstract:Accurate geological modeling is critical for reservoir characterization, yet traditional methods struggle with complex subsurface heterogeneity, and they have problems with conditioning to observed data. This study introduces Pix2Geomodel, a novel conditional generative adversarial network (cGAN) framework based on Pix2Pix, designed to predict reservoir properties (facies, porosity, permeability, and water saturation) from the Rotliegend reservoir of the Groningen gas field. Utilizing a 7.6 million-cell dataset from the Nederlandse Aardolie Maatschappij, accessed via EPOS-NL, the methodology included data preprocessing, augmentation to generate 2,350 images per property, and training with a U-Net generator and PatchGAN discriminator over 19,000 steps. Evaluation metrics include pixel accuracy (PA), mean intersection over union (mIoU), frequency weighted intersection over union (FWIoU), and visualizations assessed performance in masked property prediction and property-to-property translation tasks. Results demonstrated high accuracy for facies (PA 0.88, FWIoU 0.85) and water saturation (PA 0.96, FWIoU 0.95), with moderate success for porosity (PA 0.70, FWIoU 0.55) and permeability (PA 0.74, FWIoU 0.60), and robust translation performance (e.g., facies-to-facies PA 0.98, FWIoU 0.97). The framework captured spatial variability and geological realism, as validated by variogram analysis, and calculated the training loss curves for the generator and discriminator for each property. Compared to traditional methods, Pix2Geomodel offers enhanced fidelity in direct property mapping. Limitations include challenges with microstructural variability and 2D constraints, suggesting future integration of multi-modal data and 3D modeling (Pix2Geomodel v2.0). This study advances the application of generative AI in geoscience, supporting improved reservoir management and open science initiatives.
zh
[CV-210] MTSIC: Multi-stage Transformer-based GAN for Spectral Infrared Image Colorization
【速读】:该论文旨在解决热红外(Thermal Infrared, TIR)图像因缺乏颜色和纹理信息而导致的下游任务受限及视觉疲劳问题。现有彩色化方法依赖于单波段图像,其光谱信息有限,特征提取能力不足,常导致图像失真和语义模糊。该论文提出的解决方案关键在于设计一种基于生成对抗网络(Generative Adversarial Network, GAN)的框架,利用多波段红外图像的丰富光谱数据,通过多阶段光谱自注意力Transformer网络(MTSIC)实现更精确的语义特征映射与图像重建,从而提升红外图像的视觉质量和语义准确性。
链接: https://arxiv.org/abs/2506.17540
作者: Tingting Liu,Yuan Liu,Jinhui Tang,Liyin Yuan,Chengyu Liu,Chunlai Li,Xiubao Sui,Qian Chen
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Thermal infrared (TIR) images, acquired through thermal radiation imaging, are unaffected by variations in lighting conditions and atmospheric haze. However, TIR images inherently lack color and texture information, limiting downstream tasks and potentially causing visual fatigue. Existing colorization methods primarily rely on single-band images with limited spectral information and insufficient feature extraction capabilities, which often result in image distortion and semantic ambiguity. In contrast, multiband infrared imagery provides richer spectral data, facilitating the preservation of finer details and enhancing semantic accuracy. In this paper, we propose a generative adversarial network (GAN)-based framework designed to integrate spectral information to enhance the colorization of infrared images. The framework employs a multi-stage spectral self-attention Transformer network (MTSIC) as the generator. Each spectral feature is treated as a token for self-attention computation, and a multi-head self-attention mechanism forms a spatial-spectral attention residual block (SARB), achieving multi-band feature mapping and reducing semantic confusion. Multiple SARB units are integrated into a Transformer-based single-stage network (STformer), which uses a U-shaped architecture to extract contextual information, combined with multi-scale wavelet blocks (MSWB) to align semantic information in the spatial-frequency dual domain. Multiple STformer modules are cascaded to form MTSIC, progressively optimizing the reconstruction quality. Experimental results demonstrate that the proposed method significantly outperforms traditional techniques and effectively enhances the visual quality of infrared images.
zh
[CV-211] DSA-NRP: No-Reflow Prediction from Angiographic Perfusion Dynamics in Stroke EVT
【速读】:该论文旨在解决急性缺血性卒中(AIS)患者在成功进行血管内血栓切除术(EVT)后出现的无再流(no-reflow)问题,该并发症导致微血管灌注不足,影响组织恢复并恶化临床预后。传统临床实践中依赖于术后24小时内灌注磁共振成像(MRI)进行识别,存在时间延迟。论文提出的解决方案的关键在于构建首个基于机器学习(ML)的框架,通过分析术中数字减影血管造影(DSA)序列和临床变量,实现EVT后立即预测无再流。该方法利用DSA图像中的统计和时间灌注特征,显著优于仅依赖临床特征的基线模型,证明了实时DSA灌注动态在评估微血管完整性中的关键价值。
链接: https://arxiv.org/abs/2506.17501
作者: Shreeram Athreya,Carlos Olivares,Ameera Ismail,Kambiz Nael,William Speier,Corey Arnold
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures
Abstract:Following successful large-vessel recanalization via endovascular thrombectomy (EVT) for acute ischemic stroke (AIS), some patients experience a complication known as no-reflow, defined by persistent microvascular hypoperfusion that undermines tissue recovery and worsens clinical outcomes. Although prompt identification is crucial, standard clinical practice relies on perfusion magnetic resonance imaging (MRI) within 24 hours post-procedure, delaying intervention. In this work, we introduce the first-ever machine learning (ML) framework to predict no-reflow immediately after EVT by leveraging previously unexplored intra-procedural digital subtraction angiography (DSA) sequences and clinical variables. Our retrospective analysis included AIS patients treated at UCLA Medical Center (2011-2024) who achieved favorable mTICI scores (2b-3) and underwent pre- and post-procedure MRI. No-reflow was defined as persistent hypoperfusion (Tmax > 6 s) on post-procedural imaging. From DSA sequences (AP and lateral views), we extracted statistical and temporal perfusion features from the target downstream territory to train ML classifiers for predicting no-reflow. Our novel method significantly outperformed a clinical-features baseline (AUC: 0.7703 ± 0.12 vs. 0.5728 ± 0.12; accuracy: 0.8125 ± 0.10 vs. 0.6331 ± 0.09), demonstrating that real-time DSA perfusion dynamics encode critical insights into microvascular integrity. This approach establishes a foundation for immediate, accurate no-reflow prediction, enabling clinicians to proactively manage high-risk patients without reliance on delayed imaging.
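摘要描述的流程是"从 DSA 下游目标区域提取统计与时间灌注特征,再训练经典机器学习分类器"。以下为一个高度简化的 scikit-learn 草图:以每帧 ROI 平均灰度构成的时间-强度曲线为输入,提取若干常见灌注特征后训练随机森林;特征定义与随机数据均为示意性假设,并非论文所用特征集。

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def perfusion_features(curve: np.ndarray) -> np.ndarray:
    """从单条时间-强度曲线提取简单的统计/时间特征(示意)。"""
    return np.array([
        curve.mean(), curve.std(), curve.max(),
        int(curve.argmax()),          # 达峰时间(帧序号)
        curve.sum(),                  # 曲线下面积的离散近似
        np.diff(curve).max(),         # 最大流入斜率
    ])

# 占位数据:每位患者一条长度为 40 帧的 ROI 平均灰度曲线
rng = np.random.default_rng(42)
curves = rng.random((120, 40))
labels = rng.integers(0, 2, size=120)     # 1 = no-reflow(示意标签)

X = np.stack([perfusion_features(c) for c in curves])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, labels, cv=5, scoring="roc_auc").mean())
```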
zh
[CV-212] Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights
【速读】:该论文试图解决如何在医疗影像任务中有效利用通用视觉语言模型(VLM)以替代专门针对医疗领域预训练的模型的问题,核心问题是评估经过微调的通用VLM是否能够在特定医疗影像任务中与专业医疗VLM相媲美。解决方案的关键在于通过轻量级微调(如LoRA方法)提升通用VLM在域内(ID)任务中的性能,并验证其在域外(OOD)任务中的泛化能力,结果表明通用VLM在经过适当微调后能够达到甚至超越专业医疗VLM的性能。
链接: https://arxiv.org/abs/2506.17337
作者: Yuan Zhong,Ruinan Jin,Xiaoxiao Li,Qi Dou
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical vision-language models (VLMs) leverage large-scale pretraining for diverse imaging tasks but require substantial computational and data resources. Meanwhile, common or general-purpose VLMs (e.g., CLIP, LLaVA), though not trained for medical use, show promise with fine-tuning. This raises a key question: Can efficient fine-tuned common VLMs rival generalist medical VLMs for solving specific medical imaging tasks? This study systematically evaluates common and medical VLMs across disease diagnosis and visual question answering (VQA). Using CLIP-based and LLaVA-based models, we examine (1) off-the-shelf performance gaps in in-domain (ID) settings, (2) whether fine-tuning bridges these gaps, and (3) generalization to out-of-domain (OOD) tasks on unseen medical modalities. While medical-specific pretraining provides advantages in ID settings, common VLMs match or surpass medical-specific models after lightweight fine-tuning, with LoRA-based adaptation proving highly effective among different tasks. In OOD tasks, common VLMs demonstrate strong adaptability in some tasks, challenging the assumption that medical-specific pre-training is essential. These findings suggest that leveraging common VLMs with fine-tuning offers a scalable and cost-effective alternative to developing large-scale medical VLMs, providing crucial insights for future research in the medical imaging field.
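摘要的结论之一是:LoRA 这类轻量微调即可让通用 VLM 追平医疗专用模型。以下以 Hugging Face transformers + peft 给出一个示意草图,在 CLIP 注意力的 q/v 投影上注入低秩适配器;秩、缩放因子与目标模块名为常见选择,实际设置以论文为准。

```python
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# 仅在注意力的 q/v 投影上注入低秩适配器,其余权重保持冻结
lora_cfg = LoraConfig(
    r=8,                # 低秩维度(假设值)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # 可训练参数通常只占总量的一小部分
```

之后按常规的分类或对比学习目标训练即可,梯度只会更新注入的低秩矩阵。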
zh
人工智能
[AI-0] MinD: Unified Visual Imagination and Control via Hierarchical World Models
【速读】:该论文旨在解决视频生成模型(Video Generation Models, VGMs)在机器人领域应用中的两个核心问题:一是生成速度缓慢,限制了实时交互;二是生成视频与可执行动作之间的一致性较差。其解决方案的关键在于提出一种分层扩散基础的世界模型框架——Manipulate in Dream (MinD),该框架采用双系统设计实现视觉-语言操作,通过低频执行VGM提取视频预测特征,并利用高频扩散策略实现实时交互,从而实现低延迟、闭环控制的操纵任务。此外,引入视频-动作扩散匹配模块(DiffMatcher)及创新的协同训练策略,提升了系统的协调性与任务理解能力。
链接: https://arxiv.org/abs/2506.18897
作者: Xiaowei Chi,Kuangzhi Ge,Jiaming Liu,Siyuan Zhou,Peidong Jia,Zichen He,Yuzhen Liu,Tingguang Li,Lei Han,Sirui Han,Shanghang Zhang,Yike Guo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slow generation speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate in Dream (MinD), a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes VGM at low frequencies to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction. This architecture enables low-latency, closed-loop control in manipulation with coherent visual guidance. To better coordinate the two systems, we introduce a video-action diffusion matching module (DiffMatcher), with a novel co-training strategy that uses separate schedulers for each diffusion model. Specifically, we introduce a diffusion-forcing mechanism to DiffMatcher that aligns their intermediate representations during training, helping the fast action model better understand video-based predictions. Beyond manipulation, MinD also functions as a world simulator, reliably predicting task success or failure in latent space before execution. Trustworthy analysis further shows that VGMs can preemptively evaluate task feasibility and mitigate risks. Extensive experiments across multiple benchmarks demonstrate that MinD achieves state-of-the-art manipulation (63%+) in RL-Bench, advancing the frontier of unified world modeling in robotics.
zh
[AI-1] Steering Conceptual Bias via Transformer Latent-Subspace Activation
【速读】:该论文试图解决如何通过激活语言模型(Language Models, LLMs)中的潜在子空间,引导科学代码生成向特定编程语言偏移的问题。其关键解决方案是提出一种梯度优化的自适应激活引导框架(Gradient-refined Adaptive Activation Steering, G-ACT),该框架通过聚类每提示的激活差异以生成少量引导方向,并在每层上训练轻量级探测器在线选择适当的引导向量,从而实现对生成语言的有效控制。
链接: https://arxiv.org/abs/2506.18887
作者: Vansh Sharma,Venkat Raman
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
Abstract:This work examines whether activating latent subspaces in language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, perturbing the highest activated MLP weight for a C++ or CPP token, proved brittle and exhibited limited generalization across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation towards the CPP language by increasing the average probe classification accuracy by 15% and the early layers (0-6) improving the probe classification accuracy by 61.5% compared to the standard ACT framework. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Although per-layer probing introduces a modest inference overhead, it remains practical by steering only a subset of layers and enables reproducible model behavior. These results demonstrate a scalable, interpretable and efficient mechanism for concept-level control for practical agentic systems.
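G-ACT 的基本操作是把聚类得到的"引导方向"加到选定层的激活上。以下用 PyTorch 前向钩子给出一个与具体模型无关的示意:真实场景中 direction 应来自各 prompt 激活差的聚类,层的选择由在线探测器决定,这里均以随机向量和占位层代替。

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """返回一个前向钩子:把归一化的引导方向按强度 alpha 加到层输出上。"""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)  # 广播到 (B, T, d)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# 用法示意:d_model 为隐藏维度;真实场景中 layer 是模型的某个 Transformer 块
d_model = 4096
steer = torch.randn(d_model)                 # 占位:实际来自激活差聚类出的方向
layer = torch.nn.Linear(d_model, d_model)    # 占位层,仅演示钩子机制
handle = layer.register_forward_hook(make_steering_hook(steer))
_ = layer(torch.randn(1, 8, d_model))        # 前向时激活被引导
handle.remove()                               # 用完移除钩子,恢复原始行为
```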
zh
[AI-2] Understanding Software Engineering Agents : A Study of Thought-Action-Result Trajectories
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的智能体在软件工程任务中内部决策过程不透明的问题,从而提升对这些智能体操作动态和失败模式的理解。其解决方案的关键在于对三个最先进的LLM-based agents——RepairAgent、AutoCodeRover和OpenHands的思维-行动-结果轨迹进行大规模实证研究,通过统一交互日志格式,分析结构特性、动作模式、令牌使用情况以及推理连贯性和反馈整合等多维度数据,揭示成功与失败执行中的行为特征和反模式,为改进智能体设计提供可操作的见解。
链接: https://arxiv.org/abs/2506.18824
作者: Islem Bouzenia,Michael Pradel
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based agents are increasingly employed to automate complex software engineering tasks such as program repair and issue resolution. These agents operate by autonomously generating natural language thoughts, invoking external tools, and iteratively refining their solutions. Despite their widespread adoption, the internal decision-making processes of these agents remain largely unexplored, limiting our understanding of their operational dynamics and failure modes. In this paper, we present a large-scale empirical study of the thought-action-result trajectories of three state-of-the-art LLM-based agents: RepairAgent, AutoCodeRover, and OpenHands. We unify their interaction logs into a common format, capturing 120 trajectories and 2822 LLM interactions focused on program repair and issue resolution. Our study combines quantitative analyses of structural properties, action patterns, and token usage with qualitative assessments of reasoning coherence and feedback integration. We identify key trajectory characteristics such as iteration counts and token consumption, recurring action sequences, and the semantic coherence linking thoughts, actions, and their results. Our findings reveal behavioral motifs and anti-patterns that distinguish successful from failed executions, providing actionable insights for improving agent design, including prompting strategies, failure diagnosis, and anti-pattern detection. We release our dataset and annotation framework to support further research on transparent and robust autonomous software engineering agents.
zh
[AI-3] Shift Happens: Mixture of Experts based Continual Adaptation in Federated Learning
【速读】:该论文旨在解决流式联邦学习(Federated Learning, FL)环境中协变量和标签分布偏移(covariate and label shifts)带来的模型性能下降问题,这类非平稳数据分布会显著影响模型的泛化能力。其解决方案的关键在于提出ShiftEx框架,该框架通过最大均值差异(Maximum Mean Discrepancy)检测分布偏移,并动态构建和训练专门化的全局模型;同时引入潜在记忆机制实现专家模型的复用,并采用设施定位优化方法联合最小化协变量不匹配、专家创建成本和标签不平衡问题。
链接: https://arxiv.org/abs/2506.18789
作者: Rahul Atul Bhope,K.R. Jayaram,Praveen Venkateswaran,Nalini Venkatasubramanian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) enables collaborative model training across decentralized clients without sharing raw data, yet faces significant challenges in real-world settings where client data distributions evolve dynamically over time. This paper tackles the critical problem of covariate and label shifts in streaming FL environments, where non-stationary data distributions degrade model performance and require adaptive middleware solutions. We introduce ShiftEx, a shift-aware mixture of experts framework that dynamically creates and trains specialized global models in response to detected distribution shifts using Maximum Mean Discrepancy for covariate shifts. The framework employs a latent memory mechanism for expert reuse and implements facility location-based optimization to jointly minimize covariate mismatch, expert creation costs, and label imbalance. Through theoretical analysis and comprehensive experiments on benchmark datasets, we demonstrate 5.5-12.9 percentage point accuracy improvements and 22-95% faster adaptation compared to state-of-the-art FL baselines across diverse shift scenarios. The proposed approach offers a scalable, privacy-preserving middleware solution for FL systems operating in non-stationary, real-world conditions while minimizing communication and computational overhead.
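ShiftEx 用最大均值差异(MMD)检测协变量偏移。下面是带 RBF 核的 MMD² 的 NumPy 示意实现(有偏 V 统计量,带宽用中位数启发式),可用于比较两批客户端特征;偏移判定阈值需按场景标定,示例数据为随机占位。

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=None):
    """计算两组样本间基于 RBF 核的 MMD^2(有偏估计,示意)。

    x: (n, d),y: (m, d)。返回值越大,两个分布差异越大。
    """
    z = np.vstack([x, y])
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)   # 成对平方距离
    if bandwidth is None:                                  # 中位数启发式
        bandwidth = np.sqrt(np.median(d2[d2 > 0]) / 2)
    k = np.exp(-d2 / (2 * bandwidth ** 2))
    n = len(x)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

# 示意:特征分布发生偏移时 MMD^2 显著增大
rng = np.random.default_rng(0)
a = rng.normal(0, 1, (200, 16))
b = rng.normal(0.5, 1, (200, 16))             # 均值偏移的"新分布"
print(mmd2_rbf(a, a[::-1]), mmd2_rbf(a, b))   # 前者接近 0,后者明显更大
```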
zh
[AI-4] TRIZ Agents: A Multi-Agent LLM Approach for TRIZ-Based Innovation
【速读】:该论文试图解决TRIZ(The Theory of Inventive Problem Solving)在实际应用中因复杂性和跨学科知识需求而受到的限制问题。其解决方案的关键在于提出一种基于大型语言模型(LLM)的多智能体系统,称为TRIZ agents,每个智能体具备专业能力与工具访问权限,通过协作按照TRIZ方法论共同解决创新性问题。该系统利用不同领域专业知识的智能体高效地执行TRIZ步骤,从而提升解决复杂创新挑战的能力。
链接: https://arxiv.org/abs/2506.18783
作者: Kamil Szczepanik,Jarosław A. Chudziak
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 12 pages, 10 figures, 2 tables, Accepted at the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025). Final version published in Proceedings of ICAART 2025 (Vol. 1), pages 196-207
Abstract:TRIZ, the Theory of Inventive Problem Solving, is a structured, knowledge-based framework for innovation and abstracting problems to find inventive solutions. However, its application is often limited by the complexity and deep interdisciplinary knowledge required. Advancements in Large Language Models (LLMs) have revealed new possibilities for automating parts of this process. While previous studies have explored single LLMs in TRIZ applications, this paper introduces a multi-agent approach. We propose an LLM-based multi-agent system, called TRIZ agents, each with specialized capabilities and tool access, collaboratively solving inventive problems based on the TRIZ methodology. This multi-agent system leverages agents with various domain expertise to efficiently navigate TRIZ steps. The aim is to model and simulate an inventive process with language agents. We assess the effectiveness of this team of agents in addressing complex innovation challenges based on a selected case study in engineering. We demonstrate the potential of agent collaboration to produce diverse, inventive solutions. This research contributes to the future of AI-driven innovation, showcasing the advantages of decentralized problem-solving in complex ideation tasks.
zh
[AI-5] Sensitivity Analysis of Image Classification Models using Generalized Polynomial Chaos
【速读】:该论文试图解决图像分类模型在预测质量应用中因模型、数据和领域偏移带来的不确定性问题,这些问题导致模型输出过于自信。解决方案的关键在于通过随机变量建模输入的分布域偏移,并利用广义多项式混沌(GPC)计算Sobol指数来量化其对模型输出的影响。
链接: https://arxiv.org/abs/2506.18751
作者: Lukas Bahr,Lucas Poßner,Konstantin Weise,Sophie Gröger,Rüdiger Daub
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Integrating advanced communication protocols in production has accelerated the adoption of data-driven predictive quality methods, notably machine learning (ML) models. However, ML models in image classification often face significant uncertainties arising from model, data, and domain shifts. These uncertainties lead to overconfidence in the classification model’s output. To better understand these models, sensitivity analysis can help to analyze the relative influence of input parameters on the output. This work investigates the sensitivity of image classification models used for predictive quality. We propose modeling the distributional domain shifts of inputs with random variables and quantifying their impact on the model’s outputs using Sobol indices computed via generalized polynomial chaos (GPC). This approach is validated through a case study involving a welding defect classification problem, utilizing a fine-tuned ResNet18 model and an emblem classification model used in BMW Group production facilities.
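论文用广义多项式混沌(GPC)计算 Sobol 指数。完整的 GPC 实现较长,这里用 SALib 的蒙特卡洛方法(Saltelli 采样)作为替代方案演示 Sobol 指数的含义,两者估计的是同一组指标;扰动参数名、取值范围与被分析函数 f 均为占位假设,实际应替换为"输入扰动 → 分类模型输出"的映射。

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# 两个输入扰动参数(如亮度偏移、模糊强度),范围为假设值
problem = {
    "num_vars": 2,
    "names": ["brightness", "blur"],
    "bounds": [[-0.5, 0.5], [0.0, 2.0]],
}

def f(params):
    """占位函数:实际应返回该扰动下分类模型的输出(如某类别置信度)。"""
    b, s = params
    return np.sin(3 * b) + 0.3 * s ** 2

X = saltelli.sample(problem, 1024)          # Saltelli 采样
Y = np.apply_along_axis(f, 1, X)
Si = sobol.analyze(problem, Y)
print(Si["S1"])   # 一阶 Sobol 指数:各扰动单独贡献的输出方差占比
print(Si["ST"])   # 总效应指数:含交互作用
```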
zh
[AI-6] BRAVE: Brain-Controlled Prosthetic Arm with Voice Integration and Embodied Learning for Enhanced Mobility IJCNN2025
【速读】:该论文旨在解决非侵入式脑机接口(BCI)在控制假肢时面临的信号噪声、分类准确率低和实时适应性差的问题。其解决方案的关键在于提出BRAVE系统,该系统结合了基于集成学习的脑电(EEG)分类与人机协同(HITL)校正框架,通过融合LSTM、CNN和随机森林模型提升分类鲁棒性,并采用自动语音识别(ASR)实现多自由度模式切换,同时利用Lab Streaming Layer(LSL)保证实时数据同步,从而实现了高精度、低延迟的非侵入式假肢控制。
链接: https://arxiv.org/abs/2506.18749
作者: Abdul Basit,Maha Nawaz,Muhammad Shafique
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages, 12 figures, Accepted at IJCNN 2025
Abstract:Non-invasive brain-computer interfaces (BCIs) have the potential to enable intuitive control of prosthetic limbs for individuals with upper limb amputations. However, existing EEG-based control systems face challenges related to signal noise, classification accuracy, and real-time adaptability. In this work, we present BRAVE, a hybrid EEG and voice-controlled prosthetic system that integrates ensemble learning-based EEG classification with a human-in-the-loop (HITL) correction framework for enhanced responsiveness. Unlike traditional electromyography (EMG)-based prosthetic control, BRAVE aims to interpret EEG-driven motor intent, enabling movement control without reliance on residual muscle activity. To improve classification robustness, BRAVE combines LSTM, CNN, and Random Forest models in an ensemble framework, achieving a classification accuracy of 96% across test subjects. EEG signals are preprocessed using a bandpass filter (0.5-45 Hz), Independent Component Analysis (ICA) for artifact removal, and Common Spatial Pattern (CSP) feature extraction to minimize contamination from electromyographic (EMG) and electrooculographic (EOG) signals. Additionally, BRAVE incorporates automatic speech recognition (ASR) to facilitate intuitive mode switching between different degrees of freedom (DOF) in the prosthetic arm. The system operates in real time, with a response latency of 150 ms, leveraging Lab Streaming Layer (LSL) networking for synchronized data acquisition. The system is evaluated on an in-house fabricated prosthetic arm and on multiple participants highlighting the generalizability across users. The system is optimized for low-power embedded deployment, ensuring practical real-world application beyond high-performance computing environments. Our results indicate that BRAVE offers a promising step towards robust, real-time, non-invasive prosthetic control.
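BRAVE 的信号链路大致为"0.5-45 Hz 带通滤波 → ICA 去伪迹 → CSP 特征 → 集成分类"。以下草图用 SciPy 与 MNE 复现其中可公开复现的部分(带通 + CSP + 随机森林),ICA 与 LSTM/CNN 集成分支从略;采样率、通道数与数据均为占位假设。

```python
import numpy as np
from scipy.signal import butter, filtfilt
from mne.decoding import CSP
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

fs = 250                                  # 采样率(假设值)
b, a = butter(4, [0.5, 45], btype="bandpass", fs=fs)

# 占位数据:200 个试次、8 通道、2 秒 EEG;真实数据需先做 ICA 去伪迹
rng = np.random.default_rng(0)
epochs = rng.standard_normal((200, 8, 2 * fs))
labels = rng.integers(0, 2, size=200)     # 两类运动意图(示意)

epochs = filtfilt(b, a, epochs, axis=-1)  # 逐试次带通滤波

clf = make_pipeline(
    CSP(n_components=4, log=True),        # 共空间模式特征
    RandomForestClassifier(n_estimators=200, random_state=0),
)
print(cross_val_score(clf, epochs, labels, cv=5).mean())
```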
zh
[AI-7] ContinualFlow: Learning and Unlearning with Neural Flow Matching ICML25 ICML2025
【速读】:该论文试图解决生成式模型中目标化遗忘(targeted unlearning)的问题,即在不重新训练模型或不直接访问需遗忘样本的情况下,从数据分布中软性移除特定区域。解决方案的关键在于提出ContinualFlow框架,该框架通过基于能量的重加权损失函数,利用能量代理引导遗忘过程,从而实现对目标分布的软质量减去,而无需显式访问被遗忘的样本。
链接: https://arxiv.org/abs/2506.18747
作者: Lorenzo Simone,Davide Bacciu,Shuangge Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the ICML 2025 Workshop on Machine Unlearning for Generative AI (MUGen @ ICML25, Vancouver, July 2025)
Abstract:We introduce ContinualFlow, a principled framework for targeted unlearning in generative models via Flow Matching. Our method leverages an energy-based reweighting loss to softly subtract undesired regions of the data distribution without retraining from scratch or requiring direct access to the samples to be unlearned. Instead, it relies on energy-based proxies to guide the unlearning process. We prove that this induces gradients equivalent to Flow Matching toward a soft mass-subtracted target, and validate the framework through experiments on 2D and image domains, supported by interpretable visualizations and quantitative evaluations.
zh
[AI-8] On the Existence of Universal Simulators of Attention
【速读】:该论文试图解决Transformer架构是否能够精确模拟任意注意力机制的问题,特别是其底层操作。解决方案的关键在于构建一个由Transformer编码器组成的通用模拟器U,并通过RASP(一种用于Transformer计算的正式框架)提出算法方案,以完全复制注意力输出及相关的基础矩阵和激活操作。该研究首次证明了存在一种算法可实现的数据无关解决方案,此前此类问题仅能通过学习近似解决。
链接: https://arxiv.org/abs/2506.18739
作者: Debanjan Dutta,Faizanuddin Ansari,Anish Chakrabarty,Swagatam Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Prior work on the learnability of transformers has established its capacity to approximate specific algorithmic patterns through training under restrictive architectural assumptions. Fundamentally, these arguments remain data-driven and therefore can only provide a probabilistic guarantee. Expressivity, on the contrary, has theoretically been explored to address the problems computable by such architectures. These results proved the Turing-completeness of transformers, investigated bounds focused on circuit complexity, and formal logic. Being at the crossroad between learnability and expressivity, the question remains: can transformer architectures exactly simulate an arbitrary attention mechanism, or in particular, the underlying operations? In this study, we investigate the transformer encoder’s ability to simulate a vanilla attention mechanism. By constructing a universal simulator U composed of transformer encoders, we present algorithmic solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP, a formal framework for transformer computation. Our proofs, for the first time, show the existence of an algorithmically achievable data-agnostic solution, previously known to be approximated only by learning.
zh
[AI-9] MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners ICML2025
【速读】:该论文旨在解决文本到音乐生成模型在利用时间变化的音乐属性和参考音频信号进行精确条件控制时的不足问题。其解决方案的关键在于引入位置嵌入(positional embeddings),尤其是在条件为时间函数的情况下,这一机制被证明是至关重要的。通过在解耦的交叉注意力层中添加旋转位置嵌入,实验表明控制精度可从56.6%提升至61.1%,同时相比现有最先进的微调方法,参数量减少了6.75倍,仅需85M可训练参数即可实现更优的可控性。
链接: https://arxiv.org/abs/2506.18729
作者: Fang-Duo Tsai,Shih-Lun Wu,Weijaw Lee,Sheng-Ping Yang,Bo-Rui Chen,Hao-Chung Cheng,Yi-Hsuan Yang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by the 42nd International Conference on Machine Learning (ICML 2025)
Abstract:We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: this https URL.
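论文的关键发现是:给解耦交叉注意力层加上旋转位置嵌入(RoPE)即可大幅提升时变条件的控制精度。以下是 RoPE 的最小 PyTorch 示意实现,展示位置信息如何被"旋转"进查询/键向量;维度与频率基数为常用假设值,并非论文实现。

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """对形状 (B, T, d) 的张量施加旋转位置嵌入(d 需为偶数,示意)。"""
    _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (d/2,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs       # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 每对分量 (x1, x2) 按其位置对应的角度做二维旋转
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# 交叉注意力中:只需在打分前对查询与(随时间变化的)条件键做同样的旋转
q = apply_rope(torch.randn(2, 128, 64))
k = apply_rope(torch.randn(2, 128, 64))
attn = torch.softmax(q @ k.transpose(-1, -2) / 64 ** 0.5, dim=-1)
```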
zh
[AI-10] A Study of Dynamic Stock Relationship Modeling and S&P 500 Price Forecasting Based on Differential Graph Transformer
【速读】:该论文旨在解决股票价格预测中的动态关系建模问题,传统静态相关性模型无法捕捉股票间随时间变化的复杂关系。其解决方案的关键在于提出一种差分图Transformer(Differential Graph Transformer, DGT)框架,通过差分图机制将序列图结构变化整合到多头自注意力中,自适应地保留高价值连接并抑制噪声,同时利用因果时序注意力捕获价格序列中的全局与局部依赖关系。
链接: https://arxiv.org/abs/2506.18717
作者: Linyue Hu,Qi Wang
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:
Abstract:Stock price prediction is vital for investment decisions and risk management, yet remains challenging due to markets’ nonlinear dynamics and time-varying inter-stock correlations. Traditional static-correlation models fail to capture evolving stock relationships. To address this, we propose a Differential Graph Transformer (DGT) framework for dynamic relationship modeling and price prediction. Our DGT integrates sequential graph structure changes into multi-head self-attention via a differential graph mechanism, adaptively preserving high-value connections while suppressing noise. Causal temporal attention captures global/local dependencies in price sequences. We further evaluate correlation metrics (Pearson, Mutual Information, Spearman, Kendall’s Tau) across global/local/dual scopes as spatial-attention priors. Using 10 years of S&P 500 closing prices (z-score normalized; 64-day sliding windows), DGT with spatial priors outperformed GRU baselines (RMSE: 0.24 vs. 0.87). Kendall’s Tau global matrices yielded optimal results (MAE: 0.11). K-means clustering revealed “high-volatility growth” and “defensive blue-chip” stocks, with the latter showing lower errors (RMSE: 0.13) due to stable correlations. Kendall’s Tau and Mutual Information excelled in volatile sectors. This study innovatively combines differential graph structures with Transformers, validating dynamic relationship modeling and identifying optimal correlation metrics/scopes. Clustering analysis supports tailored quantitative strategies. Our framework advances financial time-series prediction through dynamic modeling and cross-asset interaction analysis.
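论文在 64 天滑动窗口上比较多种相关性度量,其中 Kendall's Tau 全局矩阵作为空间注意力先验效果最好。以下 pandas 草图演示如何生成这样的滑动窗口 Kendall 相关矩阵序列;股票数、步长与价格数据均为随机占位。

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 占位数据:500 个交易日、10 只股票的 z-score 归一化收盘价(随机游走)
prices = pd.DataFrame(rng.standard_normal((500, 10)).cumsum(axis=0),
                      columns=[f"stock_{i}" for i in range(10)])

window = 64
corr_seq = []
for start in range(0, len(prices) - window + 1, 16):   # 步长 16,示意用
    win = prices.iloc[start:start + window]
    corr_seq.append(win.corr(method="kendall").to_numpy())  # (10, 10) 相关矩阵

corr_seq = np.stack(corr_seq)   # (T', 10, 10):可作为图结构先验输入 DGT
print(corr_seq.shape)
```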
zh
[AI-11] Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement
【速读】:该论文旨在解决传统训练损失函数(如尺度不变信噪比,SDR)在多通道语音增强任务中无法有效保留对音素可懂性至关重要的细粒度频谱线索的问题。其解决方案的关键在于提出感知启发的SDR损失变体,这些损失函数在时频域中进行公式化,并通过频率依赖的加权方案进行调制,以强调语音显著或干扰噪声较强的时频区域。
链接: https://arxiv.org/abs/2506.18714
作者: Nasser-Eddine Monir,Paul Magron,Romain Serizel
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: This is the preprint of the paper submitted to the 26th IEEE International Workshop on Multimedia Signal Processing (MMSP)
Abstract:Recent advances in deep learning have significantly improved multichannel speech enhancement algorithms, yet conventional training loss functions such as the scale-invariant signal-to-distortion ratio (SDR) may fail to preserve fine-grained spectral cues essential for phoneme intelligibility. In this work, we propose perceptually-informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. These weights are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong. We investigate both fixed and adaptive strategies, including ANSI band-importance weights, spectral magnitude-based weighting, and dynamic weighting based on the relative amount of speech and noise. We train the FaSNet multichannel speech enhancement model using these various losses. Experimental results show that while standard metrics such as the SDR are only marginally improved, their perceptual frequency-weighted counterparts exhibit a more substantial improvement. Besides, spectral and phoneme-level analysis indicates better consonant reconstruction, which points to a better preservation of certain acoustic cues.
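摘要描述的损失是"时频域 SDR 误差 + 频率相关权重"。以下 PyTorch 草图先用 STFT 变换到时频域,再按频带权重对参考能量与误差能量加权后计算 SDR;权重向量 w 既可取固定的频带重要性(如 ANSI 权重),也可由语音/噪声能量动态给出,这里用随机占位,具体加权形式为本示例的假设。

```python
import torch

def weighted_tf_sdr_loss(est: torch.Tensor, ref: torch.Tensor,
                         w: torch.Tensor, n_fft: int = 512) -> torch.Tensor:
    """频率加权的时频域 SDR 损失(示意,返回负 SDR 以便最小化)。

    est, ref: 形状 (B, T) 的估计/参考波形;w: 形状 (n_fft//2+1,) 的频带权重。
    """
    win = torch.hann_window(n_fft, device=est.device)
    E = torch.stft(est, n_fft, hop_length=n_fft // 4, window=win,
                   return_complex=True)                     # (B, F, T')
    R = torch.stft(ref, n_fft, hop_length=n_fft // 4, window=win,
                   return_complex=True)
    w = w.view(1, -1, 1)                                    # 广播到 (B, F, T')
    sig = (w * R.abs() ** 2).sum(dim=(1, 2))                # 加权参考能量
    err = (w * (E - R).abs() ** 2).sum(dim=(1, 2))          # 加权误差能量
    sdr = 10 * torch.log10(sig / (err + 1e-8) + 1e-8)
    return -sdr.mean()

est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
w = torch.rand(512 // 2 + 1)      # 占位权重:实际可用 ANSI 频带重要性等
print(weighted_tf_sdr_loss(est, ref, w))
```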
zh
[AI-12] NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments
【速读】:该论文试图解决在非结构化且无GPS信号的环境中实现自主空中目标跟踪这一基础性挑战。现有方法通常依赖于运动捕捉系统、预先绘制的场景或基于特征的定位,从而限制了其在真实环境中的部署。解决方案的关键在于提出NOVA,这是一个完全机载、以目标为中心的框架,仅使用立体相机和惯性测量单元(IMU)即可实现鲁棒的目标跟踪和避障导航。NOVA通过在目标参考系中进行感知、估计和控制,避免构建全局地图或依赖绝对定位,并结合轻量级目标检测、立体深度补全及直方图滤波等技术,实现了高精度的目标位姿估计与动态轨迹规划。
链接: https://arxiv.org/abs/2506.18689
作者: Alessandro Saviolo,Giuseppe Loianno
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous aerial target tracking in unstructured and GPS-denied environments remains a fundamental challenge in robotics. Many existing methods rely on motion capture systems, pre-mapped scenes, or feature-based localization to ensure safety and control, limiting their deployment in real-world conditions. We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation using only a stereo camera and an IMU. Rather than constructing a global map or relying on absolute localization, NOVA formulates perception, estimation, and control entirely in the target’s reference frame. A tightly integrated stack combines a lightweight object detector with stereo depth completion, followed by histogram-based filtering to infer robust target distances under occlusion and noise. These measurements feed a visual-inertial state estimator that recovers the full 6-DoF pose of the robot relative to the target. A nonlinear model predictive controller (NMPC) plans dynamically feasible trajectories in the target frame. To ensure safety, high-order control barrier functions are constructed online from a compact set of high-risk collision points extracted from depth, enabling real-time obstacle avoidance without maps or dense representations. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss and severe lighting changes that disrupt feature-based localization. Each experiment is repeated multiple times under similar conditions to assess resilience, showing consistent and reliable performance. NOVA achieves agile target following at speeds exceeding 50 km/h. These results show that high-speed vision-based tracking is possible in the wild using only onboard sensing, with no reliance on external localization or environment assumptions.
zh
[AI-13] Dual-level Behavioral Consistency for Inter-group and Intra-group Coordination in Multi-Agent Systems
【速读】:该论文试图解决多智能体强化学习(MARL)中行为一致性控制的问题,特别是针对多智能体分组场景下的行为一致性不足。其解决方案的关键在于提出一种名为双层次行为一致性(DLBC)的新方法,该方法通过将智能体划分为不同群体,并动态调节群体内和群体间的行为多样性,从而实现群体内协作增强和群体间任务专业化,同时通过直接约束智能体策略函数确保方法的广泛适用性。
链接: https://arxiv.org/abs/2506.18651
作者: Shuocun Yang,Huawen Hu,Enze Shi,Shu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Behavioral diversity in multi-agent reinforcement learning (MARL) represents an emerging and promising research area. Prior work has largely centered on intra-group behavioral consistency in multi-agent systems, with limited attention given to behavioral consistency in multi-agent grouping scenarios. In this paper, we introduce Dual-Level Behavioral Consistency (DLBC), a novel MARL control method designed to explicitly regulate agent behaviors at both intra-group and inter-group levels. DLBC partitions agents into distinct groups and dynamically modulates behavioral diversity both within and between these groups. This modulation achieves enhanced division of labor through inter-group consistency, which constrains behavioral strategies across different groups. Simultaneously, intra-group consistency, achieved by aligning behavioral strategies within each group, fosters stronger intra-group cooperation. Crucially, DLBC’s direct constraint of agent policy functions ensures its broad applicability across various algorithmic frameworks. Experimental results in various grouping cooperation scenarios demonstrate that DLBC significantly enhances both intra-group cooperative performance and inter-group task specialization, yielding substantial performance improvements. DLBC provides new ideas for behavioral consistency control of multi-agent systems, and its potential for application in more complex tasks and dynamic environments can be further explored in the future.
zh
[AI-14] Federated Loss Exploration for Improved Convergence on Non-IID Data
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在非独立同分布(non-IID)数据场景下的性能瓶颈问题,特别是在数据异质性较强时现有方法表现不佳的挑战。其解决方案的关键在于提出FedLEx(Federated Loss Exploration),通过客户端计算模型参数的梯度偏差并贡献到全局引导矩阵中,该矩阵作为策略性导航工具,指导后续FL轮次中客户端的梯度更新,从而实现全局模型参数的优化更新。FedLEx通过高效构建强全局引导矩阵,在少量训练轮次和数据量下即可实现模型收敛,无需额外的数据共享或数据分布统计信息。
链接: https://arxiv.org/abs/2506.18640
作者: Christian Internò,Markus Olhofer,Yaochu Jin,Barbara Hammer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) has emerged as a groundbreaking paradigm in machine learning (ML), offering privacy-preserving collaborative model training across diverse datasets. Despite its promise, FL faces significant hurdles in non-identically and independently distributed (non-IID) data scenarios, where most existing methods often struggle with data heterogeneity and lack robustness in performance. This paper introduces Federated Loss Exploration (FedLEx), an innovative approach specifically designed to tackle these challenges. FedLEx distinctively addresses the shortcomings of existing FL methods in non-IID settings by optimizing its learning behavior for scenarios in which assumptions about data heterogeneity are impractical or unknown. It employs a federated loss exploration technique, where clients contribute to a global guidance matrix by calculating gradient deviations for model parameters. This matrix serves as a strategic compass to guide clients’ gradient updates in subsequent FL rounds, thereby fostering optimal parameter updates for the global model. FedLEx effectively navigates the complex loss surfaces inherent in non-IID data, enhancing knowledge transfer in an efficient manner, since only a small number of epochs and a small amount of data are required to build a strong global guidance matrix that can achieve model convergence without the need for additional data sharing or data distribution statistics in a large client scenario. Our extensive experiments with state-of-the-art FL algorithms demonstrate significant improvements in performance, particularly under realistic non-IID conditions, thus highlighting FedLEx’s potential to overcome critical barriers in diverse FL applications.
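FedLEx 的核心数据结构是由各客户端梯度偏差聚合成的全局引导矩阵。摘要未给出偏差与归一化的具体定义,下面的 NumPy 草图按其思路做最小化演示:客户端统计本地梯度的逐参数波动,服务器取平均并归一化为 [0, 1] 权重,用于调制后续轮次的梯度更新;全部细节均为示意性假设。

```python
import numpy as np

def client_gradient_deviation(grads):
    """客户端:统计一轮本地训练中各 step 梯度的逐参数标准差(示意)。"""
    return np.stack(grads).std(axis=0)   # 偏差大 → 该参数对本地数据更敏感

def build_guidance_matrix(deviations):
    """服务器:聚合所有客户端的偏差并归一化为 [0, 1] 的引导权重。"""
    g = np.mean(deviations, axis=0)
    return (g - g.min()) / (g.max() - g.min() + 1e-12)

# 示意:3 个客户端,各自基于 5 个 step 的梯度(参数展平为向量)上报偏差
rng = np.random.default_rng(0)
devs = [client_gradient_deviation(rng.normal(size=(5, 1000))) for _ in range(3)]
guidance = build_guidance_matrix(devs)

grad = rng.normal(size=1000)
guided_update = guidance * grad   # 下一轮:按引导权重调制梯度更新(示意)
```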
zh
[AI-15] Granular-Ball-Induced Multiple Kernel K-Means IJCAI2025
【速读】:该论文旨在解决现有多核聚类算法(如多核K-means)在面对复杂数据分布时计算效率低和鲁棒性差的问题。这些问题源于算法依赖点对点关系进行优化,难以准确捕捉数据集的内在结构和多样性,同时多核之间的复杂交互也会加剧这些缺陷,影响其在高维空间中的聚类能力。论文提出的解决方案的关键在于引入粒球计算(granular-ball computing),通过从粗到细的自适应方式拟合数据分布,每个粒球基于密度一致性度量包围数据点,从而提升计算效率和对未知噪声的鲁棒性。基于粒球表示,论文提出了粒球核(granular-ball kernel, GBK)及其对应的粒球多核K-means框架(GB-MKKM),实验证明该框架在多种聚类任务中表现出更高的效率和聚类性能。
链接: https://arxiv.org/abs/2506.18637
作者: Shuyin Xia,Yifan Wang,Lifeng Shen,Guoyin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025
Abstract:Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can lead to difficulty in accurately capturing data sets’ inherent structure and diversity. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, effectively impacting their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework. The core of granular-ball computing is to adaptively fit data distribution by balls from coarse to acceptable levels. Each ball can enclose data points based on a density consistency measurement. Such ball-based data description thus improves the computational efficiency and the robustness to unknown noises. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering. Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in the empirical evaluation of various clustering tasks.
zh
[AI-16] Multi-Agent Reinforcement Learning for Inverse Design in Photonic Integrated Circuits
[Quick Read]: This paper addresses the tendency of traditional gradient-based optimization for the inverse design of photonic integrated circuits (PICs) to become trapped in local minima, yielding designs with suboptimal functionality. The key to the solution is a reinforcement learning (RL) environment together with multi-agent RL algorithms: the design space is discretized into a grid and decomposed into thousands of independent agents, enabling efficient optimization from only a few thousand environment samples. The method outperforms state-of-the-art gradient-based optimization on both two- and three-dimensional design tasks.
Link: https://arxiv.org/abs/2506.18627
Authors: Yannik Mahlau, Maximilian Schier, Christoph Reinders, Frederik Schubert, Marco Bügling, Bodo Rosenhahn
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Inverse design of photonic integrated circuits (PICs) has traditionally relied on gradient-based optimization. However, this approach is prone to end up in local minima, which results in suboptimal design functionality. As interest in PICs increases due to their potential for addressing modern hardware demands through optical computing, more adaptive optimization algorithms are needed. We present a reinforcement learning (RL) environment as well as multi-agent RL algorithms for the design of PICs. By discretizing the design space into a grid, we formulate the design task as an optimization problem with thousands of binary variables. We consider multiple two- and three-dimensional design tasks that represent PIC components for an optical computing system. By decomposing the design space into thousands of individual agents, our algorithms are able to optimize designs with only a few thousand environment samples. They outperform previous state-of-the-art gradient-based optimization in both two- and three-dimensional design tasks. Our work may also serve as a benchmark for further exploration of sample-efficient RL for inverse design in photonics.
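One way to read "thousands of independent agents over a binary grid" is a per-pixel Bernoulli policy trained against a shared reward, sketched below. The authors' actual multi-agent RL algorithm is more elaborate; `evaluate`, the grid size, and the REINFORCE-style update here are illustrative assumptions.

```python
import numpy as np

def optimize_design(evaluate, shape=(32, 32), iters=200, batch=16, lr=0.1):
    """Each pixel is an independent agent holding a Bernoulli policy over {0, 1}."""
    p = np.full(shape, 0.5)                      # per-agent probability of material
    for _ in range(iters):
        designs = (np.random.rand(batch, *shape) < p).astype(float)
        scores = np.array([evaluate(d) for d in designs])
        adv = scores - scores.mean()             # shared scalar reward, centred
        # REINFORCE-style update, applied independently per pixel-agent
        grad = (adv[:, None, None] * (designs - p)).mean(axis=0)
        p = np.clip(p + lr * grad, 0.01, 0.99)
    return (p > 0.5).astype(int)

# toy figure of merit: fraction of material in the upper half of the grid
best = optimize_design(lambda d: d[:16].mean())
```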
zh
[AI-17] Frequency Control in Microgrids: An Adaptive Fuzzy-Neural-Network Virtual Synchronous Generator
[Quick Read]: This paper addresses the instability of frequency regulation caused by the reduced inertia and damping of microgrids under widespread adoption of distributed renewable energy. The key to the solution is a fuzzy-neural-network controller that dynamically adjusts the inertia, damping, and droop parameters of a virtual synchronous generator in real time, achieving stable frequency control. Accounting for the penetration and impact of renewable sources, the method effectively reduces frequency deviation and shortens the system's recovery time.
Link: https://arxiv.org/abs/2506.18611
Authors: Waleed Breesam, Rezvan Alamian, Nima Tashakor, Brahim Elkhalil Youcefa, Stefan M. Goetz
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 11 pages, 17 figures
Abstract:The reliance on distributed renewable energy has increased recently. As a result, power electronic-based distributed generators replaced synchronous generators, which led to a change in the dynamic characteristics of the microgrid. Most critically, they reduced system inertia and damping. Virtual synchronous generators emulated in power electronics, which mimic the dynamic behaviour of synchronous generators, are meant to fix this problem. However, fixed virtual synchronous generator parameters cannot guarantee a frequency regulation within the acceptable tolerance range. Conversely, a dynamic adjustment of these virtual parameters promises a robust solution with stable frequency. This paper proposes a method to adapt the inertia, damping, and droop parameters dynamically through a fuzzy neural network controller. This controller trains itself online to choose appropriate values for these virtual parameters. The proposed method can be applied to a typical AC microgrid by considering the penetration and impact of renewable energy sources. We study the system in a MATLAB/Simulink model and validate it experimentally in real time using hardware-in-the-loop based on an embedded ARM system (SAM3X8E, Cortex-M3). Compared to traditional and fuzzy logic controller methods, the results demonstrate that the proposed method significantly reduces the frequency deviation to less than 0.03 Hz and shortens the stabilizing/recovery time.
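The plant being adapted is the standard virtual-synchronous-generator swing equation with droop. A minimal Euler-step sketch follows, with the fuzzy-neural-network controller abstracted into the `params` triple it would supply at each step; names and values are illustrative.

```python
import math

def vsg_step(omega, p_ref, p_e, params, omega0=2 * math.pi * 50, dt=1e-3):
    """One Euler step of the VSG swing equation
         J * domega/dt = P_m - P_e - D * (omega - omega0),
       with droop control P_m = P_ref + Kp * (omega0 - omega)."""
    J, D, Kp = params          # inertia, damping, droop gain (adapted online)
    p_m = p_ref + Kp * (omega0 - omega)
    domega = (p_m - p_e - D * (omega - omega0)) / J
    return omega + dt * domega

# the adaptive controller would output a new (J, D, Kp) every control period
omega = vsg_step(omega=2 * math.pi * 50, p_ref=1.0, p_e=1.2, params=(0.2, 15.0, 20.0))
```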
zh
[AI-18] Simulation-Free Differential Dynamics through Neural Conservation Laws
[Quick Read]: This paper tackles the problem of training continuous-time diffusion processes against very general objective functions: conventional approaches either prescribe the optimal diffusion process (which only works for highly restricted problem formulations) or require expensive simulation to numerically obtain the time-dependent densities and sample from the diffusion process. The key to the solution is a coupled parameterization that jointly models the time-dependent density function (probability path) and the dynamics of the diffusion process that generates it. By baking in the Fokker-Planck equation and the density-function requirements as hard constraints, the approach greatly simplifies the construction of Neural Conservation Laws and enables simulation-free training across problem formulations ranging from data-driven objectives to optimality-based ones.
Link: https://arxiv.org/abs/2506.18604
Authors: Mengjian Hua, Eric Vanden-Eijnden, Ricky T.Q. Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present a novel simulation-free framework for training continuous-time diffusion processes over very general objective functions. Existing methods typically involve either prescribing the optimal diffusion process – which only works for heavily restricted problem formulations – or require expensive simulation to numerically obtain the time-dependent densities and sample from the diffusion process. In contrast, we propose a coupled parameterization which jointly models a time-dependent density function, or probability path, and the dynamics of a diffusion process that generates this probability path. To accomplish this, our approach directly bakes in the Fokker-Planck equation and density function requirements as hard constraints, by extending and greatly simplifying the construction of Neural Conservation Laws. This enables simulation-free training for a large variety of problem formulations, from data-driven objectives as in generative modeling and dynamical optimal transport, to optimality-based objectives as in stochastic optimal control, with straightforward extensions to mean-field objectives due to the ease of accessing exact density functions. We validate our method in a diverse range of application domains from modeling spatio-temporal events to learning optimal dynamics from population data.
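The hard constraint being baked in is the standard Fokker-Planck equation linking the density path to the SDE drift; a sketch in the usual notation (constant diffusion coefficient assumed for simplicity):

```latex
% For the diffusion dX_t = b_t(X_t)\,dt + \sigma\,dW_t, the density path
% \rho_t must satisfy the Fokker-Planck equation:
\partial_t \rho_t(x) \;=\; -\nabla \cdot \big(\rho_t(x)\, b_t(x)\big)
\;+\; \frac{\sigma^2}{2}\,\Delta \rho_t(x)
```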
zh
[AI-19] Optimization-Induced Dynamics of Lipschitz Continuity in Neural Networks
[Quick Read]: This paper studies the under-explored dynamics of Lipschitz continuity in neural networks during training, i.e., how the worst-case sensitivity to small input perturbations evolves over time. The key to the solution is a rigorous mathematical framework that models the temporal evolution of Lipschitz continuity with stochastic differential equations (SDEs), capturing both deterministic and stochastic forces and identifying three principal driving factors: the projection of optimization-induced gradient flow onto the operator-norm Jacobian of the parameter matrices; the projection of gradient noise, arising from mini-batch sampling randomness, onto the operator-norm Jacobian; and the projection of gradient noise onto the operator-norm Hessian of the parameter matrices.
Link: https://arxiv.org/abs/2506.18588
Authors: Róisín Luo, James McDermott, Christian Gagné, Qiang Sun, Colm O'Riordan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Lipschitz continuity characterizes the worst-case sensitivity of neural networks to small input perturbations; yet its dynamics (i.e. temporal evolution) during training remains under-explored. We present a rigorous mathematical framework to model the temporal evolution of Lipschitz continuity during training with stochastic gradient descent (SGD). This framework leverages a system of stochastic differential equations (SDEs) to capture both deterministic and stochastic forces. Our theoretical analysis identifies three principal factors driving the evolution: (i) the projection of gradient flows, induced by the optimization dynamics, onto the operator-norm Jacobian of parameter matrices; (ii) the projection of gradient noise, arising from the randomness in mini-batch sampling, onto the operator-norm Jacobian; and (iii) the projection of the gradient noise onto the operator-norm Hessian of parameter matrices. Furthermore, our theoretical framework sheds light on how noisy supervision, parameter initialization, batch size, and mini-batch sampling trajectories, among other factors, shape the evolution of the Lipschitz continuity of neural networks. Our experimental results demonstrate strong agreement between the theoretical implications and the observed behaviors.
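A simple empirical companion to this analysis, assuming a ReLU MLP: track the product of per-layer spectral (operator) norms, a standard upper bound on the network's Lipschitz constant, across SGD steps. This illustrates the quantity whose temporal evolution the paper models; it is not the paper's SDE framework itself.

```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(model):
    """Product of per-layer operator (spectral) norms: an upper bound
    on the Lipschitz constant of a ReLU MLP."""
    bound = 1.0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()
    return bound

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(100):                         # toy objective; observe the dynamics
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 20 == 0:
        print(step, lipschitz_upper_bound(model))
```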
zh
[AI-20] T-CPDL: A Temporal Causal Probabilistic Description Logic for Developing Logic-RAG Agent
[Quick Read]: This paper addresses the weakness of large language models on structured reasoning tasks involving temporal constraints, causal relationships, and probabilistic reasoning. The key to the solution is Temporal Causal Probabilistic Description Logic (T-CPDL), an integrated framework that extends traditional Description Logic with temporal interval operators, explicit causal relationships, and probabilistic annotations, supporting reasoning tasks that range from simple temporal ordering to nuanced probabilistic causation.
Link: https://arxiv.org/abs/2506.18559
Authors: Hong Qing Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:Large language models excel at generating fluent text but frequently struggle with structured reasoning involving temporal constraints, causal relationships, and probabilistic reasoning. To address these limitations, we propose Temporal Causal Probabilistic Description Logic (T-CPDL), an integrated framework that extends traditional Description Logic with temporal interval operators, explicit causal relationships, and probabilistic annotations. We present two distinct variants of T-CPDL: one capturing qualitative temporal relationships through Allen’s interval algebra, and another variant enriched with explicit timestamped causal assertions. Both variants share a unified logical structure, enabling complex reasoning tasks ranging from simple temporal ordering to nuanced probabilistic causation. Empirical evaluations on temporal reasoning and causal inference benchmarks confirm that T-CPDL substantially improves inference accuracy, interpretability, and confidence calibration of language model outputs. By delivering transparent reasoning paths and fine-grained temporal and causal semantics, T-CPDL significantly enhances the capability of language models to support robust, explainable, and trustworthy decision-making. This work also lays the groundwork for developing advanced Logic-Retrieval-Augmented Generation (Logic-RAG) frameworks, potentially boosting the reasoning capabilities and efficiency of knowledge graph-enhanced RAG systems.
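Allen's interval algebra, which the first T-CPDL variant builds on, reduces to endpoint comparisons. A compact sketch covering the seven base relations, with the six inverse relations folded into one fallback for brevity:

```python
def allen_relation(a, b):
    """Relation of interval a to interval b in Allen's interval algebra."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:               return "before"
    if e1 == s2:              return "meets"
    if s1 == s2 and e1 == e2: return "equals"
    if s1 == s2 and e1 < e2:  return "starts"
    if e1 == e2 and s1 > s2:  return "finishes"
    if s1 > s2 and e1 < e2:   return "during"
    if s1 < s2 < e1 < e2:     return "overlaps"
    return "inverse"          # one of the six inverse relations

print(allen_relation((1, 3), (3, 5)))   # -> "meets"
```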
zh
[AI-21] Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks
[Quick Read]: This paper addresses the security of large language models (LLMs) under jailbreak attacks, in particular the insufficiently examined robustness of emerging open-source models such as DeepSeek. The key to the solution is a systematic jailbreak evaluation built on the HarmBench benchmark, analysing how different attack strategies threaten model safety and revealing how architecture (such as Mixture-of-Experts, MoE) shapes it. The study finds that DeepSeek's MoE architecture exhibits selective robustness against optimization-based attacks but marked vulnerability to prompt-based and manually engineered attacks, indicating the need for targeted safety tuning and modular alignment strategies to improve the security of open-source LLMs.
Link: https://arxiv.org/abs/2506.18543
Authors: Xiaodong Wu, Xiangman Li, Jianbing Ni
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek’s Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.
zh
[AI-22] A Question Bank to Assess AI Inclusivity: Mapping out the Journey from Diversity Errors to Inclusion Excellence
[Quick Read]: This paper addresses a gap in current artificial intelligence (AI) risk-assessment frameworks regarding diversity and inclusion (D&I): the lack of standardized tools for measuring an AI system's alignment with D&I principles. The key to the solution is a structured AI inclusivity question bank of 253 questions that assesses inclusivity along five core pillars: Humans, Data, Process, System, and Governance. The question bank was developed through an iterative, multi-source approach combining literature reviews, D&I guidelines, Responsible AI frameworks, and a simulated user study, ensuring its relevance and effectiveness across different AI roles and application domains.
Link: https://arxiv.org/abs/2506.18538
Authors: Rifat Ara Shams, Didar Zowghi, Muneera Bano
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Ensuring diversity and inclusion (DI) in artificial intelligence (AI) is crucial for mitigating biases and promoting equitable decision-making. However, existing AI risk assessment frameworks often overlook inclusivity, lacking standardized tools to measure an AI system’s alignment with DI principles. This paper introduces a structured AI inclusivity question bank, a comprehensive set of 253 questions designed to evaluate AI inclusivity across five pillars: Humans, Data, Process, System, and Governance. The development of the question bank involved an iterative, multi-source approach, incorporating insights from literature reviews, DI guidelines, Responsible AI frameworks, and a simulated user study. The simulated evaluation, conducted with 70 AI-generated personas related to different AI jobs, assessed the question bank’s relevance and effectiveness for AI inclusivity across diverse roles and application domains. The findings highlight the importance of integrating DI principles into AI development workflows and governance structures. The question bank provides an actionable tool for researchers, practitioners, and policymakers to systematically assess and enhance the inclusivity of AI systems, paving the way for more equitable and responsible AI technologies.
zh
[AI-23] Embedded FPGA Acceleration of Brain-Like Neural Networks: Online Learning to Scalable Inference
[Quick Read]: This paper addresses the need of edge computing devices for low-power, adaptive learning models, which traditional deep learning models fail to meet due to overparameterization, high energy consumption, and cloud dependence. The key to the solution is an architecture based on Brain-Like Neural Networks (BLNNs), specifically the Bayesian Confidence Propagation Neural Network (BCPNN), which mimics cortical structure and biologically constrained learning mechanisms to obtain sparse architectures, local learning rules, and unsupervised/semi-supervised learning suited to low-power edge intelligence. The paper further contributes the first embedded FPGA accelerator for BCPNN, built with high-level synthesis on a Zynq UltraScale+ SoC, implementing both online-learning and inference-only kernels with variable and mixed precision, which substantially improves performance while cutting energy consumption.
Link: https://arxiv.org/abs/2506.18530
Authors: Muhammad Ihsan Al Hafiz, Naresh Ravichandran, Anders Lansner, Pawel Herman, Artur Podobas
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Edge AI applications increasingly require models that can learn and adapt on-device with minimal energy budget. Traditional deep learning models, while powerful, are often overparameterized, energy-hungry, and dependent on cloud connectivity. Brain-Like Neural Networks (BLNNs), such as the Bayesian Confidence Propagation Neural Network (BCPNN), propose a neuromorphic alternative by mimicking cortical architecture and biologically-constrained learning. They offer sparse architectures with local learning rules and unsupervised/semi-supervised learning, making them well-suited for low-power edge intelligence. However, existing BCPNN implementations rely on GPUs or datacenter FPGAs, limiting their applicability to embedded systems. This work presents the first embedded FPGA accelerator for BCPNN on a Zynq UltraScale+ SoC using High-Level Synthesis. We implement both online learning and inference-only kernels with support for variable and mixed precision. Evaluated on MNIST, Pneumonia, and Breast Cancer datasets, our accelerator achieves up to 17.5x latency and 94% energy savings over ARM baselines, without sacrificing accuracy. This work enables practical neuromorphic computing on edge devices, bridging the gap between brain-like learning and real-world deployment.
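A common textbook form of the BCPNN learning rule, which such an accelerator would implement in hardware, derives weights from co-activation statistics. A sketch under that assumption; the Laplace-style `eps` smoothing is a choice made here, not necessarily the paper's.

```python
import numpy as np

def bcpnn_weights(X, Y, eps=1e-4):
    """BCPNN rule (a common textbook form):
         w_ij = log( P(x_i, y_j) / (P(x_i) * P(y_j)) ),   b_j = log P(y_j).
    X: (n_samples, n_in) binary activations; Y: (n_samples, n_out)."""
    n = len(X)
    pi = (X.sum(0) + eps) / (n + eps)           # input-unit marginals
    pj = (Y.sum(0) + eps) / (n + eps)           # output-unit marginals
    pij = (X.T @ Y + eps) / (n + eps)           # co-activation probabilities
    W = np.log(pij / np.outer(pi, pj))
    b = np.log(pj)
    return W, b
```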
zh
[AI-24] Standard Applicability Judgment and Cross-jurisdictional Reasoning: A RAG-based Framework for Medical Device Compliance
[Quick Read]: This paper addresses a key challenge in medical-device compliance: accurately identifying which regulatory standards apply to a given device, a process that usually requires experts to interpret fragmented, heterogeneous documents across jurisdictions. The key to the solution is a modular AI system built on a retrieval-augmented generation (RAG) pipeline that automates standard-applicability determination: it retrieves candidate standards from a curated corpus and uses large language models to infer jurisdiction-specific applicability, classified as Mandatory, Recommended, or Not Applicable, with traceable justifications.
Link: https://arxiv.org/abs/2506.18511
Authors: Yu Han, Aaron Ceross, Jeroen H.M. Bergmann
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Identifying the appropriate regulatory standard applicability remains a critical yet understudied challenge in medical device compliance, frequently necessitating expert interpretation of fragmented and heterogeneous documentation across different jurisdictions. To address this challenge, we introduce a modular AI system that leverages a retrieval-augmented generation (RAG) pipeline to automate standard applicability determination. Given a free-text device description, our system retrieves candidate standards from a curated corpus and uses large language models to infer jurisdiction-specific applicability, classified as Mandatory, Recommended, or Not Applicable, with traceable justifications. We construct an international benchmark dataset of medical device descriptions with expert-annotated standard mappings, and evaluate our system against retrieval-only, zero-shot, and rule-based baselines. The proposed approach attains a classification accuracy of 73% and a Top-5 retrieval recall of 87%, demonstrating its effectiveness in identifying relevant regulatory standards. We introduce the first end-to-end system for standard applicability reasoning, enabling scalable and interpretable AI-supported regulatory science. Notably, our region-aware RAG agent performs cross-jurisdictional reasoning between Chinese and U.S. standards, supporting conflict resolution and applicability justification across regulatory frameworks.
zh
[AI-25] PuckTrick: A Library for Making Synthetic Data More Realistic
[Quick Read]: This paper addresses the problem that synthetic data is often too clean in practice, lacking the imperfections of real data (missing values, noise, outliers, mislabeled examples), which can hurt the generalization and robustness of machine learning (ML) models. The key to the solution is Pucktrick, a Python library that systematically injects controlled errors into synthetic datasets, simulating real-world imperfections so that the stability and performance of ML models under data defects can be evaluated.
Link: https://arxiv.org/abs/2506.18499
Authors: Alessandra Agostini, Andrea Maurino, Blerina Spahiu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 17 pages, 3 figures
Abstract:The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.
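The idea of controlled contamination can be sketched generically as below. This is not Pucktrick's actual API (the abstract does not show it); the function and parameter names are hypothetical.

```python
import numpy as np
import pandas as pd

def contaminate(df, target, missing=0.05, noise=0.05, flip=0.02, seed=0):
    """Inject controlled imperfections into a clean (e.g. synthetic) dataset:
    missing cells, Gaussian noise on numeric features, and label flips."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    num_cols = [c for c in out.select_dtypes("number").columns if c != target]
    for c in num_cols:
        mask = rng.random(len(out)) < missing
        out.loc[mask, c] = np.nan                                  # missing values
        out[c] += rng.normal(0, noise * out[c].std(), len(out))    # noisy values
    labels = out[target].unique()
    flip_mask = rng.random(len(out)) < flip
    # note: a flip may redraw the same label; good enough for a sketch
    out.loc[flip_mask, target] = rng.choice(labels, flip_mask.sum())
    return out
```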
zh
[AI-26] How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
[Quick Read]: This paper asks whether model edits persist after fine-tuning, i.e., whether edited behavior is inadvertently undone by subsequent fine-tuning. The key to the solution is a systematic study of the interaction between editing techniques and fine-tuning methods, particularly in text-to-image (T2I) diffusion models, revealing the root causes of poor edit persistence and assessing how robust different methods are to fine-tuning.
Link: https://arxiv.org/abs/2506.18428
Authors: Feng He, Zhenyang Liu, Marco Valentino, Zhixue Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Model editing offers a low-cost technique to inject or correct a particular behavior in a pre-trained model without extensive retraining, supporting applications such as factual correction and bias mitigation. Despite this common practice, it remains unknown whether edits persist after fine-tuning or whether they are inadvertently reversed. This question has fundamental practical implications. For example, if fine-tuning removes prior edits, it could serve as a defence mechanism against hidden malicious edits. Vice versa, the unintended removal of edits related to bias mitigation could pose serious safety concerns. We systematically investigate the interaction between model editing and fine-tuning in the context of T2I diffusion models, which are known to exhibit biases and generate inappropriate content. Our study spans two T2I model families (Stable Diffusion and FLUX), two state-of-the-art editing techniques, and three fine-tuning methods (DreamBooth, LoRA, and DoRA). Through an extensive empirical analysis across diverse editing tasks and evaluation metrics, our findings reveal a trend: edits generally fail to persist through fine-tuning, even when fine-tuning is tangential or unrelated to the edits. Notably, we observe that DoRA exhibits the strongest edit reversal effect. At the same time, among editing methods, UCE demonstrates greater robustness, retaining significantly higher efficacy post-fine-tuning compared to ReFACT. These findings highlight a crucial limitation in current editing methodologies, emphasizing the need for more robust techniques to ensure reliable long-term control and alignment of deployed AI systems. These findings have dual implications for AI safety: they suggest that fine-tuning could serve as a remediation mechanism for malicious edits while simultaneously highlighting the need for re-editing after fine-tuning to maintain beneficial safety and alignment properties.
zh
[AI-27] A Large Language Model-based Multi-Agent Framework for Analog Circuits Sizing Relationships Extraction
[Quick Read]: This paper addresses the optimization efficiency of device sizing during the pre-layout phase of analog circuit design: existing methods neither introduce prior knowledge automatically nor prune the search space effectively, leaving its compression potential untapped. The key to the solution is a large language model (LLM)-based multi-agent framework that extracts sizing relationships for analog circuits from academic papers, so that the search space of the sizing process can be pruned effectively using the extracted relationships. Experiments show the method markedly improves optimization efficiency.
Link: https://arxiv.org/abs/2506.18424
Authors: Chengjie Liu, Weiyu Chen, Huiyao Xu, Yuan Du, Jun Yang, Li Du
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted by ISEDA 2025
Abstract:In the design process of the analog circuit pre-layout phase, device sizing is an important step in determining whether an analog circuit can meet the required performance metrics. Many existing techniques cast the circuit sizing task as a mathematical optimization problem and continuously improve the optimization efficiency from a mathematical perspective. However, they ignore the automatic introduction of prior knowledge and fail to prune the search space effectively, leaving a considerable compression margin in the search space. To alleviate this problem, we propose a large language model (LLM)-based multi-agent framework for extracting analog circuits' sizing relationships from academic papers. The search space in the sizing process can be effectively pruned based on the sizing relationships extracted by this framework. Eventually, we conducted tests on 3 types of circuits, and the optimization efficiency was improved by 2.32 \sim 26.6 \times . This work demonstrates that the LLM can effectively prune the search space for analog circuit sizing, providing a new solution for the combination of LLMs and conventional analog circuit design automation methods.
zh
[AI-28] The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
[Quick Read]: This paper addresses the decay of effectiveness in iterative debugging by current generative AI code systems: debugging ability decays exponentially across attempts, with models typically losing 60-80% of it within just 2-3 attempts. The key to the solution is the Debugging Decay Index (DDI), a mathematical framework that quantifies when debugging becomes ineffective and predicts intervention points, paired with a strategic fresh-start approach that shifts from exploitation to exploration at strategic points in the debugging process, demonstrating that well-timed interventions can restore debugging effectiveness.
Link: https://arxiv.org/abs/2506.18403
Authors: Muntasir Adnan, Carlos C. N. Kuhn
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:The effectiveness of AI debugging follows a predictable exponential decay pattern; most models lose 60-80% of their debugging capability within just 2-3 attempts, despite iterative debugging being a critical capability for practical code generation systems. We introduce the Debugging Decay Index (DDI), a mathematical framework that quantifies when debugging becomes ineffective and predicts intervention points. Our strategic fresh start approach shifts from exploitation to exploration at strategic points in the debugging process, demonstrating that well-timed interventions can rescue the effectiveness of debugging. DDI reveals a fundamental limitation in current AI debugging and provides the first quantitative framework for optimising iterative code generation strategies.
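The exponential-decay reading of the abstract suggests fitting E(t) = E0 * exp(-lam * t) to per-attempt success rates and intervening once predicted effectiveness falls below a floor. A sketch under those assumptions; the concrete DDI definition may differ, and the data below is illustrative.

```python
import numpy as np

def fit_decay(success_rates):
    """Fit E(t) = E0 * exp(-lam * t) by least squares in log space."""
    t = np.arange(len(success_rates))
    slope, log_e0 = np.polyfit(t, np.log(success_rates), 1)
    return np.exp(log_e0), -slope            # (E0, decay rate lam > 0)

def intervention_point(lam, floor=0.2):
    """First attempt at which predicted effectiveness drops below `floor`
    times its initial value: a natural point for a strategic fresh start."""
    return int(np.ceil(np.log(1 / floor) / lam))

e0, lam = fit_decay([0.50, 0.28, 0.16, 0.09])   # toy per-attempt fix rates
print(intervention_point(lam))                   # -> restart after ~3 attempts
```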
zh
[AI-29] ADNF-Clustering: An Adaptive and Dynamic Neuro-Fuzzy Clustering for Leukemia Prediction
[Quick Read]: This paper addresses the inability of conventional clustering methods used in leukemia diagnosis and monitoring to track evolving cellular patterns and quantify uncertainty in real time. The key to the solution is Adaptive and Dynamic Neuro-Fuzzy Clustering (ADNF), a streaming-capable framework that couples convolutional-neural-network feature extraction with an online fuzzy clustering engine: a Fuzzy Temporal Index (FTI) continuously updates micro-cluster centers, densities, and fuzziness parameters, while a topology-refinement stage performs density-weighted merging and entropy-guided splitting to guard against over- and under-segmentation, improving clustering accuracy and adaptivity.
Link: https://arxiv.org/abs/2506.18396
Authors: Marco Aruta, Ciro Listone, Giuseppe Murano, Aniello Murano
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, under review
Abstract:Leukemia diagnosis and monitoring rely increasingly on high-throughput image data, yet conventional clustering methods lack the flexibility to accommodate evolving cellular patterns and quantify uncertainty in real time. We introduce Adaptive and Dynamic Neuro-Fuzzy Clustering, a novel streaming-capable framework that combines Convolutional Neural Network-based feature extraction with an online fuzzy clustering engine. ADNF initializes soft partitions via Fuzzy C-Means, then continuously updates micro-cluster centers, densities, and fuzziness parameters using a Fuzzy Temporal Index (FTI) that measures entropy evolution. A topology refinement stage performs density-weighted merging and entropy-guided splitting to guard against over- and under-segmentation. On the C-NMC leukemia microscopy dataset, our tool achieves a silhouette score of 0.51, demonstrating superior cohesion and separation over static baselines. The method’s adaptive uncertainty modeling and label-free operation hold immediate potential for integration within the INFANT pediatric oncology network, enabling scalable, up-to-date support for personalized leukemia management.
zh
[AI-30] LOGICPO: Efficient Translation of NL-based Logical Problems to FOL using LLMs and Preference Optimization
[Quick Read]: This paper addresses the failure of large language models (LLMs) to convert natural-language reasoning problems into equivalent logical formulations, which limits their overall reasoning ability. The key to the solution is fine-tuning on a preference-optimization dataset so that models learn to parse and represent a natural-language problem as a whole into a consistent logical program, specifically by constructing a new supervised and preference-optimization dataset, LogicPO, and fine-tuning open-source LLMs with techniques such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO).
Link: https://arxiv.org/abs/2506.18383
Authors: Koushik Viswanadha, Deepanway Ghosal, Somak Aditya
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Logical reasoning is a key task for artificial intelligence due to its role in major downstream tasks such as Question Answering and Summarization. Recent methods for improving the reasoning ability of LLMs fall short in correctly converting a natural language reasoning problem to an equivalent logical formulation, which hinders the framework's overall ability to reason. Towards this, we propose to use finetuning on a preference optimization dataset to learn to parse and represent a natural language problem as a whole to a consistent logical program by 1) introducing a new supervised and preference optimization dataset LogicPO, and 2) adopting popular techniques such as Direct Preference Optimization (DPO) and Kahneman-Tversky optimization (KTO) to finetune open-source LLMs. Our best model with Phi-3.5 consistently outperforms GPT-3.5-turbo (8-shot), producing 10% more logically correct outputs with 14% fewer syntax errors. Through the framework and our improved evaluation metrics, we offer a promising direction in improving the logical reasoning of LLMs by better representing them in their logical formulations.
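The DPO objective used here is the standard one. A minimal sketch on precomputed sequence log-probabilities; beta and the toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on sequence log-probs: push the policy to prefer the
    chosen (logically correct) formulation over the rejected one, relative to
    a frozen reference model."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# toy batch of sequence log-probabilities (assumed precomputed elsewhere)
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```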
zh
[AI-31] PERSCEN: Learning Personalized Interaction Pattern and Scenario Preference for Multi-Scenario Matching KDD2025
[Quick Read]: This paper addresses the lack of user-specific modeling in multi-scenario recommendation, which prevents existing methods from generating personalized user representations that capture both preferences shared across all scenarios and scenario-aware preferences specific to each scenario. The key to the solution is the PERSCEN framework: it builds a user-specific feature graph from user characteristics and applies a lightweight graph neural network to capture higher-order interaction patterns, enabling personalized extraction of cross-scenario shared preferences; it then uses vector quantization to distil scenario-aware preferences from users' behavior sequences within individual scenarios, completing user-specific, scenario-aware preference modeling.
Link: https://arxiv.org/abs/2506.18382
Authors: Haotong Du, Yaqing Wang, Fei Xiong, Lei Shao, Ming Liu, Hao Gu, Quanming Yao, Zhen Wang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by KDD 2025
Abstract:With the expansion of business scales and scopes on online platforms, multi-scenario matching has become a mainstream solution to reduce maintenance costs and alleviate data sparsity. The key to effective multi-scenario recommendation lies in capturing both user preferences shared across all scenarios and scenario-aware preferences specific to each scenario. However, existing methods often overlook user-specific modeling, limiting the generation of personalized user representations. To address this, we propose PERSCEN, an innovative approach that incorporates user-specific modeling into multi-scenario matching. PERSCEN constructs a user-specific feature graph based on user characteristics and employs a lightweight graph neural network to capture higher-order interaction patterns, enabling personalized extraction of preferences shared across scenarios. Additionally, we leverage vector quantization techniques to distil scenario-aware preferences from users’ behavior sequence within individual scenarios, facilitating user-specific and scenario-aware preference modeling. To enhance efficient and flexible information transfer, we introduce a progressive scenario-aware gated linear unit that allows fine-grained, low-latency fusion. Extensive experiments demonstrate that PERSCEN outperforms existing methods. Further efficiency analysis confirms that PERSCEN effectively balances performance with computational cost, ensuring its practicality for real-world industrial systems.
zh
[AI-32] Robots and Children that Learn Together: Improving Knowledge Retention by Teaching Peer-Like Interactive Robots
[Quick Read]: This paper addresses how the Learning-by-Teaching (LbT) paradigm can be implemented with autonomous, peer-like social robots in real classrooms; existing work has mostly relied on scripted or Wizard-of-Oz behaviors, limiting our understanding of how AI agents can support real-time interactive learning. The key to the solution is introducing Interactive Reinforcement Learning (RL) as the cognitive model for teachable social robots, enabling the robot to learn from children's evaluative feedback and thereby adapt dynamically and teach effectively.
Link: https://arxiv.org/abs/2506.18365
Authors: Imene Tarakli, Samuele Vinanzi, Richard Moore, Alessandro Di Nuovo
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Despite growing interest in Learning-by-Teaching (LbT), few studies have explored how this paradigm can be implemented with autonomous, peer-like social robots in real classrooms. Most prior work has relied on scripted or Wizard-of-Oz behaviors, limiting our understanding of how real-time, interactive learning can be supported by artificial agents. This study addresses this gap by introducing Interactive Reinforcement Learning (RL) as a cognitive model for teachable social robots. We conducted two between-subject experiments with 58 primary school children, who either taught a robot or practiced independently on a tablet while learning French vocabulary (memorization) and grammatical rules (inference). The robot, powered by Interactive RL, learned from the child’s evaluative feedback. Children in the LbT condition achieved significantly higher retention gains compared to those in the self-practice condition, especially on the grammar task. Learners with lower prior knowledge benefited most from teaching the robot. Behavioural metrics revealed that children adapted their teaching strategies over time and engaged more deeply during inference tasks. This work makes two contributions: (1) it introduces Interactive RL as a pedagogically effective and scalable model for peer-robot learning, and (2) it demonstrates, for the first time, the feasibility of deploying multiple autonomous robots simultaneously in real classrooms. These findings extend theoretical understanding of LbT by showing that social robots can function not only as passive tutees but as adaptive partners that enhance meta-cognitive engagement and long-term learning outcomes.
zh
[AI-33] Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team
[Quick Read]: This paper addresses the lack of interactive reasoning and evaluation mechanisms, essential to real scientific research, in current scientific-discovery agents based on large language models (LLMs). The key to the solution is the IDVSCI framework, with two core innovations: a Dynamic Knowledge Exchange mechanism that enables iterative feedback among agents, and a Dual-Diversity Review paradigm that simulates evaluation by heterogeneous experts, together promoting deeper reasoning and the generation of more creative and impactful scientific ideas.
Link: https://arxiv.org/abs/2506.18348
Authors: Weilun Yu, Shixiang Tang, Yonggui Huang, Nanqing Dong, Li Fan, Honggang Qi, Wei Liu, Xiaoli Diao, Xi Chen, Wanli Ouyang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM-based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real-world research. We propose IDVSCI (Internal Discussion and Vote SCIentists), a multi-agent framework built on LLMs that incorporates two key innovations: a Dynamic Knowledge Exchange mechanism enabling iterative feedback among agents, and a Dual-Diversity Review paradigm that simulates heterogeneous expert evaluation. These components jointly promote deeper reasoning and the generation of more creative and impactful scientific ideas. To evaluate the effectiveness and generalizability of our approach, we conduct experiments on two datasets: a widely used benchmark in computer science and a new dataset we introduce in the health sciences domain. Results show that IDVSCI consistently achieves the best performance across both datasets, outperforming existing systems such as AI Scientist and VIRSCI. These findings highlight the value of modeling interaction and peer review dynamics in LLM-based autonomous research.
zh
[AI-34] Controlled Generation with Equivariant Variational Flow Matching
[Quick Read]: This paper addresses controlled generation and symmetry preservation in generative models: how to obtain post hoc control of an unconditional model without retraining, while ensuring that outputs remain invariant under rotations, translations, and permutations. The key to the solution is casting the flow-matching objective as a variational inference problem, which supports controlled generation in two ways: end-to-end training of conditional generative models, or post hoc control as a Bayesian inference problem. The paper further provides an equivariant flow-matching formulation tailored to molecular generation, guaranteeing invariance of the generative process under symmetry transformations.
Link: https://arxiv.org/abs/2506.18340
Authors: Floor Eijkelboom, Heiko Zimmermann, Sharvaree Vadgama, Erik J Bekkers, Max Welling, Christian A. Naesseth, Jan-Willem van de Meent
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.
zh
[AI-35] Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics
[Quick Read]: This paper addresses the difficulty of achieving both high accuracy and physical interpretability when modeling nonlinear dynamical systems. The key to the solution is Structured Kolmogorov-Arnold Neural ODEs (SKANODEs), which combine structured state-space modeling with Kolmogorov-Arnold Networks (KANs): a KAN acts as a universal function approximator for virtual sensing, recovering physically meaningful latent states such as positions and velocities; the symbolic-regression capability of the KAN is then exploited to extract compact, interpretable expressions for the governing dynamics, which are substituted back into the Neural ODE framework and calibrated through continued training to refine both the discovered equations and the predictive accuracy of the model.
Link: https://arxiv.org/abs/2506.18339
Authors: Wei Liu, Kiran Bacsa, Loon Ching Tang, Eleni Chatzi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Understanding and modeling nonlinear dynamical systems is a fundamental problem across scientific and engineering domains. While deep learning has demonstrated remarkable potential for learning complex system behavior, achieving models that are both highly accurate and physically interpretable remains a major challenge. To address this, we propose Structured Kolmogorov-Arnold Neural ODEs (SKANODEs), a novel framework that integrates structured state-space modeling with the Kolmogorov-Arnold Network (KAN). SKANODE first employs a fully trainable KAN as a universal function approximator within a structured Neural ODE framework to perform virtual sensing, recovering latent states that correspond to physically interpretable quantities such as positions and velocities. Once this structured latent representation is established, we exploit the symbolic regression capability of KAN to extract compact and interpretable expressions for the system’s governing dynamics. The resulting symbolic expression is then substituted back into the Neural ODE framework and further calibrated through continued training to refine its coefficients, enhancing both the precision of the discovered equations and the predictive accuracy of system responses. Extensive experiments on both simulated and real-world systems demonstrate that SKANODE achieves superior performance while offering interpretable, physics-consistent models that uncover the underlying mechanisms of nonlinear dynamical systems.
zh
[AI-36] Bias vs Bias – Dawn of Justice: A Fair Fight in Recommendation Systems
[Quick Read]: This paper addresses unfair recommendations caused by bias across item categories, as well as the limitation that existing fairness-aware re-ranking methods mainly target binary sensitive attributes and overlook multi-valued ones. The key to the solution is a fairness-aware re-ranking approach that leverages existing biases to correct disparities in recommendations across demographic groups, effectively mitigating social bias over multiple sensitive attributes including gender, age, and occupation.
Link: https://arxiv.org/abs/2506.18327
Authors: Tahsin Alamgir Kheya, Mohamed Reda Bouadjenek, Sunil Aryal
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recommendation systems play a crucial role in our daily lives by impacting user experience across various domains, including e-commerce, job advertisements, entertainment, etc. Given the vital role of such systems in our lives, practitioners must ensure they do not produce unfair and imbalanced recommendations. Previous work addressing bias in recommendations overlooked bias in certain item categories, potentially leaving some biases unaddressed. Additionally, most previous work on fair re-ranking focused on binary-sensitive attributes. In this paper, we address these issues by proposing a fairness-aware re-ranking approach that helps mitigate bias in different categories of items. This re-ranking approach leverages existing biases to correct disparities in recommendations across various demographic groups. We show how our approach can mitigate bias on multiple sensitive attributes, including gender, age, and occupation. We experimented on three real-world datasets to evaluate the effectiveness of our re-ranking scheme in mitigating bias in recommendations. Our results show how this approach helps mitigate social bias with little to no degradation in performance.
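One simple instantiation of fairness-aware re-ranking is a greedy pass that keeps each group near a target exposure share. A sketch assuming per-group quotas; the paper's scheme, which leverages existing biases across multiple attributes, is more involved.

```python
def fair_rerank(ranked, group_of, quotas, k=10):
    """Greedy re-ranking: at each slot, pick the best-ranked remaining item
    whose group lags furthest behind its exposure quota. `quotas` must cover
    every group appearing in `ranked` and sum to ~1."""
    counts = {g: 0 for g in quotas}
    out, pool = [], list(ranked)
    while pool and len(out) < k:
        deficit = {g: quotas[g] * (len(out) + 1) - counts[g] for g in quotas}
        # break deficit ties in favour of the originally higher-ranked item
        pick = max(pool, key=lambda i: (deficit[group_of(i)], -pool.index(i)))
        pool.remove(pick)
        counts[group_of(pick)] += 1
        out.append(pick)
    return out

ranked = ["a1", "b1", "a2", "a3", "b2"]          # relevance order, groups a/b
print(fair_rerank(ranked, lambda i: i[0], {"a": 0.5, "b": 0.5}, k=4))
# -> ['a1', 'b1', 'a2', 'b2']
```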
zh
[AI-37] Use Property-Based Testing to Bridge LLM Code Generation and Validation
[Quick Read]: This paper addresses the difficulty of ensuring the functional correctness of code generated by large language models (LLMs), especially on complex programming tasks: traditional test-driven development (TDD) is undermined when combined with LLMs by the scarcity of high-quality test cases and the pitfalls of automated test generation, such as biased tests or inaccurate output predictions that misdirect the correction process. The key to the solution is the Property-Generated Solver framework, which uses Property-Based Testing (PBT) to validate high-level program properties or invariants instead of concrete input-output examples, avoiding the "cycle of self-deception"; two collaborating LLM agents, a Generator for code generation and iterative refinement and a Tester managing the PBT life-cycle, drive the code toward correctness with feedback derived from property violations.
Link: https://arxiv.org/abs/2506.18315
Authors: Lehan He, Zeren Chen, Zhe Zhang, Jing Shao, Xiang Gao, Lu Sheng
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) excel at code generation, but ensuring their outputs to be functionally correct, especially in complex programming tasks, is a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or the pitfalls of automated test generation, including biased tests or inaccurate output predictions that can misdirect the correction process. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties or invariants, instead of relying on specific input-output examples. These properties are often simpler to define and verify than directly predicting exhaustive test oracles, breaking the “cycle of self-deception” where tests might share flaws with the code they are meant to validate. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle and formulate semantically rich feedback from property violations. The resulting comprehensive and actionable feedback then guides the Generator in its refinement efforts. By establishing PBT as the core validation engine within this iterative, closed-loop paradigm, Property-Generated Solver provides a robust mechanism for steering LLMs towards more correct and generalizable code. Extensive experimental results on multiple code generation benchmarks demonstrate that Property-Generated Solver achieves substantial pass@1 improvements, ranging from 23.1% to 37.3% relative gains over established TDD methods.
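The property-first idea is easy to show with the hypothesis library: instead of predicting exact outputs, the Tester asserts invariants that any correct solution must satisfy. A minimal sketch, with a toy function standing in for LLM-generated code under test.

```python
# pip install hypothesis
from hypothesis import given, strategies as st

def dedup_keep_order(xs):          # stand-in for LLM-generated code under test
    return list(dict.fromkeys(xs))

@given(st.lists(st.integers()))
def test_properties(xs):
    out = dedup_keep_order(xs)
    assert len(set(out)) == len(out)          # property: no duplicates
    assert set(out) == set(xs)                # property: same elements
    assert all(a in xs for a in out)          # property: nothing invented

test_properties()   # hypothesis generates and shrinks counterexamples
```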
zh
[AI-38] LettinGo: Explore User Profile Generation for Recommendation System
[Quick Read]: This paper addresses the poor interpretability and adaptability of traditional embedding-based user profiles, and the limitation that existing fixed-format methods fail to capture the full diversity of user behavior. The key to the solution is the LettinGo framework, which exploits the expressive power of large language models (LLMs) together with direct feedback from downstream recommendation tasks, using Direct Preference Optimization (DPO) to align the profile generator with task performance and thereby generate flexible, adaptive user profiles.
Link: https://arxiv.org/abs/2506.18309
Authors: Lu Wang, Di Zhang, Fangkai Yang, Pu Zhao, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures
Abstract:User profiling is pivotal for recommendation systems, as it transforms raw user interaction data into concise and structured representations that drive personalized recommendations. While traditional embedding-based profiles lack interpretability and adaptability, recent advances with large language models (LLMs) enable text-based profiles that are semantically richer and more transparent. However, existing methods often adhere to fixed formats that limit their ability to capture the full diversity of user behaviors. In this paper, we introduce LettinGo, a novel framework for generating diverse and adaptive user profiles. By leveraging the expressive power of LLMs and incorporating direct feedback from downstream recommendation tasks, our approach avoids the rigid constraints imposed by supervised fine-tuning (SFT). Instead, we employ Direct Preference Optimization (DPO) to align the profile generator with task-specific performance, ensuring that the profiles remain adaptive and effective. LettinGo operates in three stages: (1) exploring diverse user profiles via multiple LLMs, (2) evaluating profile quality based on their impact in recommendation systems, and (3) aligning the profile generation through pairwise preference data derived from task performance. Experimental results demonstrate that our framework significantly enhances recommendation accuracy, flexibility, and contextual awareness. This work enhances profile generation as a key innovation for next-generation recommendation systems.
zh
[AI-39] Spiffy: Efficient Implementation of CoLaNET for Raspberry Pi
[Quick Read]: This paper addresses how to run spiking neural networks (SNNs) efficiently without dedicated neuromorphic hardware or frameworks. The key to the solution is a lightweight software approach: a specific SNN architecture (CoLaNET) is implemented in Rust and optimized for general-purpose computing platforms, achieving high accuracy and low latency on inexpensive devices such as the Raspberry Pi.
Link: https://arxiv.org/abs/2506.18306
Authors: Andrey Derzhavin, Denis Larionov
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures
Abstract:This paper presents a lightweight software-based approach for running spiking neural networks (SNNs) without relying on specialized neuromorphic hardware or frameworks. Instead, we implement a specific SNN architecture (CoLaNET) in Rust and optimize it for common computing platforms. As a case study, we demonstrate our implementation, called Spiffy, on a Raspberry Pi using the MNIST dataset. Spiffy achieves 92% accuracy with low latency - just 0.9 ms per training step and 0.45 ms per inference step. The code is open-source.
zh
[AI-40] Sharpening the Spear: Adaptive Expert-Guided Adversarial Attack Against DRL-based Autonomous Driving Policies
[Quick Read]: This paper addresses adversarial attacks against deep reinforcement learning (DRL) for autonomous driving, in particular the difficulty existing attack methods have in balancing attack frequency and efficiency against training stability. The key to the solution is an adaptive expert-guided adversarial attack: an expert policy is distilled from successful attack demonstrations via imitation learning and generalized with an ensemble Mixture-of-Experts architecture; a KL-divergence regularization term then guides the DRL adversary's attack policy, and a performance-aware annealing strategy gradually reduces reliance on the expert as the adversary improves, enhancing both the stability and the efficiency of attack-policy training.
Link: https://arxiv.org/abs/2506.18304
Authors: Junchao Fan, Xuyang Lei, Xiaolin Chang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures, 2 tables
Abstract:Deep reinforcement learning (DRL) has emerged as a promising paradigm for autonomous driving. However, despite their advanced capabilities, DRL-based policies remain highly vulnerable to adversarial attacks, posing serious safety risks in real-world deployments. Investigating such attacks is crucial for revealing policy vulnerabilities and guiding the development of more robust autonomous systems. While prior attack methods have made notable progress, they still face several challenges: 1) they often rely on high-frequency attacks, yet critical attack opportunities are typically context-dependent and temporally sparse, resulting in inefficient attack patterns; 2) restricting attack frequency can improve efficiency but often results in unstable training due to the adversary's limited exploration. To address these challenges, we propose an adaptive expert-guided adversarial attack method that enhances both the stability and efficiency of attack policy training. Our method first derives an expert policy from successful attack demonstrations using imitation learning, strengthened by an ensemble Mixture-of-Experts architecture for robust generalization across scenarios. This expert policy then guides a DRL-based adversary through a KL-divergence regularization term. Due to the diversity of scenarios, expert policies may be imperfect. To address this, we further introduce a performance-aware annealing strategy that gradually reduces reliance on the expert as the adversary improves. Extensive experiments demonstrate that our method outperforms existing approaches in terms of collision rate, attack efficiency, and training stability, especially in cases where the expert policy is sub-optimal.
zh
[AI-41] GeNeRT: A Physics-Informed Approach to Intelligent Wireless Channel Modeling via Generalizable Neural Ray Tracing
[Quick Read]: This paper addresses the limited generalization and weak adherence to electromagnetic laws of existing neural ray tracing (NRT) methods. The key to the solution is GeNeRT, a generalizable neural ray-tracing framework that introduces a Fresnel-inspired neural network design to improve the accuracy of multipath component (MPC) prediction, and a GPU-tensorized acceleration strategy to improve runtime efficiency, achieving both intra-scenario spatial transferability and zero-shot generalization across unseen scenarios.
Link: https://arxiv.org/abs/2506.18295
Authors: Kejia Bian, Meixia Tao, Shu Sun, Jun Yu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural ray tracing (RT) has emerged as a promising paradigm for channel modeling by combining physical propagation principles with neural networks. It enables high modeling accuracy and efficiency. However, current neural RT methods face two key limitations: constrained generalization capability due to strong spatial dependence, and weak adherence to electromagnetic laws. In this paper, we propose GeNeRT, a Generalizable Neural RT framework with enhanced generalization, accuracy and efficiency. GeNeRT supports both intra-scenario spatial transferability and inter-scenario zero-shot generalization. By incorporating Fresnel-inspired neural network design, it also achieves higher accuracy in multipath component (MPC) prediction. Furthermore, a GPU-tensorized acceleration strategy is introduced to improve runtime efficiency. Extensive experiments conducted in outdoor scenarios demonstrate that GeNeRT generalizes well across untrained regions within a scenario and entirely unseen environments, and achieves superior accuracy in MPC prediction compared to baselines. Moreover, it outperforms Wireless Insite in runtime efficiency, particularly in multi-transmitter settings. Ablation experiments validate the effectiveness of the network architecture and training strategy in capturing physical principles of ray-surface interactions.
zh
[AI-42] Turning AI Green: Exploring Energy Efficiency Cascading with Orthogonal Optimizations
[Quick Read]: This paper addresses the inefficiency that arises as artificial intelligence (AI) compute demands and energy consumption keep growing while existing optimization techniques ("knobs") are applied as reactive afterthoughts, without a systematic understanding of their combined effects on energy efficiency. The key to the solution is treating energy efficiency as a first-class design consideration for compute-intensive pipelines: strategic choices across five AI pipeline phases (data, model, training, system, inference) create cascading efficiency gains. Experimental validation shows that orthogonal combinations reduce energy consumption by up to 94.6% while preserving 95.95% of the original F1 score. This curated approach offers an actionable framework for sustainable AI that balances efficiency, performance, and environmental responsibility.
Link: https://arxiv.org/abs/2506.18289
Authors: Saurabhsingh Rajput, Mootez Saad, Tushar Sharma
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: In review
Abstract:AI's exponential growth intensifies computational demands and energy challenges. While practitioners employ various optimization techniques, which we refer to as "knobs" in this paper, to tune model efficiency, these are typically afterthoughts and reactive ad-hoc changes applied in isolation without understanding their combinatorial effects on energy efficiency. This paper emphasizes treating energy efficiency as a first-class citizen and as a fundamental design consideration for a compute-intensive pipeline. We show that strategic selection across five AI pipeline phases (data, model, training, system, inference) creates cascading efficiency. Experimental validation shows orthogonal combinations reduce energy consumption by up to 94.6% while preserving 95.95% of the original F1 score of non-optimized pipelines. This curated approach provides actionable frameworks for informed sustainable AI that balance efficiency, performance, and environmental responsibility.
zh
[AI-43] Learning Causal Graphs at Scale: A Foundation Model Approach
[Quick Read]: This paper addresses two core challenges in DAG (directed acyclic graph) learning: the super-exponential growth of computational cost, and identifiability issues in small-sample regimes. The key to the solution is building on the success of linear transformers with an attention-based architecture, Attention-DAG (ADAG), which maps observed data to both graph structure and parameters through a nonlinear attention-based kernel, enabling efficient multi-task estimation of linear structural equation models (SEMs) across tasks. By casting the multi-task learning process as a continuous optimization problem, the pre-trained ADAG model captures common structural properties as a shared low-dimensional prior, reducing the ill-posedness of downstream DAG learning tasks in small-sample settings.
Link: https://arxiv.org/abs/2506.18285
Authors: Naiyu Yin, Tian Gao, Yue Yu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Due to its human-interpretability and invariance properties, Directed Acyclic Graph (DAG) has been a foundational tool across various areas of AI research, leading to significant advancements. However, DAG learning remains highly challenging, due to its super-exponential growth in computational cost and identifiability issues, particularly in small-sample regimes. To address these two challenges, in this work we leverage the recent success of linear transformers and develop a foundation model approach for discovering multiple order-consistent DAGs across tasks. In particular, we propose Attention-DAG (ADAG), a novel attention-mechanism-based architecture for learning multiple linear Structural Equation Models (SEMs). ADAG learns the mapping from observed data to both graph structure and parameters via a nonlinear attention-based kernel, enabling efficient multi-task estimation of the underlying linear SEMs. By formulating the learning process across multiple tasks as a continuous optimization problem, the pre-trained ADAG model captures the common structural properties as a shared low-dimensional prior, thereby reducing the ill-posedness of downstream DAG learning tasks in small-sample regimes. We evaluate our proposed approach on benchmark synthetic datasets and find that ADAG achieves substantial improvements in both DAG learning accuracy and zero-shot inference efficiency. To the best of our knowledge, this is the first practical approach for pre-training a foundation model specifically designed for DAG learning, representing a step toward more efficient and generalizable down-stream applications in causal discovery.
zh
[AI-44] ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs
[Quick Read]: This paper addresses the inflexibility of fixed-rank allocation in conventional Low-Rank Adaptation (LoRA): a uniform rank across transformer layers and attention heads ignores their heterogeneous learning dynamics. The key to the solution is Adaptive Rank Dynamic LoRA (ARD-LoRA), which automates rank allocation through learnable scaling factors optimized via a meta-objective balancing task performance and parameter efficiency, combining an ℓ1 sparsity term for minimal rank with Total Variation regularization for stable rank transitions, thus enabling continuous, differentiable, per-head rank adaptation.
Link: https://arxiv.org/abs/2506.18267
Authors: Haseeb Ullah Khan Shinwari, Muhammad Usama
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Conventional Low-Rank Adaptation (LoRA) methods employ a fixed rank, imposing uniform adaptation across transformer layers and attention heads despite their heterogeneous learning dynamics. This paper introduces Adaptive Rank Dynamic LoRA (ARD-LoRA), a novel framework that automates rank allocation through learnable scaling factors. These factors are optimized via a meta-objective balancing task performance and parameter efficiency, incorporating \ell_1 sparsity for minimal rank and Total Variation regularization for stable rank transitions. ARD-LoRA enables continuous, differentiable, per-head rank adaptation. Experiments on LLAMA-3.1-70B and PaliGemma-2 demonstrate ARD-LoRA’s efficacy, achieving up to 99.3% of full fine-tuning performance with only 0.32% trainable parameters, outperforming strong baselines like DoRA and AdaLoRA. Furthermore, it reduces multimodal adaptation memory by 41%. These results establish dynamic, fine-grained rank allocation as a critical paradigm for efficient foundation model adaptation.
zh
[AI-45] Advanced For-Loop for QML algorithm search
[Quick Read]: This paper addresses the automated search and optimization of quantum machine learning (QML) algorithms, whose core challenge is efficiently translating classical machine learning concepts into algorithms suited to quantum computing. The key to the solution is a Large Language Model-based Multi-Agent System (LLMMA) that iteratively generates and refines, at an abstract level, quantum transformations of classical ML algorithms (such as the multi-layer perceptron, the forward-forward algorithm, and backpropagation), enabling automated exploration and improvement of QML algorithms.
Link: https://arxiv.org/abs/2506.18260
Authors: FuTe Wong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages, 8 figures
Abstract:This paper introduces an advanced framework leveraging Large Language Model-based Multi-Agent Systems (LLMMA) for the automated search and optimization of Quantum Machine Learning (QML) algorithms. Inspired by Google DeepMind's FunSearch, the proposed system works at an abstract level to iteratively generate and refine quantum transformations of classical machine learning algorithms (concepts), such as the Multi-Layer Perceptron, forward-forward and backpropagation algorithms. As a proof of concept, this work highlights the potential of agentic frameworks to systematically explore classical machine learning concepts and adapt them for quantum computing, paving the way for efficient and automated development of QML algorithms. Future directions include incorporating planning mechanisms and optimizing strategies in the search space for broader applications in quantum-enhanced machine learning.
zh
[AI-46] Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection ISSTA2025
[Quick Read]: This paper addresses two main problems in smart-contract vulnerability detection: existing datasets lack comprehensive coverage and high-quality explanations, and large language models (LLMs) struggle to interpret security-specific smart-contract concepts accurately. The key to the solution is constructing a comprehensive dataset covering four major vulnerability types plus machine-unauditable vulnerabilities, and optimizing the model through continual pre-training (CPT), supervised fine-tuning (SFT), and Direct Preference Optimization (DPO) to improve both detection performance and explanation quality.
Link: https://arxiv.org/abs/2506.18245
Authors: Lei Yu, Zhirong Huang, Hang Yuan, Shiqi Cheng, Li Yang, Fengjun Zhang, Chenjie Shen, Jiajia Ma, Jingyuan Zhang, Junyi Lu, Chun Zuo
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted to ISSTA 2025
Abstract:Smart contract vulnerability detection remains a major challenge in blockchain security. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensive coverage and high-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Empirical analysis shows that even after continual pre-training (CPT) and supervised fine-tuning (SFT), LLMs may misinterpret the execution order of state changes, resulting in incorrect explanations despite making correct detection decisions. To address these challenges, we propose Smart-LLaMA-DPO based on LLaMA-3.1-8B. We construct a comprehensive dataset covering four major vulnerability types and machine-unauditable vulnerabilities, including precise labels, explanations, and locations for SFT, as well as high-quality and low-quality output pairs for Direct Preference Optimization (DPO). Second, we perform CPT using large-scale smart contract data to enhance the LLM's understanding of specific security practices in smart contracts. Furthermore, we conduct SFT with our comprehensive dataset. Finally, we apply DPO, leveraging human feedback and a specially designed loss function that increases the probability of preferred explanations while reducing the likelihood of non-preferred outputs. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation confirm that our method generates more correct, thorough, and clear explanations.
zh
[AI-47] Quantum-Classical Hybrid Quantized Neural Network
【速读】:该论文试图解决在量化神经网络训练中使用任意激活函数和损失函数的难题,以及大规模求解二次约束二值优化(QCBO)模型时因约束过多导致的惩罚系数调优复杂性问题。其解决方案的关键在于提出一种基于样条插值的二次二值优化(QBO)模型,并引入前向区间传播(FIP)方法,通过将激活函数离散化为线性子区间来处理非线性和多层复合结构,同时利用量子条件梯度下降(QCGD)算法直接求解QCBO问题,从而提升求解效率与精度。
链接: https://arxiv.org/abs/2506.18240
作者: Wenxin Li,Chuan Wang,Hongdong Zhu,Qi Gao,Yin Ma,Hai Wei,Kai Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optics (physics.optics)
备注: 30 pages, 5 figures, comments are welcome
Abstract:Here in this work, we present a novel Quadratic Binary Optimization (QBO) model for quantized neural network training, enabling the use of arbitrary activation and loss functions through spline interpolation. We introduce Forward Interval Propagation (FIP), a method designed to tackle the challenges of non-linearity and the multi-layer composite structure in neural networks by discretizing activation functions into linear subintervals. This approach preserves the universal approximation properties of neural networks while allowing complex nonlinear functions to be optimized using quantum computers, thus broadening their applicability in artificial intelligence. We provide theoretical upper bounds on the approximation error and the number of Ising spins required, by deriving the sample complexity of the empirical risk minimization problem, from an optimization perspective. A significant challenge in solving the associated Quadratic Constrained Binary Optimization (QCBO) model on a large scale is the presence of numerous constraints. When employing the penalty method to handle these constraints, tuning a large number of penalty coefficients becomes a critical hyperparameter optimization problem, increasing computational complexity and potentially affecting solution quality. To address this, we employ the Quantum Conditional Gradient Descent (QCGD) algorithm, which leverages quantum computing to directly solve the QCBO problem. We prove the convergence of QCGD under a quantum oracle with randomness and bounded variance in objective value, as well as under limited precision constraints in the coefficient matrix. Additionally, we provide an upper bound on the Time-To-Solution for the QCBO solving process. Experimental results using a coherent Ising machine (CIM) demonstrate a 94.95% accuracy on the Fashion MNIST classification task, with only 1.1-bit precision.
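下面用 NumPy 给出 FIP"把激活函数离散为线性子区间"这一思想的最小示意,仅演示分段线性化与区间求值,不涉及论文中的二值变量编码与 QCBO 建模;区间范围、分段数与函数名均为示意性假设:

```python
import numpy as np

def piecewise_linearize(f, lo, hi, n_segments):
    """把非线性激活函数离散为 n_segments 个线性子区间。

    返回区间端点与每段的斜率、截距;在 FIP 的思路中,
    这些线性片段可进一步编码为二值优化变量。
    """
    xs = np.linspace(lo, hi, n_segments + 1)
    ys = f(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    intercepts = ys[:-1] - slopes * xs[:-1]
    return xs, slopes, intercepts

def eval_piecewise(x, xs, slopes, intercepts):
    """按 x 所在子区间取对应的线性函数求值。"""
    i = np.clip(np.searchsorted(xs, x) - 1, 0, len(slopes) - 1)
    return slopes[i] * x + intercepts[i]

xs, k, b = piecewise_linearize(np.tanh, -3.0, 3.0, 16)
print(eval_piecewise(0.5, xs, k, b), np.tanh(0.5))  # 二者应非常接近
```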
zh
[AI-48] The 4th Dimension for Scaling Model Size
【速读】:该论文试图解决大规模语言模型扩展中的参数增长与模型性能提升之间的关系问题,特别是探索在不增加总体参数数量的情况下提升模型能力的途径。其解决方案的关键在于引入虚拟逻辑深度(Virtual Logical Depth, VLD)这一第四维度,通过在模型内部重用参数来增加有效算法深度,从而在保持参数总量不变的情况下提升模型的推理能力。
链接: https://arxiv.org/abs/2506.18233
作者: Ruike Zhu,Hanwen Zhang,Tianyu Shi,Chi Wang,Tianyi Zhou,Zengyi Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scaling the size of large language models typically involves three dimensions: depth, width, and the number of parameters. In this work, we explore a fourth dimension, virtual logical depth (VLD), which increases the effective algorithmic depth without changing the overall parameter count by reusing parameters within the model. Although parameter reuse is not a new concept, its potential and characteristics in model scaling have not been thoroughly studied. Through carefully designed controlled experiments, we make the following key discoveries regarding VLD scaling: VLD scaling forces the knowledge capacity of the model to remain almost constant, with only minor variations. VLD scaling enables a significant improvement in reasoning capability, provided the scaling method is properly implemented. The number of parameters correlates with knowledge capacity, but not with reasoning capability. Under certain conditions, it is not necessary to increase the parameter count to enhance reasoning. These findings are consistent across various model configurations and are likely to be generally valid within the scope of our experiments.
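作为"通过参数复用增加虚拟逻辑深度"的直观草图,下面给出一个极简 PyTorch 示意;复用调度(按轮循环整组层)只是多种可能方案之一,论文实际采用的 VLD 方案可能不同:

```python
import torch
import torch.nn as nn

class ReusedDepthModel(nn.Module):
    """参数复用示意:同一组层被循环调用 reuse_factor 次,
    参数量不变,但有效(虚拟逻辑)深度 = n_unique_layers * reuse_factor。"""
    def __init__(self, d_model=256, n_unique_layers=4, reuse_factor=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_unique_layers)
        )
        self.reuse_factor = reuse_factor

    def forward(self, x):
        for _ in range(self.reuse_factor):
            for layer in self.layers:  # 复用同一份参数
                x = layer(x)
        return x

model = ReusedDepthModel()
out = model(torch.randn(2, 10, 256))  # 12 次层计算,但只有 4 层参数
print(out.shape)
```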
zh
[AI-49] These are Not All the Features You are Looking For: A Fundamental Bottleneck In Supervised Pretraining
【速读】:该论文试图解决迁移学习中因预训练特征不足以处理未见过的数据集而导致的性能下降问题,尤其是在任务间相关性难以量化的情况下。其解决方案的关键在于识别深度学习模型中的“信息饱和瓶颈”,即网络在训练过程中编码了相似的竞争特征后,无法继续学习新特征,导致关键特征丢失和泛化能力下降。研究提出通过构建更丰富的特征表示来改善跨新数据集的泛化能力,并探讨了现有方法与一种新方法的初步步骤作为潜在解决路径。
链接: https://arxiv.org/abs/2506.18221
作者: Xingyu Alice Yang,Jianyu Zhang,Léon Bottou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, Preprint. Under review
Abstract:Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data. However, a significant challenge remains in ensuring that transferred features are sufficient to handle unseen datasets, amplified by the difficulty of quantifying whether two tasks are “related”. To address these challenges, we evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training. We identify a fundamental limitation in deep learning models – an “information saturation bottleneck” – where networks fail to learn new features once they encode similar competing features during training. When restricted to learning only a subset of key features during pretraining, models will permanently lose critical features for transfer and perform inconsistently on data distributions, even components of the training mixture. Empirical evidence from published studies suggests that this phenomenon is pervasive in deep learning architectures – factors such as data distribution or ordering affect the features that current representation learning methods can learn over time. This study suggests that relying solely on large-scale networks may not be as effective as focusing on task-specific training, when available. We propose richer feature representations as a potential solution to better generalize across new datasets and, specifically, present existing methods alongside a novel approach, the initial steps towards addressing this challenge.
zh
[AI-50] A Conceptual Framework for AI Capability Evaluations
【速读】:该论文试图解决当前在AI能力评估中缺乏全面且可靠的方法论的问题(the lack of clarity on how to perform these assessments both comprehensively and reliably)。其解决方案的关键在于提出一个概念性框架,用于分析AI能力评估,该框架通过结构化和描述性的方法系统化地分析广泛使用的评估方法和术语,而无需引入新的分类体系或固定格式,从而支持评估的透明性、可比性和可解释性。
链接: https://arxiv.org/abs/2506.18213
作者: María Victoria Carro,Denise Alejandra Mester,Francisca Gauna Selasco,Luca Nicolás Forziati Gangi,Matheo Sandleris Musa,Lola Ramos Pereyra,Mario Leiva,Juan Gustavo Corvalan,María Vanina Martinez,Gerardo Simari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2306.04181 by other authors
Abstract:As AI systems advance and integrate into society, well-designed and transparent evaluations are becoming essential tools in AI governance, informing decisions by providing evidence about system capabilities and risks. Yet there remains a lack of clarity on how to perform these assessments both comprehensively and reliably. To address this gap, we propose a conceptual framework for analyzing AI capability evaluations, offering a structured, descriptive approach that systematizes the analysis of widely used methods and terminology without imposing new taxonomies or rigid formats. This framework supports transparency, comparability, and interpretability across diverse evaluations. It also enables researchers to identify methodological weaknesses, assists practitioners in designing evaluations, and provides policymakers with an accessible tool to scrutinize, compare, and navigate complex evaluation landscapes.
zh
[AI-51] Two Sonification Methods for the MindCube
【速读】:该论文旨在解决如何利用MindCube这一交互设备作为音乐接口,以支持情绪调节的问题。其解决方案的关键在于提出两种不同的映射方式,其中基于生成式AI (Generative AI) 的映射方法通过在潜在空间中注入意义并使用外部控制器进行导航,实现了对音乐系统的创造性控制。
链接: https://arxiv.org/abs/2506.18196
作者: Fangzheng Liu,Lancelot Blanchard,Don D. Haddad,Joseph A. Paradiso
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 5 figures
Abstract:In this work, we explore the musical interface potential of the MindCube, an interactive device designed to study emotions. Embedding diverse sensors and input devices, this interface resembles a fidget cube toy commonly used to help users relieve their stress and anxiety. As such, it is a particularly well-suited controller for musical systems that aim to help with emotion regulation. In this regard, we present two different mappings for the MindCube, with and without AI. With our generative AI mapping, we propose a way to infuse meaning within a latent space and techniques to navigate through it with an external controller. We discuss our results and propose directions for future work.
zh
[AI-52] DeInfoReg: A Decoupled Learning Framework for Better Training Throughput
【速读】:该论文试图解决深度学习中长期梯度流导致的梯度消失问题,以及传统反向传播(Backpropagation, BP)在并行计算资源利用上的不足。其解决方案的关键在于提出了一种名为解耦监督学习与信息正则化(Decoupled Supervised Learning with Information Regularization, DeInfoReg)的方法,通过将长梯度流分解为多个较短的梯度流,从而缓解梯度消失问题,并结合流水线策略实现跨多GPU的模型并行化,提升训练吞吐量。
链接: https://arxiv.org/abs/2506.18193
作者: Zih-Hao Huang,You-Teng Lin,Hung-Hsuan Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:This paper introduces Decoupled Supervised Learning with Information Regularization (DeInfoReg), a novel approach that transforms a long gradient flow into multiple shorter ones, thereby mitigating the vanishing gradient problem. Integrating a pipeline strategy, DeInfoReg enables model parallelization across multiple GPUs, significantly improving training throughput. We compare our proposed method with standard backpropagation and other gradient flow decomposition techniques. Extensive experiments on diverse tasks and datasets demonstrate that DeInfoReg achieves superior performance and better noise resistance than traditional BP models and efficiently utilizes parallel computing resources. The code for reproducibility is available at: this https URL.
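下面给出"把长梯度流切成多段短梯度流"这一解耦思想的最小 PyTorch 示意:块间用 detach 截断梯度,每块由本地监督头训练。注意这里用普通分类头代替了 DeInfoReg 的信息正则化项,结构与变量名均为示意性假设:

```python
import torch
import torch.nn as nn

class DecoupledBlock(nn.Module):
    """带本地监督头的块:向后传 detach 后的特征,把长梯度流切成短段。"""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.local_head = nn.Linear(dim, n_classes)  # 本地损失只训练本块

    def forward(self, x, y, criterion):
        h = self.body(x)
        loss = criterion(self.local_head(h), y)
        return h.detach(), loss  # detach 阻断跨块梯度

blocks = nn.ModuleList(DecoupledBlock(32, 10) for _ in range(3))
opt = torch.optim.Adam(blocks.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
for blk in blocks:          # 各块损失互相独立,天然适合流水线并行
    x, loss = blk(x, y, ce)
    loss.backward()
opt.step()
opt.zero_grad()
```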
zh
[AI-53] Call Me Maybe: Enhancing JavaScript Call Graph Construction using Graph Neural Networks
【速读】:该论文试图解决JavaScript中调用图(call graph)构造不准确的问题,现有算法在处理复杂语言特性时无法保证完备性和正确性,导致产生虚假边或遗漏有效边。解决方案的关键在于将问题建模为全程序图上的链接预测任务,并利用图神经网络(Graph Neural Network, GNN)来捕捉代码元素之间的非局部关系。通过结合语法和语义边的丰富表示,GRAPHIA能够从不完美的标签中学习,包括来自现有工具的静态调用边和来自测试的动态边,从而提升未解析调用点的正确目标排名。
链接: https://arxiv.org/abs/2506.18191
作者: Masudul Hasan Masud Bhuiyan,Gianluca De Stefano,Giancarlo Pellegrino,Cristian-Alexandru Staicu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Static analysis plays a key role in finding bugs, including security issues. A critical step in static analysis is building accurate call graphs that model function calls in a program. However, due to hard-to-analyze language features, existing call graph construction algorithms for JavaScript are neither sound nor complete. Prior work shows that even advanced solutions produce false edges and miss valid ones. In this work, we assist these tools by identifying missed call edges. Our main idea is to frame the problem as link prediction on full program graphs, using a rich representation with multiple edge types. Our approach, GRAPHIA, leverages recent advances in graph neural networks to model non-local relationships between code elements. Concretely, we propose representing JavaScript programs using a combination of syntactic- and semantic-based edges. GRAPHIA can learn from imperfect labels, including static call edges from existing tools and dynamic edges from tests, either from the same or different projects. Because call graphs are sparse, standard machine learning metrics like ROC are not suitable. Instead, we evaluate GRAPHIA by ranking function definitions for each unresolved call site. We conduct a large-scale evaluation on 50 popular JavaScript libraries with 163K call edges (150K static and 13K dynamic). GRAPHIA builds program graphs with 6.6M structural and 386K semantic edges. It ranks the correct target as the top candidate in over 42% of unresolved cases and within the top 5 in 72% of cases, reducing the manual effort needed for analysis. Our results show that learning-based methods can improve the recall of JavaScript call graph construction. To our knowledge, this is the first work to apply GNN-based link prediction to full multi-file program graphs for interprocedural analysis.
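下面给出"把调用边恢复建模为候选函数排序"的最小打分器示意;这里用双线性打分代替真实的 GNN 编码器,结点嵌入假设已由上游(多边类型的程序图编码)得到,均为示意性假设:

```python
import torch
import torch.nn as nn

class LinkScorer(nn.Module):
    """链接预测打分器:给 (未解析调用点, 候选函数定义) 结点对打分并排序。"""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def rank_candidates(self, call_site_emb, candidate_embs):
        # call_site_emb: (dim,);candidate_embs: (n_candidates, dim)
        scores = self.bilinear(
            call_site_emb.expand_as(candidate_embs), candidate_embs
        ).squeeze(-1)
        return torch.argsort(scores, descending=True)

scorer = LinkScorer(64)
ranking = scorer.rank_candidates(torch.randn(64), torch.randn(100, 64))
print(ranking[:5])  # top-5 候选,对应论文中的 top-1/top-5 命中评估
```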
zh
[AI-54] The Impact of Medication Non-adherence on Adverse Outcomes: Evidence from Schizophrenia Patients via Survival Analysis
【速读】:该论文试图解决精神分裂症患者抗精神病药物非依从性与不良预后之间关联的量化问题,其核心是通过生存分析框架评估非依从性对早期死亡、强制住院和逮捕等不良事件发生时间的影响。解决方案的关键在于将标准因果推断方法(如T-learner、S-learner和最近邻匹配)扩展至生存模型,以估计个体和平均处理效应,其中处理对应于药物非依从性,并结合不同时间跨度的纵向数据进行分析。研究还通过消融实验验证了县提供的风险评分在调整关键混杂因素中的作用,进一步支持了结果的稳健性。
链接: https://arxiv.org/abs/2506.18187
作者: Shahriar Noroozizadeh,Pim Welle,Jeremy C. Weiss,George H. Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Conference on Health, Inference, and Learning (CHIL 2025)
Abstract:This study quantifies the association between non-adherence to antipsychotic medications and adverse outcomes in individuals with schizophrenia. We frame the problem using survival analysis, focusing on the time to the earliest of several adverse events (early death, involuntary hospitalization, jail booking). We extend standard causal inference methods (T-learner, S-learner, nearest neighbor matching) to utilize various survival models to estimate individual and average treatment effects, where treatment corresponds to medication non-adherence. Analyses are repeated using different amounts of longitudinal information (3, 6, 9, and 12 months). Using data from Allegheny County in western Pennsylvania, we find strong evidence that non-adherence advances adverse outcomes by approximately 1 to 4 months. Ablation studies confirm that county-provided risk scores adjust for key confounders, as their removal amplifies the estimated effects. Subgroup analyses by medication formulation (injectable vs. oral) and medication type consistently show that non-adherence is associated with earlier adverse events. These findings highlight the clinical importance of adherence in delaying psychiatric crises and show that integrating survival analysis with causal inference tools can yield policy-relevant insights. We caution that although we apply causal inference, we only make associative claims and discuss assumptions needed for causal interpretation.
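下面给出把 T-learner 与生存模型结合的最小示意(假设安装了 lifelines;df 为 pandas DataFrame,列名 non_adherent、months_to_event、event 均为示意性假设,其余列视为协变量,含县提供的风险评分等混杂因素):

```python
from lifelines import CoxPHFitter

def t_learner_survival(df, treatment_col="non_adherent",
                       duration_col="months_to_event", event_col="event"):
    """T-learner:对处理组(不依从)与对照组分别拟合 Cox 生存模型,
    再用预测中位生存时间之差粗略估计个体处理效应(ITE)。"""
    treated = df[df[treatment_col] == 1].drop(columns=[treatment_col])
    control = df[df[treatment_col] == 0].drop(columns=[treatment_col])

    m1, m0 = CoxPHFitter(), CoxPHFitter()
    m1.fit(treated, duration_col=duration_col, event_col=event_col)
    m0.fit(control, duration_col=duration_col, event_col=event_col)

    X = df.drop(columns=[treatment_col, duration_col, event_col])
    ite = m1.predict_median(X) - m0.predict_median(X)
    return ite, ite.mean()  # 负值表示不依从使不良事件提前发生
```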
zh
[AI-55] Understanding Reasoning in Thinking Language Models via Steering Vectors
【速读】:该论文试图解决如何控制生成式语言模型(Generative Language Models)在推理过程中的行为问题,特别是在思考型语言模型(Thinking Language Models)中实现对内部推理链的可控性。其解决方案的关键在于通过分析和操纵深度学习模型(DeepSeek-R1-Distill)中的特定推理行为,发现这些行为由模型激活空间中的线性方向所调控,并利用转向向量(steering vectors)对模型的推理过程进行调制,从而实现对模型推理行为的可控且可解释的引导。
链接: https://arxiv.org/abs/2506.18167
作者: Constantin Venhoff,Iván Arcuschin,Philip Torr,Arthur Conmy,Neel Nanda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model’s activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model’s reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using two DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.
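下面给出转向向量方法的最小 PyTorch 草图:向量取"表现/不表现某推理行为"两组激活的均值差,再通过前向钩子加回该层输出;层的选择、系数 alpha 等均为示意性假设:

```python
import torch

def extract_steering_vector(acts_with, acts_without):
    """转向向量 = 表现某推理行为(如回溯)与不表现时,
    某层残差流激活的均值差。输入形状均为 (n_samples, hidden_dim)。"""
    return acts_with.mean(dim=0) - acts_without.mean(dim=0)

def add_steering_hook(layer, vec, alpha=4.0):
    """注册前向钩子,在该层输出上加 alpha * vec,以放大
    (或取负的 alpha 以抑制)对应的推理行为。"""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vec.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# 用法示意:handle = add_steering_hook(某个 transformer 块, vec)
# 推理完成后 handle.remove() 撤销干预
```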
zh
[AI-56] Non-equilibrium Annealed Adjoint Sampler
【速读】:该论文旨在解决基于学习的扩散采样器在实际应用中因依赖重要性采样而导致的高方差和可扩展性受限的问题。其解决方案的关键在于提出一种新的基于随机最优控制(Stochastic Optimal Control, SOC)的扩散采样器——非平衡退火伴随采样器(Non-equilibrium Annealed Adjoint Sampler, NAAS),该方法通过利用退火参考动力学而不依赖重要性采样,结合受伴随匹配启发的轻量级伴随系统,实现了高效且可扩展的训练过程。
链接: https://arxiv.org/abs/2506.18165
作者: Jaemoo Choi,Yongxin Chen,Molei Tao,Guan-Horng Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures
Abstract:Recently, there has been significant progress in learning-based diffusion samplers, which aim to sample from a given unnormalized density. These methods typically follow one of two paradigms: (i) formulating sampling as an unbiased stochastic optimal control (SOC) problem using a canonical reference process, or (ii) refining annealed path measures through importance-weighted sampling. Although annealing approaches have advantages in guiding samples toward high-density regions, reliance on importance sampling leads to high variance and limited scalability in practice. In this paper, we introduce the Non-equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based diffusion sampler that leverages annealed reference dynamics without resorting to importance sampling. NAAS employs a lean adjoint system inspired by adjoint matching, enabling efficient and scalable training. We demonstrate the effectiveness of our approach across a range of tasks, including sampling from classical energy landscapes and molecular Boltzmann distribution.
zh
[AI-57] AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)是否展现出类似人类的心理认知模式这一问题,具体通过Thematic Apperception Test (TAT)、Framing Bias、Moral Foundations Theory (MFT)和Cognitive Dissonance四个心理学框架进行验证。其解决方案的关键在于利用结构化提示和自动化评分方法,对多种专有及开源模型进行评估,从而揭示模型在叙事连贯性、框架偏差、道德判断及自我矛盾等方面的行为特征,并分析这些行为与训练数据和对齐方法之间的关系。
链接: https://arxiv.org/abs/2506.18156
作者: Akash Kundu,Rishika Goswami
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluated several proprietary and open-source models using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.
zh
[AI-58] CoachGPT: A Scaffolding-based Academic Writing Assistant SIGIR2025
【速读】:该论文试图解决学术写作技能对学生的必要性与缺乏有效指导之间的矛盾,尤其是在第二语言环境中,传统方法如向教师求助或查阅词典存在可及性不足的问题。其解决方案的关键在于开发CoachGPT,这是一个基于AI代理的网络应用,能够通过经验丰富的教育者提供的指令,将其转化为子任务,并利用大型语言模型(Large Language Models, LLMs)提供实时反馈和建议,从而实现个性化的写作指导与沉浸式学习体验。
链接: https://arxiv.org/abs/2506.18149
作者: Fumian Chen,Sotheara Veng,Joshua Wilson,Xiaoming Li,Hui Fang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: SIGIR 2025 DEMO Pre-print
Abstract:Academic writing skills are crucial for students’ success, but can feel overwhelming without proper guidance and practice, particularly when writing in a second language. Traditionally, students ask instructors or search dictionaries, which are not universally accessible. Early writing assistants emerged as rule-based systems that focused on detecting misspellings, subject-verb disagreements, and basic punctuation errors; however, they are inaccurate and lack contextual understanding. Machine learning-based assistants demonstrate a strong ability for language understanding but are expensive to train. Large language models (LLMs) have shown remarkable capabilities in generating responses in natural languages based on given prompts. Still, they have a fundamental limitation in education: they generate essays without teaching, which can have detrimental effects on learning when misused. To address this limitation, we develop CoachGPT, which leverages large language models (LLMs) to assist individuals with limited educational resources and those who prefer self-paced learning in academic writing. CoachGPT is an AI agent-based web application that (1) takes instructions from experienced educators, (2) converts instructions into sub-tasks, and (3) provides real-time feedback and suggestions using large language models. This unique scaffolding structure makes CoachGPT unique among existing writing assistants. Compared to existing writing assistants, CoachGPT provides a more immersive writing experience with personalized feedback and guidance. Our user studies prove the usefulness of CoachGPT and the potential of large language models for academic writing.
zh
[AI-59] Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection
【速读】:该论文旨在解决如何高效扩展线性状态空间模型(Linear State Space Models, SSMs)的表达能力,特别是在引入混合专家(Mixture of Experts, MoE)架构时所面临的挑战。传统方法在将MoE直接集成到SSMs中时往往效果不佳或导致性能下降。论文提出的解决方案是Routing Mamba(RoM),其关键在于通过稀疏的线性投影专家混合来扩展SSMs参数,并在不同投影层和轻量子模块之间共享路由决策,从而实现Mamba层的有效且高效的稀疏扩展。
链接: https://arxiv.org/abs/2506.18145
作者: Zheng Zhan,Liliang Ren,Shuohang Wang,Liyuan Liu,Yang Liu,Yeyun Gong,Yanzhi Wang,Yelong Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Linear State Space Models (SSMs) offer remarkable performance gains in efficient sequence modeling, with constant inference-time computation and memory complexity. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations, positioning them as strong alternatives to Transformers for long sequence modeling. However, efficiently scaling the expressive power of SSMs, particularly with Mixture of Experts (MoE), remains challenging, as naive integration attempts often falter or degrade performance. In this work, we introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts. By sharing routing decisions between projection layers and lightweight sub-modules within Mamba across experts, RoM leverages synergies among linear projection experts for effective and efficient sparse scaling of Mamba layers. At a scale of 1.3B active parameters (10B total) and 16K training sequence length, RoM achieves language modeling performance equivalent to a dense Mamba model requiring over 2.3x more active parameters, and demonstrates consistent perplexity across context lengths. Experimental results further show RoM effectively scales hybrid language models, yielding a 23% FLOPS saving compared to dense Mamba scaling for similar performance.
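下面给出"稀疏线性投影专家混合"的最小 PyTorch 示意。RoM 的一个要点是多个投影层共享同一路由决策,这里用外部传入的 routing_logits 表达这种共享;结构细节为示意性假设,并非论文实现:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedProjection(nn.Module):
    """稀疏线性投影专家混合:每个 token 只激活 top-k 个投影专家。

    routing_logits 由外部传入,可被多个 RoutedProjection 实例复用,
    即"多个投影层共享路由决策"。
    """
    def __init__(self, d_in, d_out, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.k = k

    def forward(self, x, routing_logits):
        # x: (tokens, d_in);routing_logits: (tokens, n_experts)
        weights, idx = torch.topk(F.softmax(routing_logits, dim=-1), self.k, dim=-1)
        out = x.new_zeros(x.size(0), self.experts[0].out_features)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # 只对路由到该专家的 token 计算投影
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```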
zh
[AI-60] AI Harmonizer: Expanding Vocal Expression with a Generative Neurosymbolic Music AI System
【速读】:该论文试图解决传统和声生成工具需要用户手动指定调性或通过外部键盘选择音高所带来的音乐专业知识要求过高的问题。解决方案的关键在于引入一种基于生成式 AI (Generative AI) 的方法,能够自主生成音乐上连贯的四部和声,无需用户提前提供和声信息。该系统结合了先进的音高检测与语音建模技术以及定制训练的符号音乐模型,将任何人声旋律转化为丰富的合唱纹理。
链接: https://arxiv.org/abs/2506.18143
作者: Lancelot Blanchard,Cameron Holt,Joseph A. Paradiso
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 4 pages, 3 figures
Abstract:Vocal harmonizers are powerful tools to help solo vocalists enrich their melodies with harmonically supportive voices. These tools exist in various forms, from commercially available pedals and software to custom-built systems, each employing different methods to generate harmonies. Traditional harmonizers often require users to manually specify a key or tonal center, while others allow pitch selection via an external keyboard - both approaches demanding some degree of musical expertise. The AI Harmonizer introduces a novel approach by autonomously generating musically coherent four-part harmonies without requiring prior harmonic input from the user. By integrating state-of-the-art generative AI techniques for pitch detection and voice modeling with custom-trained symbolic music models, our system arranges any vocal melody into rich choral textures. In this paper, we present our methods, explore potential applications in performance and composition, and discuss future directions for real-time implementations. While our system currently operates offline, we believe it represents a significant step toward AI-assisted vocal performance and expressive musical augmentation. We release our implementation on GitHub.
zh
[AI-61] Decentralized Consensus Inference-based Hierarchical Reinforcement Learning for Multi-Constrained UAV Pursuit-Evasion Game
【速读】:该论文旨在解决多旋翼无人机(UAV)系统在通信受限条件下进行协同规避与编队覆盖(CEFC)任务的挑战性问题,该任务属于多约束追逃博弈(MC-PEG)中的核心难题。其关键在于通过一种新颖的两层框架——基于共识推理的分层强化学习(CI-HRL),将目标定位分配给高层策略,而底层策略则负责避障、导航和编队控制。高层策略引入了面向共识的多智能体通信(ConsMAC)模块,以实现全局信息感知和从局部状态中建立共识;底层策略则结合了基于交替训练的多智能体近端策略优化(AT-M)和策略蒸馏技术,从而提升群体的协同规避与任务完成能力。
链接: https://arxiv.org/abs/2506.18126
作者: Xiang Yuming,Li Sizhao,Li Rongpeng,Zhao Zhifeng,Zhang Honggang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multiple quadrotor unmanned aerial vehicle (UAV) systems have garnered widespread research interest and fostered numerous interesting applications, especially in multi-constrained pursuit-evasion games (MC-PEG). The Cooperative Evasion and Formation Coverage (CEFC) task, where the UAV swarm aims to maximize formation coverage across multiple target zones while collaboratively evading predators, is one of the most challenging issues in MC-PEG, especially under communication-limited constraints. This multifaceted problem, which intertwines responses to obstacles, adversaries, target zones, and formation dynamics, brings up significant high-dimensional complications in locating a solution. In this paper, we propose a novel two-level framework (i.e., Consensus Inference-based Hierarchical Reinforcement Learning (CI-HRL)), which delegates target localization to a high-level policy, while adopting a low-level policy to manage obstacle avoidance, navigation, and formation. Specifically, in the high-level policy, we develop a novel multi-agent reinforcement learning module, Consensus-oriented Multi-Agent Communication (ConsMAC), to enable agents to perceive global information and establish consensus from local states by effectively aggregating neighbor messages. Meanwhile, we leverage an Alternative Training-based Multi-agent proximal policy optimization (AT-M) and policy distillation to accomplish the low-level control. The experimental results, including the high-fidelity software-in-the-loop (SITL) simulations, validate that CI-HRL provides a superior solution with enhanced swarm’s collaborative evasion and task completion capabilities.
zh
[AI-62] Conceptualization, Operationalization, and Measurement of Machine Companionship: A Scoping Review
【速读】:该论文试图解决机器伴侣(Machine Companionship, MC)作为正式概念或可测量变量缺乏系统性研究的问题。其解决方案的关键在于通过PRISMA指南引导的范围综述,系统地采样、调查和综合当前关于MC的学术文献(共71篇,时间范围为2017-2025年),并基于指导理论、先验指定属性维度(主观积极、持续时间长、协同作用、自足性)以及测量概念(超过50个不同的测量变量)对MC进行分析,最终提出一个以自足性、协调连接、随时间展开且主观积极为特征的文献引导型定义。
链接: https://arxiv.org/abs/2506.18119
作者: Jaime Banks,Zhixin Li
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:The notion of machine companions has long been embedded in social-technological imaginaries. Recent advances in AI have moved those media musings into believable sociality manifested in interfaces, robotic bodies, and devices. Those machines are often referred to colloquially as “companions” yet there is little careful engagement of machine companionship (MC) as a formal concept or measured variable. This PRISMA-guided scoping review systematically samples, surveys, and synthesizes current scholarly works on MC (N = 71; 2017-2025), to that end. Works varied widely in considerations of MC according to guiding theories, dimensions of a-priori specified properties (subjectively positive, sustained over time, co-active, autotelic), and in measured concepts (with more than 50 distinct measured variables). We ultimately offer a literature-guided definition of MC as an autotelic, coordinated connection between human and machine that unfolds over time and is subjectively positive.
zh
[AI-63] RL for Reasoning by Adaptively Revealing Rationales
【速读】:该论文试图解决复杂序列生成任务中监督微调(SFT)和强化学习(RL)各自存在的局限性,即SFT依赖密集的地面真实标签导致成本过高,而RL在稀疏奖励和组合爆炸的输出空间中表现不佳。解决方案的关键是引入自适应回溯(AdaBack),这是一种基于样本的课程学习算法,通过在训练过程中仅揭示目标输出的部分前缀,并根据模型过去的奖励信号动态调整监督长度,使模型能够逐步学习完成推理链。
链接: https://arxiv.org/abs/2506.18110
作者: Mohammad Hossein Amani,Aryo Lotfi,Nicolas Mario Baldwin,Samy Bengio,Mehrdad Farajtabar,Emmanuel Abbe,Robert West
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures
Abstract:We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model’s past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality, it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
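下面给出 AdaBack"按样本自适应调整已揭示前缀长度"的最小示意;具体的更新规则(步长、阈值)为示意性假设,论文基于奖励信号的实际调整策略可能不同:

```python
def adjust_prefix_len(prefix_len, reward, target_len, step=1, threshold=0.5):
    """AdaBack 式自适应回溯:按样本维护"已揭示的答案前缀长度"。

    模型在当前前缀下能拿到奖励就少揭示一点,失败则多揭示一点;
    step 与 threshold 的取法为示意性假设。
    """
    if reward >= threshold:
        return max(0, prefix_len - step)       # 逐步撤掉监督
    return min(target_len, prefix_len + step)  # 退回更多正确前缀

# 训练循环中的用法示意(伪流程):
# prompt = question + target_answer[:prefix_len]   # 条件于部分正确解
# reward = rollout_and_score(prompt)               # 由 RL 环境给出
# prefix_len = adjust_prefix_len(prefix_len, reward, len(target_answer))
```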
zh
[AI-64] Deep Research Agents: A Systematic Examination And Roadmap
【速读】:该论文旨在探讨深度研究(Deep Research, DR)代理所面临的技术挑战与解决方案,重点分析其核心架构与关键技术。论文试图解决如何构建能够执行复杂、多轮信息研究任务的自主AI系统的问题,其关键在于整合动态推理、自适应长周期规划、多跳信息检索、迭代工具使用以及结构化分析报告生成等技术。通过提出分类框架和评估现有基准,论文为DR代理的进一步发展提供了理论基础和技术方向。
链接: https://arxiv.org/abs/2506.18096
作者: Yuxuan Huang,Yihang Chen,Haozheng Zhang,Kang Li,Meng Fang,Linyi Yang,Xiaoguang Li,Lifeng Shang,Songcen Xu,Jianye Hao,Kun Shao,Jun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: this https URL.
zh
[AI-65] Federated Learning-Based Data Collaboration Method for Enhancing Edge Cloud AI System Security Using Large Language Models
【速读】:该论文旨在解决在AI驱动的应用中,如何在保持高效性能的同时确保数据隐私的紧迫安全问题。其解决方案的关键在于提出一种基于联邦学习(Federated Learning)的数据协作方法,并引入大规模语言模型(LLMs)以增强数据隐私保护和系统鲁棒性。该方法在现有联邦学习框架基础上,结合安全多方计算协议,利用LLM优化分布式节点间的数据聚合与加密过程,同时结合先进的对抗训练技术,提升边缘云AI系统对数据泄露和模型污染等安全威胁的抵抗能力。
链接: https://arxiv.org/abs/2506.18087
作者: Huaiying Luo,Cheng Ji
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by the 2025 5th International Symposium on Computer Technology and Information Science (ISCTIS 2025)
Abstract:With the widespread application of edge computing and cloud systems in AI-driven applications, how to maintain efficient performance while ensuring data privacy has become an urgent security issue. This paper proposes a federated learning-based data collaboration method to improve the security of edge cloud AI systems, and use large-scale language models (LLMs) to enhance data privacy protection and system robustness. Based on the existing federated learning framework, this method introduces a secure multi-party computation protocol, which optimizes the data aggregation and encryption process between distributed nodes by using LLM to ensure data privacy and improve system efficiency. By combining advanced adversarial training techniques, the model enhances the resistance of edge cloud AI systems to security threats such as data leakage and model poisoning. Experimental results show that the proposed method is 15% better than the traditional federated learning method in terms of data protection and model robustness.
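作为背景参考,下面给出安全聚合中经典"成对掩码"思想的最小 NumPy 示意:结点两两共享由种子生成的随机掩码,掩码在求和时相消,服务器只见掩码后的更新。论文实际采用的是由 LLM 优化的安全多方计算协议,此处仅为传统方案的草图:

```python
import numpy as np

def secure_aggregate(updates, pair_seeds):
    """成对掩码安全聚合:对 i<j 的结点对,i 加掩码、j 减同一掩码,
    求和时掩码相消,均值仍然正确。

    updates: {node_id: 参数向量};pair_seeds: {(i, j): 共享种子}。
    """
    ids = sorted(updates)
    masked = {}
    for i in ids:
        m = updates[i].astype(float).copy()
        for j in ids:
            if i == j:
                continue
            seed = pair_seeds[tuple(sorted((i, j)))]
            mask = np.random.default_rng(seed).normal(size=m.shape)
            m += mask if i < j else -mask
        masked[i] = m
    return sum(masked.values()) / len(ids)

print(secure_aggregate({0: np.ones(4), 1: 3 * np.ones(4)}, {(0, 1): 42}))
# -> [2. 2. 2. 2.],与真实 FedAvg 均值一致
```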
zh
[AI-66] Distributionally robust minimization in meta-learning for system identification
【速读】:该论文旨在解决元学习(meta learning)中因忽略任务变异性而导致的模型在新场景下适应能力不足的问题,特别是在系统识别领域。其解决方案的关键在于采用分布鲁棒优化(distributionally robust optimization)方法,通过优先考虑高损失任务来提升模型在最坏情况下的性能,从而增强模型在安全关键应用中的可靠性。
链接: https://arxiv.org/abs/2506.18074
作者: Matteo Rufolo,Dario Piga,Marco Forgione
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Meta learning aims at learning how to solve tasks, and thus it allows to estimate models that can be quickly adapted to new scenarios. This work explores distributionally robust minimization in meta learning for system identification. Standard meta learning approaches optimize the expected loss, overlooking task variability. We use an alternative approach, adopting a distributionally robust optimization paradigm that prioritizes high-loss tasks, enhancing performance in worst-case scenarios. Evaluated on a meta model trained on a class of synthetic dynamical systems and tested in both in-distribution and out-of-distribution settings, the proposed approach allows to reduce failures in safety-critical applications.
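下面用 CVaR(只平均最差 alpha 比例任务的损失)给出"优先考虑高损失任务"这一分布鲁棒思想的一个常见实例化的最小示意;论文的具体目标函数可能不同:

```python
import torch

def cvar_task_loss(task_losses, alpha=0.3):
    """分布鲁棒目标:只平均最差的 alpha 比例任务的损失(CVaR),
    而不是全部任务的期望损失,从而优先优化最坏情况。"""
    n_worst = max(1, int(alpha * task_losses.numel()))
    worst, _ = torch.topk(task_losses, n_worst)
    return worst.mean()

losses = torch.tensor([0.2, 1.5, 0.4, 2.1, 0.3])
print(cvar_task_loss(losses))  # 元模型更新只由高损失任务驱动
```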
zh
[AI-67] Weighted Assumption Based Argumentation to reason about ethical principles and actions
【速读】:该论文试图解决传统Assumption Based Argumentation (ABA)在处理复杂伦理推理场景时缺乏对论点强度和攻击力度量化表达的问题。其解决方案的关键在于引入加权论证机制,即为每个论点分配权重,并据此推导出ABA论点之间攻击的权重,从而实现对论证结构更精细的建模与分析。
链接: https://arxiv.org/abs/2506.18056
作者: Paolo Baldi,Fabio Aurelio D’Asaro,Abeer Dyoub,Francesca Alessandra Lisi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We augment Assumption Based Argumentation (ABA for short) with weighted argumentation. In a nutshell, we assign weights to arguments and then derive the weight of attacks between ABA arguments. We illustrate our proposal through running examples in the field of ethical reasoning, and present an implementation based on Answer Set Programming.
zh
[AI-68] Mechanistic Interpretability in the Presence of Architectural Obfuscation
【速读】:该论文试图解决在隐私保护的大语言模型(Large-Language-Model, LLM)推理中,使用架构混淆(architectural obfuscation)技术对模型内部表示进行扰动后,其对机制可解释性(mechanistic interpretability)的影响问题。解决方案的关键在于通过分析一个经过代表性混淆映射训练的GPT-2-small模型,评估混淆对注意力头激活模式、因果路径和逻辑归因的影响,从而揭示混淆是否真正阻碍了对模型工作原理的理解,还是仅改变了表示的坐标系。研究发现,混淆显著改变了注意力头内的激活模式,但保留了层间计算图,导致用户提示的逆向工程受阻,而前馈和残差路径仍保持功能完整性,证明混淆在不影响整体任务性能的前提下,降低了细粒度的可解释性。
链接: https://arxiv.org/abs/2506.18053
作者: Marcos Florencio,Thomas Barton
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Architectural obfuscation - e.g., permuting hidden-state tensors, linearly transforming embedding tables, or remapping tokens - has recently gained traction as a lightweight substitute for heavyweight cryptography in privacy-preserving large-language-model (LLM) inference. While recent work has shown that these techniques can be broken under dedicated reconstruction attacks, their impact on mechanistic interpretability has not been systematically studied. In particular, it remains unclear whether scrambling a network’s internal representations truly thwarts efforts to understand how the model works, or simply relocates the same circuits to an unfamiliar coordinate system. We address this gap by analyzing a GPT-2-small model trained from scratch with a representative obfuscation map. Assuming the obfuscation map is private and the original basis is hidden (mirroring an honest-but-curious server), we apply logit-lens attribution, causal path-patching, and attention-head ablation to locate and manipulate known circuits. Our findings reveal that obfuscation dramatically alters activation patterns within attention heads yet preserves the layer-wise computational graph. This disconnect hampers reverse-engineering of user prompts: causal traces lose their alignment with baseline semantics, and token-level logit attributions become too noisy to reconstruct. At the same time, feed-forward and residual pathways remain functionally intact, suggesting that obfuscation degrades fine-grained interpretability without compromising top-level task performance. These results establish quantitative evidence that architectural obfuscation can simultaneously (i) retain global model behaviour and (ii) impede mechanistic analyses of user-specific content. By mapping where interpretability breaks down, our study provides guidance for future privacy defences and for robustness-aware interpretability tooling.
zh
[AI-69] Action Language BC+
【速读】:该论文试图解决传统动作语言与现代答案集编程(Answer Set Programming, ASP)语言之间存在的表达能力差距问题。解决方案的关键在于提出一种新的动作语言BC+,其语义基于命题公式的广义稳定模型语义,使得现代ASP中的多种有用构造(如选择规则、聚集和抽象约束原子)可以被视为命题公式的简写形式,从而实现动作语言与ASP语言之间的有效融合与表达能力的提升。
链接: https://arxiv.org/abs/2506.18044
作者: Joseph Babb,Joohyung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Journal of Logic and Computation, 2015
Abstract:Action languages are formal models of parts of natural language that are designed to describe effects of actions. Many of these languages can be viewed as high level notations of answer set programs structured to represent transition systems. However, the form of answer set programs considered in the earlier work is quite limited in comparison with the modern Answer Set Programming (ASP) language, which allows several useful constructs for knowledge representation, such as choice rules, aggregates, and abstract constraint atoms. We propose a new action language called BC+, which closes the gap between action languages and the modern ASP language. The main idea is to define the semantics of BC+ in terms of general stable model semantics for propositional formulas, under which many modern ASP language constructs can be identified with shorthands for propositional formulas. Language BC+ turns out to be sufficiently expressive to encompass the best features of other action languages, such as languages B, C, C+, and BC. Computational methods available in ASP solvers are readily applicable to compute BC+, which led to an implementation of the language by extending system cplus2asp.
zh
[AI-70] Pathwise Explanation of ReLU Neural Networks
【速读】:该论文试图解决神经网络的“黑箱”特性所带来的透明性和可靠性问题。其解决方案的关键在于引入一种新方法,该方法关注决策路径中涉及的隐藏单元子集,而非全部隐藏单元的激活状态,从而提供更清晰和一致的输入与决策过程之间的关系理解。这种方法在解释范围上具有灵活性,能够从整体输入归因到输入中的特定组件,并允许对给定输入的解释进行分解以获得更详细的分析。
链接: https://arxiv.org/abs/2506.18037
作者: Seongwoo Lim,Won Jo,Joohyung Lee,Jaesik Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:4645-4653, 2024
Abstract:Neural networks have demonstrated a wide range of successes, but their “black box” nature raises concerns about transparency and reliability. Previous research on ReLU networks has sought to unwrap these networks into linear models based on activation states of all hidden units. In this paper, we introduce a novel approach that considers subsets of the hidden units involved in the decision making path. This pathwise explanation provides a clearer and more consistent understanding of the relationship between the input and the decision-making process. Our method also offers flexibility in adjusting the range of explanations within the input, i.e., from an overall attribution input to particular components within the input. Furthermore, it allows for the decomposition of explanations for a given input for more detailed explanations. Experiments demonstrate that our method outperforms others both quantitatively and qualitatively.
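下面给出"ReLU 网络在给定输入的激活模式下等价于局部线性模型"的最小 NumPy 示意;keep_masks 参数示意如何把解释限制到隐藏单元子集(即路径式解释),掩码的选取方式此处不展开,均为示意性假设:

```python
import numpy as np

def pathwise_linear_model(Ws, bs, x, keep_masks=None):
    """对给定输入,ReLU 网络在其激活模式下等价于线性模型 y = A @ x + c。

    keep_masks(可选)用来只保留决策路径上的隐藏单元子集。
    """
    A, c, h = np.eye(len(x)), np.zeros(len(x)), x
    for i, (W, b) in enumerate(zip(Ws[:-1], bs[:-1])):
        z = W @ h + b
        active = (z > 0).astype(float)
        if keep_masks is not None:
            active = active * keep_masks[i]  # 只保留子集内的单元
        D = np.diag(active)
        A, c = D @ W @ A, D @ (W @ c + b)
        h = z * active  # 等价于 relu(z) 再乘掩码
    return Ws[-1] @ A, Ws[-1] @ c + bs[-1]

# 验证:展开的线性模型与网络前向输出一致
Ws = [np.random.randn(5, 3), np.random.randn(2, 5)]
bs = [np.random.randn(5), np.random.randn(2)]
x = np.random.randn(3)
A, c = pathwise_linear_model(Ws, bs, x)
print(np.allclose(A @ x + c, Ws[1] @ np.maximum(Ws[0] @ x + bs[0], 0) + bs[1]))
```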
zh
[AI-71] Graphs Meet AI Agents: Taxonomy, Progress and Future Opportunities
【速读】:该论文试图解决如何提升AI代理在复杂现实任务中的规划、执行、记忆保持及多代理协作能力的问题,其核心挑战在于处理日益复杂的环境信息、操作和交互。解决方案的关键在于通过数据结构化,特别是利用图(Graph)这一数据范式,将复杂且无序的数据转化为结构化形式,从而增强AI代理对数据的理解与处理能力。论文系统性地综述了图技术如何赋能AI代理,探索其与核心代理功能的融合,并指出未来研究的方向。
链接: https://arxiv.org/abs/2506.18019
作者: Yuanchen Bei,Weizhi Zhang,Siwen Wang,Weizhi Chen,Sheng Zhou,Hao Chen,Yong Li,Jiajun Bu,Shirui Pan,Yizhou Yu,Irwin King,Fakhri Karray,Philip S. Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures
Abstract:AI agents have experienced a paradigm shift, from early dominance by reinforcement learning (RL) to the rise of agents powered by large language models (LLMs), and now further advancing towards a synergistic fusion of RL and LLM capabilities. This progression has endowed AI agents with increasingly strong abilities. Despite these advances, to accomplish complex real-world tasks, agents are required to plan and execute effectively, maintain reliable memory, and coordinate smoothly with other agents. Achieving these capabilities involves contending with ever-present intricate information, operations, and interactions. In light of this challenge, data structurization can play a promising role by transforming intricate and disorganized data into well-structured forms that agents can more effectively understand and process. In this context, graphs, with their natural advantage in organizing, managing, and harnessing intricate data relationships, present a powerful data paradigm for structurization to support the capabilities demanded by advanced AI agents. To this end, this survey presents a first systematic review of how graphs can empower AI agents. Specifically, we explore the integration of graph techniques with core agent functionalities, highlight notable applications, and identify prospective avenues for future research. By comprehensively surveying this burgeoning intersection, we hope to inspire the development of next-generation AI agents equipped to tackle increasingly sophisticated challenges with graphs. Related resources are collected and continuously updated for the community in the Github link.
zh
[AI-72] ADA-DPM: A Neural Descriptors-based Adaptive Noise Point Filtering Strategy for SLAM
【速读】:该论文旨在解决LiDAR SLAM在动态物体干扰、点云噪声和非结构化环境下的定位精度与系统鲁棒性之间的权衡问题。其解决方案的关键在于提出一种自适应噪声过滤的SLAM策略——ADA-DPM,通过设计动态分割头以识别并剔除动态特征点、全局重要性评分头以自适应选择高贡献特征点并抑制噪声干扰,以及构建跨层内图卷积模块(GLI-GCN)以融合多尺度邻域结构,从而提升重叠特征的区分能力。
链接: https://arxiv.org/abs/2506.18016
作者: Yongxin Shao,Binrui Wang,Aihong Tan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:LiDAR SLAM has demonstrated significant application value in various fields, including mobile robot navigation and high-precision map construction. However, existing methods often need to make a trade-off between positioning accuracy and system robustness when faced with dynamic object interference, point cloud noise, and unstructured environments. To address this challenge, we propose an adaptive noise filtering SLAM strategy, ADA-DPM, achieving excellent performance in both aspects. We design the Dynamic Segmentation Head to predict which feature points belong to dynamic objects, so as to eliminate dynamic feature points; design the Global Importance Scoring Head to adaptively select feature points with higher contribution while suppressing noise interference; and construct the Cross Layer Intra-Graph Convolution Module (GLI-GCN) to fuse multi-scale neighborhood structures, thereby enhancing the discriminative ability of overlapping features. Finally, to further validate the effectiveness of our method, we tested it on several publicly available datasets and achieved outstanding results.
zh
[AI-73] Probing the Embedding Space of Transformers via Minimal Token Perturbations IJCAI2025
【速读】:该论文试图解决Transformer模型中信息传播机制的可解释性问题,具体关注最小token扰动对嵌入空间的影响。其解决方案的关键在于结合token扰动与嵌入空间中的位移分析,通过实验揭示稀有token通常导致更大的嵌入空间位移,并表明输入信息在深层网络中逐渐混合,从而验证了早期层可作为模型解释的代理这一假设。
链接: https://arxiv.org/abs/2506.18011
作者: Eddie Conti,Alejandro Astruc,Alvaro Parafita,Axel Brando
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: IJCAI 2025 Workshop on Explainable Artificial Intelligence
Abstract:Understanding how information propagates through Transformer models is a key challenge for interpretability. In this work, we study the effects of minimal token perturbations on the embedding space. In our experiments, we analyze how frequently different tokens yield minimal shifts, highlighting that rare tokens usually lead to larger shifts. Moreover, we study how perturbations propagate across layers, demonstrating that input information is increasingly intermixed in deeper layers. Our findings validate the common assumption that the first layers of a model can be used as proxies for model explanations. Overall, this work introduces the combination of token perturbations and shifts on the embedding space as a powerful tool for model interpretability.
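下面给出"最小 token 扰动 + 逐层嵌入位移"分析的最小草图(假设安装了 transformers,用 GPT-2 演示;要求两段文本分词后长度一致,例中 cat/dog 均为单个 token):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

@torch.no_grad()
def layerwise_shift(text_a, text_b):
    """把一个 token 扰动为另一个后,逐层计算隐藏表示的平均 L2 位移。

    两段文本必须只差一个 token 且分词长度一致,否则无法逐位置对齐。
    """
    ha = model(**tok(text_a, return_tensors="pt")).hidden_states
    hb = model(**tok(text_b, return_tensors="pt")).hidden_states
    return [(a - b).norm(dim=-1).mean().item() for a, b in zip(ha, hb)]

# 位移通常随层数增大:输入信息在深层被逐渐混合
print(layerwise_shift("The cat sat on the mat", "The dog sat on the mat"))
```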
zh
[AI-74] GeNIE: A Generalizable Navigation System for In-the-Wild Environments
【速读】:该论文旨在解决在非结构化、真实环境中的可靠导航问题,特别是在多样化的地形、天气条件和传感器配置下,如何实现鲁棒且泛化的导航能力。其解决方案的关键在于提出了一种名为GeNIE(Generalizable Navigation System for In-the-Wild Environments)的导航框架,该框架集成了基于SAM2的可泛化可行驶性预测模型,以及一种新颖的路径融合策略,以增强在噪声和模糊环境中的规划稳定性。
链接: https://arxiv.org/abs/2506.17960
作者: Jiaming Wang,Diwen Liu,Jizhuo Chen,Jiaxuan Da,Nuowen Qian,Tram Minh Man,Harold Soh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures. Jiaming Wang, Diwen Liu, and Jizhuo Chen contributed equally
Abstract:Reliable navigation in unstructured, real-world environments remains a significant challenge for embodied agents, especially when operating across diverse terrains, weather conditions, and sensor configurations. In this paper, we introduce GeNIE (Generalizable Navigation System for In-the-Wild Environments), a robust navigation framework designed for global deployment. GeNIE integrates a generalizable traversability prediction model built on SAM2 with a novel path fusion strategy that enhances planning stability in noisy and ambiguous settings. We deployed GeNIE in the Earth Rover Challenge (ERC) at ICRA 2025, where it was evaluated across six countries spanning three continents. GeNIE took first place and achieved 79% of the maximum possible score, outperforming the second-best team by 17%, and completed the entire competition without a single human intervention. These results set a new benchmark for robust, generalizable outdoor robot navigation. We will release the codebase, pretrained model weights, and newly curated datasets to support future research in real-world navigation.
zh
[AI-75] medicX-KG: A Knowledge Graph for Pharmacists Drug Information Needs
【速读】:该论文旨在解决药房领域中缺乏统一国家药物资源的问题,从而减少药师对碎片化信息源的依赖。解决方案的关键在于构建一个面向药师的知识图谱(Knowledge Graph, KG)——medicX-KG,该图谱整合了英国国家处方集(British National Formulary, BNF)、DrugBank以及马耳他药品管理局(Malta Medicines Authority, MMA)的数据,结合欧洲药品管理局的合规性与部分英国供应依赖性,通过语义技术和人工智能技术揭示隐藏的关系,支持临床和监管决策。
链接: https://arxiv.org/abs/2506.17959
作者: Lizzy Farrugia,Lilian M. Azzopardi,Jeremy Debattista,Charlie Abela
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The role of pharmacists is evolving from medicine dispensing to delivering comprehensive pharmaceutical services within multidisciplinary healthcare teams. Central to this shift is access to accurate, up-to-date medicinal product information supported by robust data integration. Leveraging artificial intelligence and semantic technologies, Knowledge Graphs (KGs) uncover hidden relationships and enable data-driven decision-making. This paper presents medicX-KG, a pharmacist-oriented knowledge graph supporting clinical and regulatory decisions. It forms the semantic layer of the broader medicX platform, powering predictive and explainable pharmacy services. medicX-KG integrates data from three sources: the British National Formulary (BNF), DrugBank, and the Malta Medicines Authority (MMA), which addresses Malta’s regulatory landscape and combines European Medicines Agency alignment with partial UK supply dependence. The KG tackles the absence of a unified national drug repository, reducing pharmacists’ reliance on fragmented sources. Its design was informed by interviews with practicing pharmacists to ensure real-world applicability. We detail the KG’s construction, including data extraction, ontology design, and semantic mapping. Evaluation demonstrates that medicX-KG effectively supports queries about drug availability, interactions, adverse reactions, and therapeutic classes. Limitations, including missing detailed dosage encoding and real-time updates, are discussed alongside directions for future enhancements.
zh
[AI-76] An entropy-optimal path to humble AI
【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)模型在计算成本高昂、资源消耗大以及预测结果过度自信等问题。其解决方案的关键在于提出一种基于全概率定律的非平衡熵优化重构框架,用于改进玻尔兹曼机(Boltzmann Machines)的数学建模,从而实现无需梯度下降的高效学习框架,并具备数学上严格证明的存在性和唯一性条件以及答案置信度/可靠性评估机制。
链接: https://arxiv.org/abs/2506.17940
作者: Davide Bassetti,Lukáš Pospíšil,Michael Groom,Terence J. O’Kane,Illia Horenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 30 pages, 4 figures
Abstract:Progress of AI has led to the creation of very successful, but by no means humble models and tools, especially regarding (i) the huge and further exploding costs and resources they demand, and (ii) the over-confidence of these tools with the answers they provide. Here we introduce a novel mathematical framework for a non-equilibrium entropy-optimizing reformulation of Boltzmann machines based on the exact law of total probability. It results in the highly-performant, but much cheaper, gradient-descent-free learning framework with mathematically-justified existence and uniqueness criteria, and answer confidence/reliability measures. Comparisons to state-of-the-art AI tools in terms of performance, cost and the model descriptor lengths on a set of synthetic problems with varying complexity reveal that the proposed method results in more performant and slim models, with the descriptor lengths being very close to the intrinsic complexity scaling bounds for the underlying problems. Applying this framework to historical climate data results in models with systematically higher prediction skills for the onsets of La Niña and El Niño climate phenomena, requiring just a few years of climate data for training - a small fraction of what is necessary for contemporary climate prediction tools.
zh
[AI-77] Software Reuse in the Generative AI Era: From Cargo Cult Towards AI Native Software Engineering
【速读】:该论文试图解决人工智能辅助的生成式软件复用(Generative Software Reuse)在新兴“AI原生”软件工程背景下的潜在问题与挑战。其关键解决方案在于提出一个初步的研究议程和行动呼吁,以应对与这种新型软件复用方法相关的中心问题,包括其对传统软件复用实践的影响、开发者的信任机制以及可能引发的类似“货物崇拜开发”(Cargo Cult Development)的风险。
链接: https://arxiv.org/abs/2506.17937
作者: Tommi Mikkonen,Antero Taivalsaari
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Software development is currently under a paradigm shift in which artificial intelligence and generative software reuse are taking the center stage in software creation. Consequently, earlier software reuse practices and methods are rapidly being replaced by AI-assisted approaches in which developers place their trust on code that has been generated by artificial intelligence. This is leading to a new form of software reuse that is conceptually not all that different from cargo cult development. In this paper we discuss the implications of AI-assisted generative software reuse in the context of emerging “AI native” software engineering, bring forth relevant questions, and define a tentative research agenda and call to action for tackling some of the central issues associated with this approach.
zh
[AI-78] When concept-based XAI is imprecise: Do people distinguish between generalisations and misrepresentations?
Quick Read: This paper asks how concept-based explainable AI (C-XAI) can effectively convey an AI model's internal representations in complex tasks, and in particular whether people recognise and value generalisation in those concepts. The key to the solution is an experiment in a railway safety scenario evaluating user reactions to different C-XAI concepts, which varied in how well they matched the classified image on a highly relevant feature (relation to tracks) versus a less relevant one (actions). Users proved highly sensitive to imprecision in relevant features while rating concepts that generalised over less relevant features poorly, suggesting that people may not spontaneously recognise an AI's capacity to generalise.
Link: https://arxiv.org/abs/2506.17936
Authors: Romy Müller
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Concept-based explainable artificial intelligence (C-XAI) can help reveal the inner representations of AI models. Understanding these representations is particularly important in complex tasks like safety evaluation. Such tasks rely on high-level semantic information (e.g., about actions) to make decisions about abstract categories (e.g., whether a situation is dangerous). In this context, it may be desirable for C-XAI concepts to show some variability, suggesting that the AI is capable of generalising beyond the concrete details of a situation. However, it is unclear whether people recognise and appreciate such generalisations and can distinguish them from other, less desirable forms of imprecision. This was investigated in an experimental railway safety scenario. Participants evaluated the performance of a simulated AI that evaluated whether traffic scenes involving people were dangerous. To explain these decisions, the AI provided concepts in the form of similar image snippets. These concepts differed in their match with the classified image, either regarding a highly relevant feature (i.e., relation to tracks) or a less relevant feature (i.e., actions). Contrary to the hypotheses, concepts that generalised over less relevant features led to ratings that were lower than for precisely matching concepts and comparable to concepts that systematically misrepresented these features. Conversely, participants were highly sensitive to imprecisions in relevant features. These findings cast doubts on whether people spontaneously recognise generalisations. Accordingly, they might not be able to infer from C-XAI concepts whether AI models have gained a deeper understanding of complex situations.
zh
[AI-79] A GenAI System for Improved FAIR Independent Biological Database Integration
Quick Read: This paper addresses the difficulty life-science researchers face in efficiently and accurately identifying, accessing, and processing data sources on the ever-evolving Linked Open Data (LOD) network, especially when sources do not comply with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The key to the solution is the FAIRBridge system, which uses natural language processing to interpret query intents, maps them to relevant databases described in the scientific literature, generates executable queries via intelligent resource access plans, and provides robust tools for mitigating low-quality query processing, ensuring high fidelity and responsiveness of the delivered information.
Link: https://arxiv.org/abs/2506.17934
Authors: Syed N. Sakib,Kallol Naha,Sajratul Y. Rubaiat,Hasan M. Jamil
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Life sciences research increasingly requires identifying, accessing, and effectively processing data from an ever-evolving array of information sources on the Linked Open Data (LOD) network. This dynamic landscape places a significant burden on researchers, as the quality of query responses depends heavily on the selection and semantic integration of data sources -- processes that are often labor-intensive, error-prone, and costly. While the adoption of FAIR (Findable, Accessible, Interoperable, and Reusable) data principles has aimed to address these challenges, barriers to efficient and accurate scientific data processing persist. In this paper, we introduce FAIRBridge, an experimental natural language-based query processing system designed to empower scientists to discover, access, and query biological databases, even when they are not FAIR-compliant. FAIRBridge harnesses the capabilities of AI to interpret query intents, map them to relevant databases described in scientific literature, and generate executable queries via intelligent resource access plans. The system also includes robust tools for mitigating low-quality query processing, ensuring high fidelity and responsiveness in the information delivered. FAIRBridge's autonomous query processing framework enables users to explore alternative data sources, make informed choices at every step, and leverage community-driven crowd curation when needed. By providing a user-friendly, automated hypothesis-testing platform in natural English, FAIRBridge significantly enhances the integration and processing of scientific data, offering researchers a powerful new tool for advancing their inquiries.
zh
[AI-80] ASTER: Adaptive Spatio-Temporal Early Decision Model for Dynamic Resource Allocation
Quick Read: This paper addresses the inefficiency of turning spatio-temporal forecasts into actionable strategies: in settings such as emergency response, predicting incidents alone does not meet the needs of resource allocation and intervention. The key to the solution is the Adaptive Spatio-Temporal Early Decision model (ASTER), which reforms the forecasting paradigm so that information feeds decision support directly, maximizing overall effectiveness. Its core innovations are a Resource-aware Spatio-Temporal interaction module (RaST) that adaptively captures long- and short-term dependencies under dynamic resource conditions, and a Preference-oriented decision agent (Poda) based on multi-objective reinforcement learning that converts predictive signals into resource-efficient intervention strategies.
Link: https://arxiv.org/abs/2506.17929
Authors: Shulun Chen,Wei Shao,Flora D. Salim,Hao Xue
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ASTER: Adaptive Spatio-Temporal Early Decision Model for Dynamic Resource Allocation
Abstract:Supporting decision-making has long been a central vision in the field of spatio-temporal intelligence. While prior work has improved the timeliness and accuracy of spatio-temporal forecasting, converting these forecasts into actionable strategies remains a key challenge. A main limitation is the decoupling of the prediction and the downstream decision phases, which can significantly degrade the downstream efficiency. For example, in emergency response, the priority is successful resource allocation and intervention, not just incident prediction. To this end, it is essential to propose an Adaptive Spatio-Temporal Early Decision model (ASTER) that reforms the forecasting paradigm from event anticipation to actionable decision support. This framework ensures that information is directly used for decision-making, thereby maximizing overall effectiveness. Specifically, ASTER introduces a new Resource-aware Spatio-Temporal interaction module (RaST) that adaptively captures long- and short-term dependencies under dynamic resource conditions, producing context-aware spatiotemporal representations. To directly generate actionable decisions, we further design a Preference-oriented decision agent (Poda) based on multi-objective reinforcement learning, which transforms predictive signals into resource-efficient intervention strategies by deriving optimal actions under specific preferences and dynamic constraints. Experimental results on four benchmark datasets demonstrate the state-of-the-art performance of ASTER in improving both early prediction accuracy and resource allocation outcomes across six downstream metrics.
zh
[AI-81] Permutation Equivariant Model-based Offline Reinforcement Learning for Auto-bidding
Quick Read: This paper addresses the limitations of reinforcement learning (RL) for auto-bidding: offline RL bidding (ORLB) is constrained by the dataset's state-space coverage and offers only modest gains, while simulation-based RL bidding (SRLB) risks misleading policies through its simulator-reality gap. The key to the solution is Model-based RL Bidding (MRLB), which learns an environment model from real data to bridge this gap and trains policies on both real and model-generated data, expanding state coverage beyond ORLB. To make the model reliable, two techniques are proposed: 1) a permutation-equivariant model architecture for better generalization, and 2) a robust offline Q-learning method that pessimistically penalizes model errors, together forming the Permutation Equivariant Model-based Offline RL (PE-MORL) algorithm.
Link: https://arxiv.org/abs/2506.17919
Authors: Zhiyu Mou,Miao Xu,Wei Chen,Rongquan Bai,Chuan Yu,Jian Xu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) for auto-bidding has shifted from using simplistic offline simulators (Simulation-based RL Bidding, SRLB) to offline RL on fixed real datasets (Offline RL Bidding, ORLB). However, ORLB policies are limited by the dataset’s state space coverage, offering modest gains. While SRLB expands state coverage, its simulator-reality gap risks misleading policies. This paper introduces Model-based RL Bidding (MRLB), which learns an environment model from real data to bridge this gap. MRLB trains policies using both real and model-generated data, expanding state coverage beyond ORLB. To ensure model reliability, we propose: 1) A permutation equivariant model architecture for better generalization, and 2) A robust offline Q-learning method that pessimistically penalizes model errors. These form the Permutation Equivariant Model-based Offline RL (PE-MORL) algorithm. Real-world experiments show that PE-MORL outperforms state-of-the-art auto-bidding methods.
zh
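For intuition, here is a minimal NumPy sketch of a DeepSets-style permutation-equivariant layer of the kind such an architecture builds on; the layer sizes, pooling choice, and activation are illustrative assumptions, not the paper's design:

```python
import numpy as np

def perm_equivariant_layer(X, W_self, W_mean, b):
    """Map a set of item features X (n_items, d_in) so that permuting the
    input rows permutes the output rows identically: each item sees its own
    features plus a permutation-invariant pooled summary of the whole set."""
    pooled = X.mean(axis=0, keepdims=True)            # (1, d_in), order-free
    return np.tanh(X @ W_self + pooled @ W_mean + b)  # (n_items, d_out)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 candidate bids, 8 features
W_self, W_mean = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
b = np.zeros(4)

out = perm_equivariant_layer(X, W_self, W_mean, b)
perm = rng.permutation(5)
out_perm = perm_equivariant_layer(X[perm], W_self, W_mean, b)
assert np.allclose(out[perm], out_perm)        # equivariance check passes
```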
[AI-82] Learning Reasoning Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents
Quick Read: This paper targets two limitations of existing graphical user interface (GUI) agent systems for automating digital tasks: they rely on trial-and-error decision making rather than progressive reasoning, so they cannot learn and adapt from interactions, and they are assessed with overly simplistic single-step accuracy metrics that do not reflect the complexity of real-world GUI interactions. The key to the solution is CogniGUI, a cognitive framework inspired by Kahneman's dual-process theory that pairs an omni-parser engine, which performs immediate hierarchical parsing of GUI elements, with a Group-based Relative Policy Optimization (GRPO) grounding agent that evaluates multiple interaction paths under a relative reward system, enabling adaptive, human-like learning.
Link: https://arxiv.org/abs/2506.17913
Authors: Jinjie Wei,Jiyao Liu,Lihao Liu,Ming Hu,Junzhi Ning,Mingcheng Li,Weijie Yin,Junjun He,Xiao Liang,Chao Feng,Dingkang Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Graphical User Interface (GUI) agents have made significant progress in automating digital tasks through the utilization of computer vision and language models. Nevertheless, existing agent systems encounter notable limitations. Firstly, they predominantly depend on trial-and-error decision making rather than progressive reasoning, thereby lacking the capability to learn and adapt from interactive encounters. Secondly, these systems are assessed using overly simplistic single-step accuracy metrics, which do not adequately reflect the intricate nature of real-world GUI interactions. In this paper, we present CogniGUI, a cognitive framework developed to overcome these limitations by enabling adaptive learning for GUI automation resembling human-like behavior. Inspired by Kahneman's Dual Process Theory, our approach combines two main components: (1) an omni-parser engine that conducts immediate hierarchical parsing of GUI elements through quick visual semantic analysis to identify actionable components, and (2) a Group-based Relative Policy Optimization (GRPO) grounding agent that assesses multiple interaction paths using a unique relative reward system, promoting minimal and efficient operational routes. This dual-system design facilitates iterative "exploration-learning-mastery" cycles, enabling the agent to enhance its strategies over time based on accumulated experience. Moreover, to assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi-application navigation, dynamic state transitions, and cross-interface coherence, which are often overlooked challenges in current benchmarks. Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods in both the current GUI grounding benchmarks and our newly proposed benchmark.
zh
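To make the GRPO idea concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO-style training; the rewards are hypothetical scores, and the paper's actual reward design is richer:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Score each sampled interaction path relative to the other paths in
    its group (mean-centered, std-normalized), replacing a learned value
    baseline with a within-group comparison."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four candidate interaction paths for one GUI task, rewarded by success
# and brevity (made-up numbers for illustration).
print(grpo_advantages([1.0, 0.0, 0.7, 0.2]))
```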
[AI-83] Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms
Quick Read: This paper addresses the massive, unstructured, and semantically ambiguous log data generated as AI systems on cloud platforms grow in scale and complexity, which greatly complicates fault localization and system self-repair. The key to the solution is LLM-ID, an intelligent log processing and autonomous debugging framework based on a Large Language Model (LLM). It extends an existing pre-trained Transformer with a multi-stage semantic inference mechanism for contextual understanding of system logs and automatic reconstruction of fault chains, combines a fine-tuned LLM with multi-round attention for contextual reasoning that generates fault hypotheses and root-cause paths, and adds a reinforcement-learning-driven, policy-guided recovery planner for dynamic decision-making and adaptive debugging in the cloud.
Link: https://arxiv.org/abs/2506.17900
Authors: Cheng Ji,Huaiying Luo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by 2025 8th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE 2025)
Abstract:With the increasing complexity and rapid expansion of AI systems on cloud platforms, the log data generated during system operation is massive, unstructured, and semantically ambiguous, which brings great challenges to fault localization and system self-repair. To solve this problem, this paper proposes an intelligent log processing and automatic debugging framework based on a Large Language Model (LLM), named Intelligent Debugger (LLM-ID). The method extends an existing pre-trained Transformer model and integrates a multi-stage semantic inference mechanism to realize contextual understanding of system logs and automatic reconstruction of fault chains. First, the system log is dynamically structured, and an unsupervised clustering and embedding mechanism is used to extract event templates and semantic schemas. Subsequently, the fine-tuned LLM is combined with a multi-round attention mechanism to perform contextual reasoning on the log sequence and generate potential fault hypotheses and root-cause paths. Furthermore, this paper introduces a reinforcement-learning-based, policy-guided recovery planner, driven by the remediation strategy generated by the LLM, to support dynamic decision-making and adaptive debugging in the cloud environment. Compared with existing rule engines and traditional log analysis systems, the proposed model has stronger semantic understanding, continuous learning ability, and adaptability to heterogeneous environments. Experiments on a cloud platform log dataset show that LLM-ID improves fault localization accuracy by 16.2%, significantly outperforming current mainstream methods.
zh
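The "dynamically structured logs" step can be pictured with a toy sketch; the regexes and log lines below are invented for illustration, and the paper uses embedding-based unsupervised clustering rather than hand-written rules:

```python
import re
from collections import defaultdict

def to_template(line):
    """Crude log-template extraction: mask IPs, hex ids, and numbers so
    lines that differ only in parameters collapse into one event template."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", line)
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

logs = [
    "worker 17 failed to mount volume 0x3f2a",
    "worker 8 failed to mount volume 0x99b0",
    "heartbeat from 10.0.0.12 received",
]
clusters = defaultdict(list)
for entry in logs:
    clusters[to_template(entry)].append(entry)
for template, members in clusters.items():
    print(template, "->", len(members), "occurrence(s)")
```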
[AI-84] Towards Robust Fact-Checking: A Multi-Agent System with Advanced Evidence Retrieval
Quick Read: This paper tackles the challenge that rapidly spreading misinformation poses to public discourse in the digital era, where traditional human-led fact-checking, though credible, cannot keep up with the volume and velocity of online content. The key to the solution is an automated fact-checking framework built as a multi-agent system of four specialized agents: an Input Ingestion Agent for claim decomposition, a Query Generation Agent for formulating targeted subqueries, an Evidence Retrieval Agent for sourcing credible evidence, and a Verdict Prediction Agent for synthesizing veracity judgments. On benchmark datasets (FEVEROUS, HOVER, SciFact) the system achieves a higher Macro F1-score than baseline methods, improving the accuracy, efficiency, and explainability of fact-checking.
Link: https://arxiv.org/abs/2506.17878
Authors: Tam Trinh,Manh Nguyen,Truong-Son Hy
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid spread of misinformation in the digital era poses significant challenges to public discourse, necessitating robust and scalable fact-checking solutions. Traditional human-led fact-checking methods, while credible, struggle with the volume and velocity of online content, prompting the integration of automated systems powered by Large Language Models (LLMs). However, existing automated approaches often face limitations, such as handling complex claims, ensuring source credibility, and maintaining transparency. This paper proposes a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability. The system comprises four specialized agents: an Input Ingestion Agent for claim decomposition, a Query Generation Agent for formulating targeted subqueries, an Evidence Retrieval Agent for sourcing credible evidence, and a Verdict Prediction Agent for synthesizing veracity judgments with human-interpretable explanations. Evaluated on benchmark datasets (FEVEROUS, HOVER, SciFact), the proposed system achieves a 12.3% improvement in Macro F1-score over baseline methods. The system effectively decomposes complex claims, retrieves reliable evidence from trusted sources, and generates transparent explanations for verification decisions. Our approach contributes to the growing field of automated fact-checking by providing a more accurate, efficient, and transparent verification methodology that aligns with human fact-checking practices while maintaining scalability for real-world applications. Our source code is available at this https URL
zh
[AI-85] NestQuant: Post-Training Integer-Nesting Quantization for On-Device DNN
Quick Read: This paper addresses two limitations of post-training quantization (PTQ) for resource adaptation when deploying quantized deep neural network (DNN) models on resource-constrained Internet of Things (IoT) devices, given that dynamic/mixed-precision quantization requires retraining or special hardware: (i) existing PTQ methods provide only one fixed-bitwidth model, which cannot adapt to the dynamic resources of IoT devices; and (ii) deploying multiple PTQ models of different bitwidths consumes large storage and switching overheads. The key to the solution is NestQuant, a resource-friendly post-training integer-nesting quantization: integer weight decomposition bit-wise splits quantized weights into higher-bit and lower-bit integer parts, and a decomposed-weights nesting mechanism optimizes the higher-bit weights by adaptive rounding and nests them back into the original quantized weights. At deployment, a single NestQuant model can then switch between full-bit and part-bit variants by paging lower-bit weights in or out, adapting to resource changes while reducing consumption.
Link: https://arxiv.org/abs/2506.17870
Authors: Jianhang Xie,Chuntao Ding,Xiaqing Li,Shenyuan Ren,Yidong Li,Zhichao Lu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: IEEE Transactions on Mobile Computing, accepted manuscript, DOI: https://doi.org/10.1109/TMC.2025.3582583; Code: this https URL
Abstract:Deploying quantized deep neural network (DNN) models with resource adaptation capabilities on ubiquitous Internet of Things (IoT) devices to provide high-quality AI services can leverage the benefits of compression and meet multi-scenario resource requirements. However, existing dynamic/mixed precision quantization requires retraining or special hardware, whereas post-training quantization (PTQ) has two limitations for resource adaptation: (i) The state-of-the-art PTQ methods only provide one fixed bitwidth model, which makes it challenging to adapt to the dynamic resources of IoT devices; (ii) Deploying multiple PTQ models with diverse bitwidths consumes large storage resources and switching overheads. To this end, this paper introduces a resource-friendly post-training integer-nesting quantization, i.e., NestQuant, for on-device quantized model switching on IoT devices. The proposed NestQuant incorporates the integer weight decomposition, which bit-wise splits quantized weights into higher-bit and lower-bit weights of integer data types. It also contains a decomposed weights nesting mechanism to optimize the higher-bit weights by adaptive rounding and nest them into the original quantized weights. In deployment, we can send and store only one NestQuant model and switch between the full-bit/part-bit model by paging in/out lower-bit weights to adapt to resource changes and reduce consumption. Experimental results on the ImageNet-1K pretrained DNNs demonstrated that the NestQuant model can achieve high performance in top-1 accuracy, and reduce in terms of data transmission, storage consumption, and switching overheads. In particular, the ResNet-101 with INT8 nesting INT6 can achieve 78.1% and 77.9% accuracy for full-bit and part-bit models, respectively, and reduce switching overheads by approximately 78.1% compared with diverse bitwidths PTQ models.
zh
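A minimal NumPy sketch of the bit-wise decomposition behind the NestQuant entry above; this illustrates only the exact split-and-restore identity, while the paper's adaptive rounding of the higher-bit part is omitted:

```python
import numpy as np

def nest_split(w, low_bits=2):
    """Split signed integer weights so that w == high * 2**low_bits + low.
    Paging the `low` tensor out leaves a coarser (e.g. INT6-equivalent)
    model; paging it back in restores the full INT8 weights exactly."""
    low = w & ((1 << low_bits) - 1)   # lower-bit residual in [0, 2**low_bits)
    high = w >> low_bits              # arithmetic shift keeps the sign
    return high, low

w = np.array([-128, -100, -3, 0, 57, 127], dtype=np.int8).astype(np.int32)
high, low = nest_split(w, low_bits=2)        # high part uses 6 of the 8 bits
assert np.array_equal(high * 4 + low, w)     # exact reconstruction
```

The identity `w == (w >> k) * 2**k + (w & (2**k - 1))` holds for two's-complement integers, which is what makes lossless paging of the lower-bit weights possible.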
[AI-86] In-Context Learning Strategies Emerge Rationally
Quick Read: This paper asks why models trained for in-context learning (ICL) end up learning such disparate strategies. The key to the solution is a unified framework that captures the strategies learned under mixture-of-tasks training as a family of Bayesian predictors: a memorizing predictor with a discrete prior on the set of seen tasks, and a generalizing predictor whose prior matches the underlying task distribution. Borrowing the rational-analysis lens from cognitive science, the authors develop a hierarchical Bayesian framework that almost perfectly predicts Transformer next-token predictions throughout training without access to the model's weights. The framework makes explicit a trade-off between a strategy's loss and its complexity, explaining known ICL phenomena and offering novel predictions.
Link: https://arxiv.org/abs/2506.17859
Authors: Daniel Wurgaft,Ekdeep Singh Lubana,Core Francisco Park,Hidenori Tanaka,Gautam Reddy,Noah D. Goodman
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why a model learns these disparate strategies in the first place. Specifically, we start with the observation that when trained to learn a mixture of tasks, as is popular in the literature, the strategies learned by a model for performing ICL can be captured by a family of Bayesian predictors: a memorizing predictor, which assumes a discrete prior on the set of seen tasks, and a generalizing predictor, wherein the prior matches the underlying task distribution. Adopting the lens of rational analysis from cognitive science, where a learner’s behavior is explained as an optimal adaptation to data given computational constraints, we develop a hierarchical Bayesian framework that almost perfectly predicts Transformer next token predictions throughout training without assuming access to its weights. Under this framework, pretraining is viewed as a process of updating the posterior probability of different strategies, and its inference-time behavior as a posterior-weighted average over these strategies’ predictions. Our framework draws on common assumptions about neural network learning dynamics, which make explicit a tradeoff between loss and complexity among candidate strategies: beyond how well it explains the data, a model’s preference towards implementing a strategy is dictated by its complexity. This helps explain well-known ICL phenomena, while offering novel predictions: e.g., we show a superlinear trend in the timescale for transition to memorization as task diversity is increased. Overall, our work advances an explanatory and predictive account of ICL grounded in tradeoffs between strategy loss and complexity.
zh
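A toy sketch of the posterior-weighted-average view in the entry above; all numbers are hypothetical, and the complexity penalty is an MDL-style stand-in for the paper's loss-complexity trade-off:

```python
import numpy as np

def posterior_weights(loglik_mem, loglik_gen, complexity_mem, complexity_gen):
    """Posterior over two strategies (memorizing vs. generalizing): each
    strategy's log-likelihood of the data is discounted by its complexity,
    and the model's behaviour is a posterior-weighted mix of their predictions."""
    log_post = np.array([loglik_mem - complexity_mem,
                         loglik_gen - complexity_gen])
    log_post -= log_post.max()                # numerical stabilization
    w = np.exp(log_post)
    return w / w.sum()                        # [P(memorize), P(generalize)]

# With few seen tasks, memorization fits well and stays cheap to describe...
print(posterior_weights(loglik_mem=-10.0, loglik_gen=-14.0,
                        complexity_mem=2.0, complexity_gen=5.0))
# ...but as task diversity grows, its description length explodes and the
# generalizing strategy wins, mirroring the predicted transition.
print(posterior_weights(loglik_mem=-10.0, loglik_gen=-14.0,
                        complexity_mem=20.0, complexity_gen=5.0))
```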
[AI-87] Pathway-based Progressive Inference (PaPI) for Energy-Efficient Continual Learning
Quick Read: This paper addresses the dual challenge in continual learning of preventing catastrophic forgetting while remaining energy-efficient, particularly in resource-constrained environments. The key to the solution is Pathway-based Progressive Inference (PaPI), which balances stability and plasticity through mathematically rigorous pathway selection and adaptation. PaPI formulates continual learning as an energy-constrained optimization problem, derives tight bounds on forgetting rates via Fisher Information Matrix analysis, and proves that its energy consumption scales with the number of active parameters rather than the total model size, achieving low forgetting at high energy efficiency.
Link: https://arxiv.org/abs/2506.17848
Authors: Suyash Gaurav,Jukka Heikkonen,Jatin Chaudhary
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Continual learning systems face the dual challenge of preventing catastrophic forgetting while maintaining energy efficiency, particularly in resource-constrained environments. This paper introduces Pathway-based Progressive Inference (PaPI), a novel theoretical framework that addresses these challenges through a mathematically rigorous approach to pathway selection and adaptation. We formulate continual learning as an energy-constrained optimization problem and provide formal convergence guarantees for our pathway routing mechanisms. Our theoretical analysis demonstrates that PaPI achieves an $\mathcal{O}(K)$ improvement in the stability-plasticity trade-off compared to monolithic architectures, where $K$ is the number of pathways. We derive tight bounds on forgetting rates using Fisher Information Matrix analysis and prove that PaPI's energy consumption scales with the number of active parameters rather than the total model size. Comparative theoretical analysis shows that PaPI provides stronger guarantees against catastrophic forgetting than Elastic Weight Consolidation (EWC) while maintaining better energy efficiency than both EWC and Gradient Episodic Memory (GEM). Our experimental validation confirms these theoretical advantages across multiple benchmarks, demonstrating PaPI's effectiveness for continual learning in energy-constrained settings. Our codes are available at this https URL.
zh
[AI-88] A Comparative Study of Open-Source Libraries for Synthetic Tabular Data Generation: SDV vs. SynthCity
Quick Read: This paper studies how to generate high-quality synthetic data for training machine learning models under limited data, a setting where small organizations and early-stage startups often cannot obtain real, high-quality data. The key to the solution is synthetic data generators that replicate the statistical and structural properties of real data while preserving privacy and scalability. The study evaluates six tabular synthetic data generators from the open-source libraries SDV and Synthicity, analyzing performance on two criteria: statistical similarity and predictive utility.
Link: https://arxiv.org/abs/2506.17847
Authors: Cristian Del Gobbo
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 5 figures, and 6 tables
Abstract:High-quality training data is critical to the performance of machine learning models, particularly Large Language Models (LLMs). However, obtaining real, high-quality data can be challenging, especially for smaller organizations and early-stage startups. Synthetic data generators provide a promising solution by replicating the statistical and structural properties of real data while preserving privacy and scalability. This study evaluates the performance of six tabular synthetic data generators from two widely used open-source libraries: SDV (Gaussian Copula, CTGAN, TVAE) and Synthicity (Bayesian Network, CTGAN, TVAE). Using a real-world dataset from the UCI Machine Learning Repository, comprising energy consumption and environmental variables from Belgium, we simulate a low-data regime by training models on only 1,000 rows. Each generator is then tasked with producing synthetic datasets under two conditions: a 1:1 (1,000 rows) and a 1:10 (10,000 rows) input-output ratio. Evaluation is conducted using two criteria: statistical similarity, measured via classical statistics and distributional metrics; and predictive utility, assessed using a “Train on Synthetic, Test on Real” approach with four regression models. While statistical similarity remained consistent across models in both scenarios, predictive utility declined notably in the 1:10 case. The Bayesian Network from Synthicity achieved the highest fidelity in both scenarios, while TVAE from SDV performed best in predictive tasks under the 1:10 setting. Although no significant performance gap was found between the two libraries, SDV stands out for its superior documentation and ease of use, making it more accessible for practitioners.
zh
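For readers who want to reproduce the 1:10 input-output setup with one of the evaluated generators, a minimal sketch using SDV's Gaussian Copula follows; the API shown is SDV 1.x as I understand it (verify against the current SDV docs), and the two-column table merely stands in for the Belgian energy dataset:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Tiny real table standing in for the 1,000-row low-data regime.
real = pd.DataFrame({
    "appliances_wh": [60, 230, 90, 430, 120, 75],
    "temp_c": [19.9, 20.3, 21.1, 20.7, 19.5, 21.4],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)          # infer column types

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real)                               # learn marginals + copula

synthetic = synth.sample(num_rows=10 * len(real))   # 1:10 input-output ratio
print(synthetic.describe())
```

A "Train on Synthetic, Test on Real" check would then fit a regressor on `synthetic` and score it on held-out real rows, mirroring the paper's predictive-utility criterion.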
[AI-89] Out of Control – Why Alignment Needs Formal Control Theory (and an Alignment Control Stack) NEURIPS2025
Quick Read: This position paper argues that AI alignment research lacks a unified, generalizable control framework and that different alignment/control protocols are not interoperable. The key to the solution is recasting alignment as a problem in formal optimal control theory and introducing a layered Alignment Control Stack, from physical to socio-technical layers, that defines the measurement and control characteristics of each layer and their formal interoperability, giving a fuller understanding of the potential and limits of controlling frontier models and agentic AI systems.
Link: https://arxiv.org/abs/2506.17846
Authors: Elija Perrier
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Under review for Neurips 2025
Abstract:This position paper argues that formal optimal control theory should be central to AI alignment research, offering a distinct perspective from prevailing AI safety and security approaches. While recent work in AI safety and mechanistic interpretability has advanced formal methods for alignment, they often fall short of the generalisation required of control frameworks for other technologies. There is also a lack of research into how to render different alignment/control protocols interoperable. We argue that by recasting alignment through principles of formal optimal control, and by framing alignment in terms of a hierarchical stack from physical to socio-technical layers according to which controls may be applied, we can develop a better understanding of the potential and limitations for controlling frontier models and agentic AI systems. To this end, we introduce an Alignment Control Stack which sets out a hierarchical layered alignment stack, identifying measurement and control characteristics at each layer and how different layers are formally interoperable. We argue that such analysis is also key to the assurances that will be needed by governments and regulators in order to see AI technologies sustainably benefit the community. Our position is that doing so will bridge the well-established and empirically validated methods of optimal control with practical deployment considerations to create a more comprehensive alignment framework, enhancing how we approach safety and reliability for advanced AI systems.
zh
[AI-90] Generative Grasp Detection and Estimation with Concept Learning-based Safety Criteria
Quick Read: This paper addresses the black-box nature of highly complex neural networks used by collaborative robots (cobots) in safety-critical applications, which undermines transparency and reliability. The key to the solution is a pipeline that integrates an explainable AI method: it extracts the features the model has learned and correlates them with the corresponding classes in the input, providing explanations for the model's predictions that then serve as additional criteria for the safe handling of work tools.
Link: https://arxiv.org/abs/2506.17842
Authors: Al-Harith Farhad,Khalil Abuibaid,Christiane Plociennik,Achim Wagner,Martin Ruskowski
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: RAAD 2025: 34th International Conference on Robotics in Alpe-Adria-Danube Region
Abstract:Neural networks are often regarded as universal equations that can estimate any function. This flexibility, however, comes with the drawback of high complexity, rendering these networks into black box models, which is especially relevant in safety-centric applications. To that end, we propose a pipeline for a collaborative robot (Cobot) grasping algorithm that detects relevant tools and generates the optimal grasp. To increase the transparency and reliability of this approach, we integrate an explainable AI method that provides an explanation for the underlying prediction of a model by extracting the learned features and correlating them to corresponding classes from the input. These concepts are then used as additional criteria to ensure the safe handling of work tools. In this paper, we show the consistency of this approach and the criterion for improving the handover position. This approach was tested in an industrial environment, where a camera system was set up to enable a robot to pick up certain tools and objects.
zh
[AI-91] Causal Spherical Hypergraph Networks for Modelling Social Uncertainty
Quick Read: This paper addresses learning under uncertainty in dynamic social environments: predicting human social behaviour requires modelling higher-order structure, directional influence, and epistemic uncertainty. The key to the solution is Causal Spherical Hypergraph Networks (Causal-SphHN), which represent individuals as hyperspherical embeddings and group contexts as hyperedges, capturing semantic and relational geometry. Uncertainty is quantified via Shannon entropy over von Mises-Fisher distributions, temporal causal dependencies are identified with Granger-informed subgraphs, and information propagates through an angular message-passing mechanism that respects belief dispersion and directional semantics, improving predictive accuracy, robustness, and calibration.
Link: https://arxiv.org/abs/2506.17840
Authors: Anoushka Harit,Zhongtian Sun
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Human social behaviour is governed by complex interactions shaped by uncertainty, causality, and group dynamics. We propose Causal Spherical Hypergraph Networks (Causal-SphHN), a principled framework for socially grounded prediction that jointly models higher-order structure, directional influence, and epistemic uncertainty. Our method represents individuals as hyperspherical embeddings and group contexts as hyperedges, capturing semantic and relational geometry. Uncertainty is quantified via Shannon entropy over von Mises-Fisher distributions, while temporal causal dependencies are identified using Granger-informed subgraphs. Information is propagated through an angular message-passing mechanism that respects belief dispersion and directional semantics. Experiments on SNARE (offline networks), PHEME (online discourse), and AMIGOS (multimodal affect) show that Causal-SphHN improves predictive accuracy, robustness, and calibration over strong baselines. Moreover, it enables interpretable analysis of influence patterns and social ambiguity. This work contributes a unified causal-geometric approach for learning under uncertainty in dynamic social environments.
zh
[AI-92] Reflective Verbal Reward Design for Pluralistic Alignment IJCAI2025
Quick Read: This paper addresses the problem that aligning AI agents via reinforcement learning from human feedback (RLHF) with a single aggregated reward model risks suppressing minority preferences, since human values are heterogeneous and sometimes conflicting. The key to the solution is a personalized reward modeling approach: a language model guides users through reflective dialogues in which they critique agent behavior and construct their preferences, and this reflection history then serves as context for another language model acting as an individualized "verbal reward model" that evaluates new trajectories.
Link: https://arxiv.org/abs/2506.17834
Authors: Carter Blair,Kate Larson,Edith Law
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 3 figures, accepted to the IJCAI 2025 Human-Centred AI track. Project repository at: this https URL
Abstract:AI agents are commonly aligned with “human values” through reinforcement learning from human feedback (RLHF), where a single reward model is learned from aggregated human feedback and used to align an agent’s behavior. However, human values are not homogeneous–different people hold distinct and sometimes conflicting values. Aggregating feedback into a single reward model risks disproportionately suppressing minority preferences. To address this, we present a novel reward modeling approach for learning individualized reward models. Our approach uses a language model to guide users through reflective dialogues where they critique agent behavior and construct their preferences. This personalized dialogue history, containing the user’s reflections and critiqued examples, is then used as context for another language model that serves as an individualized reward function (what we call a “verbal reward model”) for evaluating new trajectories. In studies with 30 participants, our method achieved a 9-12% improvement in accuracy over non-reflective verbal reward models while being more sample efficient than traditional supervised learning methods.
zh
[AI-93] Actionable Interpretability via Causal Hypergraphs: Unravelling Batch Size Effects in Deep Learning
Quick Read: This paper investigates the causal mechanisms by which batch size affects generalisation, which remain underexplored in graph and text domains. The key to the solution is HGCNet, a hypergraph-based causal framework that uses deep structural causal models (DSCMs) to uncover how batch size influences generalisation via gradient noise, minima sharpness, and model complexity. Hypergraphs capture higher-order interactions across training dynamics, and do-calculus quantifies the direct and mediated effects of batch-size interventions, yielding interpretable, causally grounded insights into optimisation.
Link: https://arxiv.org/abs/2506.17826
Authors: Zhongtian Sun,Anoushka Harit,Pietro Lio
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:While the impact of batch size on generalisation is well studied in vision tasks, its causal mechanisms remain underexplored in graph and text domains. We introduce a hypergraph-based causal framework, HGCNet, that leverages deep structural causal models (DSCMs) to uncover how batch size influences generalisation via gradient noise, minima sharpness, and model complexity. Unlike prior approaches based on static pairwise dependencies, HGCNet employs hypergraphs to capture higher-order interactions across training dynamics. Using do-calculus, we quantify direct and mediated effects of batch size interventions, providing interpretable, causally grounded insights into optimisation. Experiments on citation networks, biomedical text, and e-commerce reviews show that HGCNet outperforms strong baselines including GCN, GAT, PI-GNN, BERT, and RoBERTa. Our analysis reveals that smaller batch sizes causally enhance generalisation through increased stochasticity and flatter minima, offering actionable interpretability to guide training strategies in deep learning. This work positions interpretability as a driver of principled architectural and optimisation choices beyond post hoc analysis.
zh
[AI-94] Learning to Dock: A Simulation-based Study on Closing the Sim2Real Gap in Autonomous Underwater Docking
Quick Read: This paper addresses the simulation-to-reality (sim2real) gap in autonomous docking for Autonomous Underwater Vehicles (AUVs) operating in dynamic and uncertain environments. The key to the solution is a simulation study that trains a variety of controllers and evaluates them under realistic disturbances, focusing on the real-world challenge of docking under payloads potentially outside the training distribution, and exploring robustness-enhancing methods including randomization techniques and history-conditioned controllers.
Link: https://arxiv.org/abs/2506.17823
Authors: Kevin Chang,Rakesh Vivekanandan,Noah Pragin,Sean Bullock,Geoffrey Hollinger
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Advancing Quantitative and Qualitative Simulators for Marine Applications Workshop Paper at International Conference on Robotics and Automation 2025
Abstract:Autonomous Underwater Vehicle (AUV) docking in dynamic and uncertain environments is a critical challenge for underwater robotics. Reinforcement learning is a promising method for developing robust controllers, but the disparity between training simulations and the real world, or the sim2real gap, often leads to a significant deterioration in performance. In this work, we perform a simulation study on reducing the sim2real gap in autonomous docking through training various controllers and then evaluating them under realistic disturbances. In particular, we focus on the real-world challenge of docking under different payloads that are potentially outside the original training distribution. We explore existing methods for improving robustness including randomization techniques and history-conditioned controllers. Our findings provide insights into mitigating the sim2real gap when training docking controllers. Furthermore, our work indicates areas of future research that may be beneficial to the marine robotics community.
zh
[AI-95] CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning
Quick Read: This paper addresses the limited effectiveness of music foundation models in cross-cultural music representation learning and understanding, particularly for non-Western musical traditions. The key to the solution is a two-stage continual pre-training strategy that integrates learning rate re-warming and re-decaying, enabling stable multi-cultural adaptation even with limited computational resources. The work also explores task arithmetic as an alternative adaptation method, merging single-culture adapted models in weight space; this performs on par with the multi-culturally trained model on non-Western tasks.
Link: https://arxiv.org/abs/2506.17818
Authors: Angelos-Nikolaos Kanatas,Charilaos Papaioannou,Alexandros Potamianos
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 4 figures, accepted to the 26th International Society for Music Information Retrieval conference (ISMIR 2025), to be held in Daejeon, South Korea
Abstract:Recent advances in music foundation models have improved audio representation learning, yet their effectiveness across diverse musical traditions remains limited. We introduce CultureMERT-95M, a multi-culturally adapted foundation model developed to enhance cross-cultural music representation learning and understanding. To achieve this, we propose a two-stage continual pre-training strategy that integrates learning rate re-warming and re-decaying, enabling stable adaptation even with limited computational resources. Training on a 650-hour multi-cultural data mix, comprising Greek, Turkish, and Indian music traditions, results in an average improvement of 4.9% in ROC-AUC and AP across diverse non-Western music auto-tagging tasks, surpassing prior state-of-the-art, with minimal forgetting on Western-centric benchmarks. We further investigate task arithmetic, an alternative approach to multi-cultural adaptation that merges single-culture adapted models in the weight space. Task arithmetic performs on par with our multi-culturally trained model on non-Western auto-tagging tasks and shows no regression on Western datasets. Cross-cultural evaluation reveals that single-culture models transfer with varying effectiveness across musical traditions, whereas the multi-culturally adapted model achieves the best overall performance. To support research on world music representation learning, we publicly release CultureMERT-95M and CultureMERT-TA-95M, fostering the development of more culturally aware music foundation models.
zh
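The task-arithmetic alternative mentioned above has a simple weight-space form; here is a minimal sketch under the assumption that checkpoints are dicts of arrays (the two-parameter "models" and the scale are illustrative, not the paper's setup):

```python
import numpy as np

def task_arithmetic_merge(base, adapted_models, scale=1.0):
    """Merge single-culture adapted checkpoints in weight space: average
    the task vectors (adapted - base) and add the mean back to the base,
    the standard task-arithmetic recipe for model merging."""
    merged = {}
    for name, w0 in base.items():
        task_vecs = [m[name] - w0 for m in adapted_models]
        merged[name] = w0 + scale * np.mean(task_vecs, axis=0)
    return merged

# Toy checkpoints for a base model and three culture-adapted variants.
base = {"w": np.array([0.0, 1.0])}
greek = {"w": np.array([0.2, 1.1])}
turkish = {"w": np.array([0.1, 0.8])}
indian = {"w": np.array([0.3, 1.2])}
print(task_arithmetic_merge(base, [greek, turkish, indian])["w"])
```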
[AI-96] RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models
Quick Read: This paper targets the robustness and generalization of Vision-Language-Action (VLA) models in unstructured real-world environments. The key to the solution is RoboMonkey, a test-time scaling framework based on sampling and verification: at deployment it samples a small set of actions from the VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action, improving reliability and performance.
Link: https://arxiv.org/abs/2506.17811
Authors: Jacky Kwok,Christopher Agia,Rohan Sinha,Matt Foutter,Shulu Li,Ion Stoica,Azalia Mirhoseini,Marco Pavone
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 8% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
zh
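The sample-perturb-vote step of the entry above can be sketched in a few lines; the bin count, noise scale, and toy policy are assumptions for illustration, and the VLM-based verifier that would pick among top candidates is omitted:

```python
import numpy as np

def propose_action(sample_action, n_samples=8, sigma=0.01, n_bins=5):
    """Draw a few actions from the policy, add Gaussian perturbations,
    discretize each dimension, and return the centroid of the most
    populated bin as a majority-voted robust proposal."""
    actions = np.stack([sample_action() for _ in range(n_samples)])
    actions += np.random.normal(0.0, sigma, size=actions.shape)
    lo, hi = actions.min(0), actions.max(0) + 1e-9
    bins = np.floor((actions - lo) / (hi - lo) * n_bins).astype(int)
    keys, counts = np.unique(bins, axis=0, return_counts=True)
    winner = keys[counts.argmax()]
    mask = (bins == winner).all(axis=1)
    return actions[mask].mean(axis=0)

rng = np.random.default_rng(1)
# Hypothetical 3-DoF action sampler standing in for a VLA policy head.
print(propose_action(lambda: rng.normal([0.1, -0.2, 0.05], 0.02)))
```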
[AI-97] Reimagining Parameter Space Exploration with Diffusion Models ICML2025
Quick Read: This paper addresses the cost of adapting neural networks to new tasks, which normally requires time-consuming, label-dependent task-specific fine-tuning. The key to the solution is a generative alternative: diffusion models learn the latent structure of the space of effective task-specific parameters and synthesize parameters on demand, so that once trained, the task-conditioned diffusion model can generate specialized weights directly from task identifiers without any task-specific training.
Link: https://arxiv.org/abs/2506.17807
Authors: Lijun Zhang,Xiao Liu,Hui Guan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025 EXAIT Workshop
Abstract:Adapting neural networks to new tasks typically requires task-specific fine-tuning, which is time-consuming and reliant on labeled data. We explore a generative alternative that produces task-specific parameters directly from task identity, eliminating the need for task-specific training. To this end, we propose using diffusion models to learn the underlying structure of effective task-specific parameter space and synthesize parameters on demand. Once trained, the task-conditioned diffusion model can generate specialized weights directly from task identifiers. We evaluate this approach across three scenarios: generating parameters for a single seen task, for multiple seen tasks, and for entirely unseen tasks. Experiments show that diffusion models can generate accurate task-specific parameters and support multi-task interpolation when parameter subspaces are well-structured, but fail to generalize to unseen tasks, highlighting both the potential and limitations of this generative solution.
zh
[AI-98] Efficient Strategy Synthesis for MDPs via Hierarchical Block Decomposition
Quick Read: This paper addresses the poor scalability of conventional policy synthesis methods on the large state spaces common in software-intensive systems such as software product lines and robotics. The key to the solution is to dynamically refine the Markov decision process (MDP) and iteratively select the most fragile MDP regions for refinement, so that refinement happens only when necessary, balancing accuracy and efficiency.
Link: https://arxiv.org/abs/2506.17792
Authors: Alexandros Evangelidis,Gricel Vázquez,Simos Gerasimou
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Comments:
Abstract:Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional policy synthesis methods, they fail to scale to large state spaces. Our approach addresses this issue and accelerates policy synthesis in large MDPs by dynamically refining the MDP and iteratively selecting the most fragile MDP regions for refinement. This iterative procedure offers a balance between accuracy and efficiency, as refinement occurs only when necessary. Through a comprehensive empirical evaluation comprising diverse case studies and MDPs up to 1M states, we demonstrate significant performance improvements yielded by our approach compared to the leading probabilistic model checker PRISM (up to 2x), thus offering a very competitive solution for real-world policy synthesis tasks in larger MDPs.
zh
[AI-99] AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction
Quick Read: This paper addresses the limited adaptability and flexibility of static or graph-based communication topologies in multi-agent collaboration. The key to the solution is a framework that rethinks multi-agent coordination through a sequential rather than graph structure, with two core components: Next-Agent Prediction, which selects the most suitable agent role at each step, and Next-Context Selection (NCS), which lets each agent selectively access relevant information from any previous step. Together these build task-adaptive communication pipelines that improve role flexibility and global information flow while substantially reducing communication overhead.
Link: https://arxiv.org/abs/2506.17784
Authors: Song Wang,Zhen Tan,Zihan Chen,Shuang Zhou,Tianlong Chen,Jundong Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, lacking the potential adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
zh
[AI-100] Expanding Relevance Judgments for Medical Case-based Retrieval Task with Multimodal LLM s SIGIR2025
Quick Read: This paper addresses the reliance of Information Retrieval (IR) evaluation on costly, time-consuming manual relevance judgments (qrels). The key to the solution is using a Multimodal Large Language Model (MLLM) to expand relevance judgments at scale: an iteratively refined, structured prompting strategy combining binary scoring, instruction-based evaluation, and few-shot learning lets Gemini 1.5 Pro simulate human assessment on the ImageCLEFmed 2013 case-based retrieval task, greatly enlarging the dataset and the number of relevant annotations and demonstrating the potential of MLLMs for medical and multimodal IR evaluation.
Link: https://arxiv.org/abs/2506.17782
Authors: Catarina Pires,Sérgio Nunes,Luís Filipe Teixeira
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: To appear at the Third Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2025), co-located with SIGIR 2025. 9 pages, 2 figures, 5 tables
Abstract:Evaluating Information Retrieval (IR) systems relies on high-quality manual relevance judgments (qrels), which are costly and time-consuming to obtain. While pooling reduces the annotation effort, it results in only partially labeled datasets. Large Language Models (LLMs) offer a promising alternative to reducing reliance on manual judgments, particularly in complex domains like medical case-based retrieval, where relevance assessment requires analyzing both textual and visual information. In this work, we explore using a Multimodal Large Language Model (MLLM) to expand relevance judgments, creating a new dataset of automated judgments. Specifically, we employ Gemini 1.5 Pro on the ImageCLEFmed 2013 case-based retrieval task, simulating human assessment through an iteratively refined, structured prompting strategy that integrates binary scoring, instruction-based evaluation, and few-shot learning. We systematically experimented with various prompt configurations to maximize agreement with human judgments. To evaluate agreement between the MLLM and human judgments, we use Cohen’s Kappa, achieving a substantial agreement score of 0.6, comparable to inter-annotator agreement typically observed in multimodal retrieval tasks. Starting from the original 15,028 manual judgments (4.72% relevant) across 35 topics, our MLLM-based approach expanded the dataset by over 37x to 558,653 judgments, increasing relevant annotations to 5,950. On average, each medical case query received 15,398 new annotations, with approximately 99% being non-relevant, reflecting the high sparsity typical in this domain. Our results demonstrate the potential of MLLMs to scale relevance judgment collection, offering a promising direction for supporting retrieval evaluation in medical and multimodal IR tasks.
zh
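The Cohen's Kappa agreement score used to validate the MLLM judgments above is straightforward to compute for binary labels; the sketch below uses made-up judgment vectors:

```python
def cohens_kappa(human, model):
    """Chance-corrected agreement between binary human and MLLM relevance
    judgments: kappa = (p_observed - p_chance) / (1 - p_chance)."""
    assert len(human) == len(model)
    n = len(human)
    po = sum(h == m for h, m in zip(human, model)) / n   # observed agreement
    p_h1, p_m1 = sum(human) / n, sum(model) / n
    pe = p_h1 * p_m1 + (1 - p_h1) * (1 - p_m1)           # chance agreement
    return (po - pe) / (1 - pe)

human = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # hypothetical manual qrels
mllm  = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # hypothetical MLLM qrels
print(round(cohens_kappa(human, mllm), 3))  # ~0.583, "substantial" range
```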
[AI-101] Toward Autonomous UI Exploration: The UIExplorer Benchmark
Quick Read: This paper addresses the lack of systematic evaluation of user interface (UI) exploration, which autonomous agents need for reliable task solving. The key to the solution is UIExplore-Bench, the first benchmark dedicated to UI exploration: it evaluates agents in Structured mode (with access to layout information such as DOM trees) or Screen mode (GUI-only observations such as screenshots with human-like mouse/keyboard interaction) across three levels in a standardized GitLab sandbox, and introduces human-normalized UI-Functionalities Observed (hUFO) as a metric quantifying exploration effectiveness, laying a foundation for research on efficient UI exploration strategies.
Link: https://arxiv.org/abs/2506.17779
Authors: Andrei Cristian Nica,Akshaya Vishnu Kudlu Shanbhogue,Harshil Shah,Aleix Cambray,Tudor Berariu,Lucas Maystre,David Barber
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous agents must know how to explore user interfaces (UIs) for reliable task solving, yet systematic evaluation of this crucial phase is lacking. We introduce UIExplore-Bench, the first benchmark explicitly dedicated to UI exploration. The benchmark evaluates agents with either Structured mode (granting access to layout information like DOM trees) or Screen mode (relying on GUI-only observations such as screenshots and human-like mouse/keyboard interactions) across three levels in a standardized GitLab sandbox environment. We formalize exploration as the process of maximizing the set of actionable UI components discovered and propose a metric, human-normalized UI-Functionalities Observed (hUFO), to quantify the effectiveness of exploration. Our results show that UIExplore-AlGo achieves the leading mean hUFO scores, reaching up to 77.2% of human performance in Structured mode and 59.0% in Screen mode at 2,000 steps, particularly excelling at the Sparse level. The results highlight the relevance of our benchmark, as current agents show a substantial performance gap compared to one hour of human expert exploration, indicating ample room for future advancements. We publicly release the benchmark environment, an exploration dataset, and an evaluation suite to catalyze research into efficient UI exploration strategies and their downstream applications, such as experience-driven task completion and automated training data generation.
zh
[AI-102] Machine Learning Model Integration with Open World Temporal Logic for Process Automation
Quick Read: This paper addresses how to turn the perceptual or extractive outputs of machine learning (ML) models into actionable, reasoned decisions within complex operational workflows. The key to the solution is integrating the outputs of multiple ML models directly into PyReason, an open-world temporal logic programming reasoning engine. Grounded in generalized annotated logic, PyReason seamlessly incorporates real-valued outputs (e.g., probabilities, confidence scores) from diverse ML models as truth intervals in its logical framework, and provides Python mechanisms to continuously poll ML model outputs, convert them into logical facts, and dynamically recompute the minimal model, enabling real-time adaptive decision-making.
Link: https://arxiv.org/abs/2506.17776
Authors: Dyuman Aditya,Colton Payne,Mario Leiva,Paulo Shakarian
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:Recent advancements in Machine Learning (ML) have yielded powerful models capable of extracting structured information from diverse and complex data sources. However, a significant challenge lies in translating these perceptual or extractive outputs into actionable, reasoned decisions within complex operational workflows. To address these challenges, this paper introduces a novel approach that integrates the outputs from various machine learning models directly with the PyReason framework, an open-world temporal logic programming reasoning engine. PyReason's foundation in generalized annotated logic allows for the seamless incorporation of real-valued outputs (e.g., probabilities, confidence scores) from diverse ML models, treating them as truth intervals within its logical framework. Crucially, PyReason provides mechanisms, implemented in Python, to continuously poll ML model outputs, convert them into logical facts, and dynamically recompute the minimal model, ensuring real-time adaptive decision-making. Furthermore, its native support for temporal reasoning, knowledge graph integration, and fully explainable interface traces enables sophisticated analysis over time-sensitive process data and existing organizational knowledge. By combining the strengths of perception and extraction from ML models with the logical deduction and transparency of PyReason, we aim to create a powerful system for automating complex processes. This integration finds utility across numerous domains, including manufacturing, healthcare, and business operations.
zh
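Conceptually, the "poll model outputs, convert to truth-interval facts" loop looks like the sketch below. This is plain Python illustrating the idea, not the actual PyReason API; the predicate name, slack, and stubbed reasoner call are all assumptions:

```python
import random
import time

def confidence_to_interval(p, slack=0.05):
    """Map an ML confidence score to a truth interval [l, u], the kind of
    annotation generalized annotated logic reasons over."""
    return max(0.0, p - slack), min(1.0, p + slack)

def poll_model_outputs(model_fn, n_ticks=3):
    """At each tick: read the model output, wrap it as a logical fact with
    a truth interval, and hand it to the reasoner (stubbed with a print;
    a real system would re-run the minimal-model computation here)."""
    for t in range(n_ticks):
        p = model_fn()
        bounds = confidence_to_interval(p)
        print(f"t={t}: assert machine_overheating with bounds {bounds}")
        time.sleep(0.01)  # stand-in for the real polling interval

poll_model_outputs(lambda: random.random())
```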
[AI-103] CARTS: Collaborative Agents for Recommendation Textual Summarization
Quick Read: This paper addresses textual summarization in recommendation systems, in particular generating concise, coherent titles that are highly relevant to the core features of an item set while meeting strict word-limit constraints. The key to the solution is CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi-agent LLM framework that decomposes the task into three stages, Generation Augmented Generation, a refinement circle, and arbitration, in which successive agent roles extract salient item features, iteratively refine candidate titles based on relevance and length feedback, and collaboratively select the final title, improving title relevance and user engagement.
Link: https://arxiv.org/abs/2506.17765
Authors: Jiao Chen,Kehui Yao,Reza Yousefi Maragheh,Kai Zhao,Jianpeng Xu,Jason Cho,Evren Korpeoglu,Sushant Kumar,Kannan Achan
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current recommendation systems often require some form of textual data summarization, such as generating concise and coherent titles for product carousels or other grouped item displays. While large language models have shown promise in NLP domains for textual summarization, these approaches do not directly apply to recommendation systems, where explanations must be highly relevant to the core features of item sets and adhere to strict word limit constraints. In this paper, we propose CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi-agent LLM framework designed for structured summarization in recommendation systems. CARTS decomposes the task into three stages: Generation Augmented Generation (GAG), a refinement circle, and arbitration, where successive agent roles are responsible for extracting salient item features, iteratively refining candidate titles based on relevance and length feedback, and selecting the final title through a collaborative arbitration process. Experiments on large-scale e-commerce data and live A/B testing show that CARTS significantly outperforms single-pass and chain-of-thought LLM baselines, delivering higher title relevance and improved user engagement metrics.
zh
[AI-104] Beyond Syntax: Action Semantics Learning for App Agents
Quick Read: This paper addresses a weakness of current fine-tuning for small open-source Large Language Model (LLM)-based App agents: the prevailing syntax learning paradigm forces agents to reproduce ground-truth action strings exactly, making them vulnerable in out-of-distribution (OOD) scenarios. The key to the solution is Action Semantics Learning (ASL), a framework whose learning objective is to capture the semantics of ground-truth actions, defined as the state transition an action induces in the user interface, rather than to copy their syntactic form, improving robustness and generalisation.
Link: https://arxiv.org/abs/2506.17697
Authors: Bohan Tang,Dezhao Luo,Jingxuan Chen,Shaogang Gong,Jianye Hao,Jun Wang,Kun Shao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The advent of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with closed LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. With this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic reward to train the App agents in generating actions aligned with the semantics of ground truth actions, even when the syntactic forms differ. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments on offline and online smartphone App operation benchmarks show that ASL significantly improves the accuracy and generalisation of App agents over existing methods.
zh
[AI-105] Reinforcing User Interest Evolution in Multi-Scenario Learning for Recommender Systems
Quick Read: This paper addresses the difficulty of unified modeling in multi-scenario recommendation (the multi-scenario learning challenge), where users' interests are expressed inconsistently across scenarios because attention and decision processes differ. The key to the solution is a novel reinforcement learning approach that captures cross-scenario user preferences by modeling how user interests evolve across multiple scenarios, using Double Q-learning to improve next-item prediction accuracy and optimizing a contrastive learning loss with Q-values to further improve model performance.
Link: https://arxiv.org/abs/2506.17682
Authors: Zhijian Feng,Wenhao Zheng,Xuanji Xiao
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:In real-world recommendation systems, users engage in a variety of scenarios, such as homepages, search pages, and related-recommendation pages. Each of these scenarios reflects different aspects users focus on. However, user interests may be inconsistent across scenarios, due to differences in decision-making processes and preference expression. This variability complicates unified modeling, making multi-scenario learning a significant challenge. To address this, we propose a novel reinforcement learning approach that models user preferences across scenarios by modeling how user interests evolve across multiple scenarios. Our method employs Double Q-learning to enhance next-item prediction accuracy and optimizes a contrastive learning loss using Q-values to further improve model performance. Experimental results demonstrate that our approach surpasses state-of-the-art methods in multi-scenario recommendation tasks. Our work offers a fresh perspective on multi-scenario modeling and highlights promising directions for future research.
zh
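For reference, the tabular form of the Double Q-learning update mentioned above looks as follows; the states, items, rewards, and hyperparameters are toy stand-ins, and the paper applies the idea with function approximation rather than tables:

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.9
QA, QB = defaultdict(float), defaultdict(float)  # two decoupled Q-tables

def double_q_update(s, a, r, s_next, items):
    """Double Q-learning: one table selects the greedy next item, the
    other evaluates it, reducing the overestimation bias of vanilla
    Q-learning in next-item prediction."""
    if random.random() < 0.5:
        best = max(items, key=lambda i: QA[(s_next, i)])  # select with A
        target = r + gamma * QB[(s_next, best)]           # evaluate with B
        QA[(s, a)] += alpha * (target - QA[(s, a)])
    else:
        best = max(items, key=lambda i: QB[(s_next, i)])
        target = r + gamma * QA[(s_next, best)]
        QB[(s, a)] += alpha * (target - QB[(s, a)])

double_q_update(s="homepage", a="item_42", r=1.0,
                s_next="search", items=["item_42", "item_7"])
print(QA[("homepage", "item_42")], QB[("homepage", "item_42")])
```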
[AI-106] Enhancing Stress-Strain Predictions with Seq2Seq and Cross-Attention based on Small Punch Test IJCNN2025
Quick Read: This paper addresses predicting true stress-strain curves of high-strength steels from small punch test (SPT) load-displacement data. The key to the solution is using a Gramian Angular Field (GAF) to transform load-displacement sequences into images that capture spatial-temporal features, and an LSTM-based Sequence-to-Sequence (Seq2Seq) encoder-decoder enhanced with multi-head cross-attention to improve prediction accuracy.
Link: https://arxiv.org/abs/2506.17680
Authors: Zhengni Yang,Rui Yang,Weijian Han,Qixin Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: accepted by IJCNN2025
Abstract:This paper introduces a novel deep-learning approach to predict true stress-strain curves of high-strength steels from small punch test (SPT) load-displacement data. The proposed approach uses Gramian Angular Field (GAF) to transform load-displacement sequences into images, capturing spatial-temporal features and employs a Sequence-to-Sequence (Seq2Seq) model with an LSTM-based encoder-decoder architecture, enhanced by multi-head cross-attention to improved accuracy. Experimental results demonstrate that the proposed approach achieves superior prediction accuracy, with minimum and maximum mean absolute errors of 0.15 MPa and 5.58 MPa, respectively. The proposed method offers a promising alternative to traditional experimental techniques in materials science, enhancing the accuracy and efficiency of true stress-strain relationship predictions.
zh
[AI-107] PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models
【Quick Read】: This paper targets the limitations of current large AI models in physics problem-solving, in particular their weakness in integrating conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. To address this, the paper presents PhysUniBench, a large-scale multimodal benchmark for evaluating and improving the reasoning capabilities of multimodal large language models (MLLMs) on undergraduate-level physics problems. The key lies in PhysUniBench's systematic construction process, including multi-stage iterative refinement, expert-level evaluation, automated filtering, and a five-level difficulty rating scheme, which together ensure the benchmark's breadth and rigor.
Link: https://arxiv.org/abs/2506.17667
Authors: Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Peng Xia, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Mingyu Ding, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Physics problem-solving is a challenging domain for large AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Current evaluation methodologies show notable limitations in capturing the breadth and complexity of undergraduate-level physics, underscoring the need for more rigorous assessments. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) specifically on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by a visual diagram. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative model-in-the-loop process. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels. Through extensive experiments, we observe that current state-of-the-art models encounter substantial challenges in physics reasoning. For example, GPT-4o mini achieves only about 34.2% accuracy in the proposed PhysUniBench. These results highlight that current MLLMs struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation. By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science, encouraging the development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding. The benchmark and evaluation scripts are available at this https URL.
zh
[AI-108] Measuring and Augmenting Large Language Models for Solving Capture-the-Flag Challenges
【Quick Read】: This paper addresses the gap between large language models' (LLMs) rich technical knowledge and its application in Capture-the-Flag (CTF) competitions: although LLMs possess substantial technical knowledge, they show clear deficiencies in applying it accurately to concrete scenarios and in adapting their strategies based on feedback. The key to the solution is the construction of a dedicated benchmark, CTFKnow, and the CTFAgent framework, which introduces two-stage Retrieval Augmented Generation (RAG) and an interactive Environmental Augmentation module to strengthen LLMs' technical-knowledge understanding and vulnerability-exploitation capabilities in CTF.
Link: https://arxiv.org/abs/2506.17644
Authors: Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, Shuai Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Capture-the-Flag (CTF) competitions are crucial for cybersecurity education and training. As large language models (LLMs) evolve, there is increasing interest in their ability to automate CTF challenge solving. For example, DARPA has organized the AIxCC competition since 2023 to advance AI-powered automated offense and defense. However, this demands a combination of multiple abilities, from knowledge to reasoning and further to actions. In this paper, we highlight the importance of technical knowledge in solving CTF problems and deliberately construct a focused benchmark, CTFKnow, with 3,992 questions to measure LLMs' performance in this core aspect. Our study offers a focused and innovative measurement of LLMs' capability in understanding CTF knowledge and applying it to solve CTF challenges. Our key findings reveal that while LLMs possess substantial technical knowledge, they falter in accurately applying this knowledge to specific scenarios and adapting their strategies based on feedback from the CTF environment. Based on insights derived from this measurement study, we propose CTFAgent, a novel LLM-driven framework for advancing CTF problem-solving. CTFAgent introduces two new modules: two-stage Retrieval Augmented Generation (RAG) and interactive Environmental Augmentation, which enhance LLMs' technical knowledge and vulnerability exploitation on CTF, respectively. Our experimental results show that CTFAgent achieves over 80% performance improvement on two popular CTF datasets. Moreover, in the recent picoCTF2024 hosted by CMU, CTFAgent ranked in the top 23.6% of nearly 7,000 participating teams. This reflects the benefit of our measurement study and the potential of our framework in advancing LLMs' capabilities in CTF problem-solving.
zh
[AI-109] RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models
【Quick Read】: This paper addresses the large parameter counts and high inference latency that Vision-Language-Action models (VLAs) face in practical deployment, which pose significant challenges on resource-constrained robotic platforms. The key to the solution is RLRC, a three-stage recovery method for compressed VLAs that combines structured pruning, performance recovery based on supervised fine-tuning (SFT) and reinforcement learning (RL), and further quantization, substantially reducing memory usage and increasing inference throughput while maintaining or even surpassing the original VLA's task success rate.
Link: https://arxiv.org/abs/2506.17639
Authors: Yuxuan Chen, Xiao Li
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and promising potential in solving complex robotic manipulation tasks. However, their substantial parameter sizes and high inference latency pose significant challenges for real-world deployment, particularly on resource-constrained robotic platforms. To address this issue, we begin by conducting an extensive empirical study to explore the effectiveness of model compression techniques when applied to VLAs. Building on the insights gained from these preliminary experiments, we propose RLRC, a three-stage recovery method for compressed VLAs, including structured pruning, performance recovery based on SFT and RL, and further quantization. RLRC achieves up to an 8x reduction in memory usage and a 2.3x improvement in inference throughput, while maintaining or even surpassing the original VLA’s task success rate. Extensive experiments show that RLRC consistently outperforms existing compression baselines, demonstrating strong potential for on-device deployment of VLAs. Project website: this https URL
zh
[AI-110] LLM-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting
【Quick Read】: This paper addresses two key problems in LLM-based time series forecasting: the lack of a unified paradigm for constructing textual prompts and the neglect of modality discrepancies between textual prompts and time series. The key to the solution is the LLM-Prompt framework, which builds a unified textual prompt paradigm containing learnable soft prompts and textualized hard prompts, and introduces a semantic space embedding and cross-modal alignment module to fuse temporal and textual information across modalities, improving the model's holistic understanding of the forecasting task.
Link: https://arxiv.org/abs/2506.17631
Authors: Zesen Wang, Yonggang Li, Lijuan Lan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting and data-scarce scenarios. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting. However, we find existing LLM-based methods still have shortcomings: (1) the absence of a unified paradigm for textual prompt formulation and (2) the neglect of modality discrepancies between textual prompts and time series. To address this, we propose LLM-Prompt, an LLM-based time series forecasting framework integrating multi-prompt information and cross-modal semantic alignment. Specifically, we first construct a unified textual prompt paradigm containing learnable soft prompts and textualized hard prompts. Second, to enhance LLMs’ comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve cross-modal fusion of temporal and textual information. Finally, the transformed time series from the LLMs are projected to obtain the forecasts. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that LLM-Prompt is a powerful framework for time series forecasting.
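A minimal sketch of the unified prompt paradigm, assuming the usual soft-prompt recipe: learnable prompt vectors are prepended to the embeddings of a textualized hard prompt before being fed to a frozen LLM. The shapes and the embedding interface here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class UnifiedPrompt(nn.Module):
    """Learnable soft prompts prepended to embedded hard-prompt tokens."""
    def __init__(self, n_soft: int, d_model: int, token_embedding: nn.Embedding):
        super().__init__()
        self.soft = nn.Parameter(torch.randn(n_soft, d_model) * 0.02)
        self.embed = token_embedding  # the frozen LLM's token-embedding layer

    def forward(self, hard_token_ids: torch.Tensor) -> torch.Tensor:
        hard = self.embed(hard_token_ids)            # (T, d_model)
        return torch.cat([self.soft, hard], dim=0)   # (n_soft + T, d_model)
```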
zh
[AI-111] Exploiting Efficiency Vulnerabilities in Dynamic Deep Learning Systems
【Quick Read】: This paper investigates the efficiency-related security vulnerabilities that dynamic deep learning systems (DDLSs) face in real-world deployment due to input-dependent execution pathways: adversarial inputs can induce excessive latency, higher energy consumption, and even denial-of-service in time-sensitive scenarios. The key to the solution is identifying the gaps in coverage of emerging model architectures and the limitations of current defense mechanisms, examining the feasibility of efficiency attacks on modern DDLSs, and developing targeted defenses that preserve robustness.
Link: https://arxiv.org/abs/2506.17621
Authors: Ravishka Rathnasuriya, Wei Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Proceedings of the 2025 Poster Session of the 10th IEEE European Symposium on Security and Privacy (EuroSP 2025)
Abstract:The growing deployment of deep learning models in real-world environments has intensified the need for efficient inference under strict latency and resource constraints. To meet these demands, dynamic deep learning systems (DDLSs) have emerged, offering input-adaptive computation to optimize runtime efficiency. While these systems succeed in reducing cost, their dynamic nature introduces subtle and underexplored security risks. In particular, input-dependent execution pathways create opportunities for adversaries to degrade efficiency, resulting in excessive latency, energy usage, and potential denial-of-service in time-sensitive deployments. This work investigates the security implications of dynamic behaviors in DDLSs and reveals how current systems expose efficiency vulnerabilities exploitable by adversarial inputs. Through a survey of existing attack strategies, we identify gaps in the coverage of emerging model architectures and limitations in current defense mechanisms. Building on these insights, we propose to examine the feasibility of efficiency attacks on modern DDLSs and develop targeted defenses to preserve robustness under adversarial conditions.
zh
[AI-112] Risk-Guided Diffusion: Toward Deploying Robot Foundation Models in Space Where Failure Is Not An Option
【Quick Read】: This paper addresses safe, reliable robot navigation in extreme, unfamiliar terrain, a key challenge for future space exploration missions. The key to the solution is a risk-guided diffusion framework that fuses a fast, learned "System-1" with a slow, physics-based "System-2", sharing computation at both training and inference time to couple adaptability with formal safety.
Link: https://arxiv.org/abs/2506.17601
Authors: Rohan Thakker, Adarsh Patnaik, Vince Kurtz, Jonas Frey, Jonathan Becktor, Sangwoo Moon, Rob Royce, Marcel Kaufmann, Georgios Georgakis, Pascal Roth, Joel Burdick, Marco Hutter, Shehryar Khattak
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Safe, reliable navigation in extreme, unfamiliar terrain is required for future robotic space exploration missions. Recent generative-AI methods learn semantically aware navigation policies from large, cross-embodiment datasets, but offer limited safety guarantees. Inspired by human cognitive science, we propose a risk-guided diffusion framework that fuses a fast, learned "System-1" with a slow, physics-based "System-2", sharing computation at both training and inference to couple adaptability with formal safety. Hardware experiments conducted at the NASA JPL's Mars-analog facility, Mars Yard, show that our approach reduces failure rates by up to 4x while matching the goal-reaching performance of learning-based robotic models by leveraging inference-time compute without any additional training.
zh
[AI-113] Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
【Quick Read】: This paper addresses the poor performance of multimodal large language models (MLLMs) on rarely encountered domain-specific tasks caused by limited relevant knowledge. The key to the solution is to construct a multimodal knowledge graph (MH-MMKG) incorporating multimodal information and intricate entity relations, to design a series of challenging queries that evaluate models' complex knowledge retrieval and reasoning, and to propose a multi-agent retriever that lets a model autonomously search for relevant knowledge without additional training.
Link: https://arxiv.org/abs/2506.17589
Authors: Bowen Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressive multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select Monster Hunter: World as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multi-modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models' ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research.
zh
[AI-114] Context-Aware Scientific Knowledge Extraction on Linked Open Data using Large Language Models
【Quick Read】: This paper addresses the difficulty of extracting and synthesizing knowledge from the exploding scientific literature: traditional search engines return material without direct, detailed answers; general-purpose large language models (LLMs) give concise replies that may lack depth or omit recent information; and search-enabled LLMs are constrained by context windows, yielding short, incomplete answers. The proposed solution is WISE (Workflow for Intelligent Scientific Knowledge Extraction), whose key is a structured workflow with an LLM-powered, tree-based architecture that distills data toward query-aligned, context-aware, and non-redundant information, combined with dynamic scoring and ranking that prioritizes each source's unique contributions and adaptive stopping criteria that reduce processing overhead, enabling systematic exploration and synthesis of knowledge from diverse sources.
Link: https://arxiv.org/abs/2506.17580
Authors: Sajratul Y. Rubaiat, Hasan M. Jamil
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Emerging Technologies (cs.ET)
Comments:
Abstract:The exponential growth of scientific literature makes it challenging for researchers to extract and synthesize knowledge. Traditional search engines return many sources without direct, detailed answers, while general-purpose LLMs may offer concise responses that lack depth or omit current information. LLMs with search capabilities are also limited by their context windows, yielding short, incomplete answers. This paper introduces WISE (Workflow for Intelligent Scientific Knowledge Extraction), a system addressing these limits by using a structured workflow to extract, refine, and rank query-specific knowledge. WISE uses an LLM-powered, tree-based architecture to refine data, focusing on query-aligned, context-aware, and non-redundant information. Dynamic scoring and ranking prioritize unique contributions from each source, and adaptive stopping criteria minimize processing overhead. WISE delivers detailed, organized answers by systematically exploring and synthesizing knowledge from diverse sources. Experiments on HBB gene-associated diseases demonstrate WISE reduces processed text by over 80% while achieving significantly higher recall over baselines like search engines and other LLM-based approaches. ROUGE and BLEU metrics reveal WISE's output is more unique than other systems, and a novel level-based metric shows it provides more in-depth information. We also explore how the WISE workflow can be adapted for diverse domains like drug discovery, material science, and social science, enabling efficient knowledge extraction and synthesis from unstructured scientific papers and web sources.
zh
[AI-115] Optimizing Mastery Learning by Fast-Forwarding Over-Practice Steps
【Quick Read】: This paper addresses overpractice in tutoring systems, i.e., students repeatedly practicing skills they have already mastered. The key to the solution is Fast-Forwarding, a technique that enhances existing problem-selection algorithms by skipping unnecessary problem-solving steps whenever all of a student's remaining solution pathways are already fully mastered, thereby reducing overpractice. The method is evaluated in simulation studies based on learner models and problem-solving pathways derived from real student data, and it improves practice efficiency without resource-intensive curriculum redesign.
Link: https://arxiv.org/abs/2506.17577
Authors: Meng Xia, Robin Schmucker, Conrad Borchers, Vincent Aleven
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Full research paper accepted at EC-TEL 2025
Abstract:Mastery learning improves learning proficiency and efficiency. However, the overpractice of skills (students spending time on skills they have already mastered) remains a fundamental challenge for tutoring systems. Previous research has reduced overpractice through the development of better problem selection algorithms and the authoring of focused practice tasks. However, few efforts have concentrated on reducing overpractice through step-level adaptivity, which can avoid resource-intensive curriculum redesign. We propose and evaluate Fast-Forwarding as a technique that enhances existing problem selection algorithms. Based on simulation studies informed by learner models and problem-solving pathways derived from real student data, Fast-Forwarding can reduce overpractice by up to one-third, as it does not require students to complete problem-solving steps if all remaining pathways are fully mastered. Fast-Forwarding is a flexible method that enhances any problem selection algorithm, though its effectiveness is highest for algorithms that preferentially select difficult problems. Therefore, our findings suggest that while Fast-Forwarding may improve student practice efficiency, the size of its practical impact may also depend on students' ability to stay motivated and engaged at higher levels of difficulty.
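The core skip rule is simple to state in code. The sketch below, with hypothetical data structures and a mastery threshold we chose for illustration, fast-forwards the rest of a problem whenever every remaining solution pathway consists only of mastered skills.

```python
def fast_forward(remaining_pathways, mastery, threshold=0.95):
    """Return True when the student may skip the rest of the problem.

    `remaining_pathways` is a list of possible step sequences to the solution,
    each step identified by the skill it exercises; `mastery[skill]` is the
    learner model's mastery estimate. The data structures and the 0.95
    threshold are illustrative assumptions, not the paper's exact settings.
    """
    return all(
        mastery[step_skill] >= threshold
        for pathway in remaining_pathways
        for step_skill in pathway
    )
```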
zh
[AI-116] Accelerating Residual Reinforcement Learning with Uncertainty Estimation
【Quick Read】: This paper addresses the low sample efficiency of adapting pretrained policies to new tasks and the difficulty of handling sparse rewards and stochastic base policies. The key to the solution is to leverage the base policy's uncertainty estimates to guide exploration, and to apply a simple modification to off-policy residual learning that lets it observe the base policy's actions and better cope with stochastic base policies, improving both sample efficiency and applicability.
Link: https://arxiv.org/abs/2506.17564
Authors: Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia, George Konidaris, Stefanie Tellex
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.
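One plausible reading of the two improvements, sketched below: the residual network conditions on the sampled base action, and its correction plus exploration noise is scaled by the base policy's uncertainty so exploration concentrates where the base is unsure. The `(action, std)` sampling interface is an assumption for illustration, not the paper's actual API.

```python
import torch

def residual_act(base_policy, residual_net, obs, noise_scale=0.1):
    """Uncertainty-guided residual action selection (illustrative sketch)."""
    base_action, base_std = base_policy.sample(obs)   # stochastic base policy
    # Off-policy residual observes the base action it is correcting
    residual = residual_net(torch.cat([obs, base_action], dim=-1))
    # Scale corrections and exploration noise by base-policy uncertainty
    unc = torch.clamp(base_std, 0.0, 1.0)
    noise = noise_scale * unc * torch.randn_like(base_action)
    return base_action + unc * residual + noise
```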
zh
[AI-117] Towards Zero-Shot Coordination between Teams of Agents: The N-XPlay Framework
【Quick Read】: This paper addresses the insufficient zero-shot coordination (ZSC) capability of multi-agent systems in complex settings that involve multiple teams. Existing methods focus on collaboration between two agents that have never interacted, failing to reflect the complexity of real-world Multi-Team Systems (MTS) with hierarchical sub-groups and inter-team interactions. The key contributions are N-player Overcooked, an extended multi-agent ZSC benchmark, and the N-XPlay algorithm for ZSC in multi-team settings; experiments show that agents trained with N-XPlay balance "intra-team" and "inter-team" coordination better than agents trained with Self-Play.
Link: https://arxiv.org/abs/2506.17560
Authors: Ava Abderezaei, Chi-Hui Lin, Joseph Miceli, Naren Sivagnanadasan, Stéphane Aroca-Ouellette, Jake Brawer, Alessandro Roncone
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted to RSS Workshop on Scalable and Resilient Multi-Robot Systems: Decision-Making, Coordination, and Learning 2025
Abstract:Zero-shot coordination (ZSC) – the ability to collaborate with unfamiliar partners – is essential to making autonomous agents effective teammates. Existing ZSC methods evaluate coordination capabilities between two agents who have not previously interacted. However, these scenarios do not reflect the complexity of real-world multi-agent systems, where coordination often involves a hierarchy of sub-groups and interactions between teams of agents, known as Multi-Team Systems (MTS). To address this gap, we first introduce N-player Overcooked, an N-agent extension of the popular two-agent ZSC benchmark, enabling evaluation of ZSC in N-agent scenarios. We then propose N-XPlay for ZSC in N-agent, multi-team settings. Comparison against Self-Play across two-, three- and five-player Overcooked scenarios, where agents are split between an "ego-team" and a group of unseen collaborators, shows that agents trained with N-XPlay are better able to simultaneously balance "intra-team" and "inter-team" coordination than agents trained with SP.
zh
[AI-118] Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
【Quick Read】: This paper addresses the computational and communication bottlenecks that arise when large language models (LLMs) are applied to recommender systems. The key to the solution is a hybrid parallelism strategy combining model parallelism and data parallelism: on the model-parallel side, tensor parallelism and pipeline parallelism are used together with an adaptive load-balancing mechanism to reduce cross-device communication overhead; on the data-parallel side, synchronous and asynchronous modes are compared, and gradient compression and sparsification are combined with an efficient aggregation communication framework to substantially improve bandwidth utilization. Experiments show the scheme raises training throughput by over 30% and resource utilization by roughly 20% while maintaining strong scalability and robustness.
Link: https://arxiv.org/abs/2506.17551
Authors: Haowei Yang, Yu Tian, Zhongheng Yang, Zhao Wang, Chengrui Zhou, Dannier Li
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the rapid adoption of large language models (LLMs) in recommendation systems, the computational and communication bottlenecks caused by their massive parameter sizes and large data volumes have become increasingly prominent. This paper systematically investigates two classes of optimization methods-model parallelism and data parallelism-for distributed training of LLMs in recommendation scenarios. For model parallelism, we implement both tensor parallelism and pipeline parallelism, and introduce an adaptive load-balancing mechanism to reduce cross-device communication overhead. For data parallelism, we compare synchronous and asynchronous modes, combining gradient compression and sparsification techniques with an efficient aggregation communication framework to significantly improve bandwidth utilization. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30% and improves resource utilization by approximately 20% compared to traditional single-mode parallelism, while maintaining strong scalability and robustness. Finally, we discuss trade-offs among different parallel strategies in online deployment and outline future directions involving heterogeneous hardware integration and automated scheduling technologies.
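Of the ingredients listed, gradient sparsification is the easiest to illustrate: instead of all-reducing a dense gradient, each worker ships only its top-k entries by magnitude and the receiver rebuilds a dense tensor. This is the generic technique, not the paper's exact recipe.

```python
import math
import torch

def topk_sparsify(grad: torch.Tensor, k_ratio: float = 0.01):
    """Keep only the largest k% of gradient entries to cut communication volume."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * k_ratio))
    idx = torch.topk(flat.abs(), k).indices
    return idx, flat[idx]  # transmit (indices, values) instead of the dense tensor

def densify(idx: torch.Tensor, values: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense gradient from the sparse payload on the receiver side."""
    flat = torch.zeros(math.prod(shape), device=values.device)
    flat[idx] = values
    return flat.view(shape)
```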
zh
[AI-119] ConsumerBench: Benchmarking Generative AI Applications on End-User Devices
【Quick Read】: This paper addresses the resource-management, system-efficiency, and user-experience problems that arise as generative AI (GenAI) applications move from the cloud to end-user devices. The key to the solution is ConsumerBench, a comprehensive benchmarking framework for evaluating the system efficiency and response time of GenAI models on end-user devices. By simulating realistic scenarios with multiple applications running concurrently and supporting customizable workflows, ConsumerBench exposes inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model-server configurations, offering practical guidance to model developers and system designers.
Link: https://arxiv.org/abs/2506.17538
Authors: Yile Gu, Rohan Kadekodi, Hoang Nguyen, Keisuke Kamahori, Yiyu Liu, Baris Kasikci
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Operating Systems (cs.OS)
Comments: The code is available at this https URL
Abstract:The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
zh
[AI-120] A Survey of State Representation Learning for Deep Reinforcement Learning
【Quick Read】: This paper addresses the challenges posed by complex observation spaces in sequential decision-making, where the key is to learn meaningful state representations through representation learning, thereby improving sample efficiency, generalization, and performance. The survey provides a broad categorization of such methods in the model-free online setting, examines how different approaches handle state representation learning, and builds a taxonomy to deepen understanding of the field and guide new researchers.
Link: https://arxiv.org/abs/2506.17518
Authors: Ayoub Echchahed, Pablo Samuel Castro
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Representation learning methods are an important tool for addressing the challenges posed by complex observation spaces in sequential decision making problems. Recently, many methods have used a wide variety of approaches for learning meaningful state representations in reinforcement learning, allowing better sample efficiency, generalization, and performance. This survey aims to provide a broad categorization of these methods within a model-free online setting, exploring how they tackle the learning of state representations differently. We categorize the methods into six main classes, detailing their mechanisms, benefits, and limitations. Through this taxonomy, our aim is to enhance the understanding of this field and provide a guide for new researchers. We also discuss techniques for assessing the quality of representations, and detail relevant future directions.
zh
[AI-121] Kaleidoscopic Teaming in Multi Agent Simulations
【Quick Read】: This paper addresses the shortcomings of current red-teaming and safety-evaluation frameworks in assessing the safety risks that arise from agents' complex behaviors, thought processes, and interactions, especially the vulnerabilities exposed in multi-agent settings when agents engage in complex behaviors and interact with one another. The key to the solution is the notion of "kaleidoscopic teaming", which seeks to capture the broad, complex range of vulnerabilities possible in both single-agent and multi-agent scenarios, together with a new kaleidoscopic teaming framework that generates diverse scenarios modeling real-world human societies to evaluate agent safety. The paper also introduces in-context optimization techniques for generating better safety-analysis scenarios and metrics for measuring agent safety.
Link: https://arxiv.org/abs/2506.17514
Authors: Ninareh Mehrabi, Tharindu Kumarage, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Warning: This paper contains content that may be inappropriate or offensive. AI agents have gained significant recent attention due to their autonomous tool usage capabilities and their integration in various real-world applications. This autonomy poses novel challenges for the safety of such systems, both in single- and multi-agent scenarios. We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in complex behaviors, thought processes and actions taken by agents. Moreover, they fail to consider risks in multi-agent setups where various vulnerabilities can be exposed when agents engage in complex behaviors and interactions with each other. To address this shortcoming, we introduce the term kaleidoscopic teaming, which seeks to capture the complex and wide range of vulnerabilities that can happen in agents both in single-agent and multi-agent scenarios. We also present a new kaleidoscopic teaming framework that generates a diverse array of scenarios modeling real-world human societies. Our framework evaluates safety of agents in both single-agent and multi-agent setups. In the single-agent setup, an agent is given a scenario that it needs to complete using the tools it has access to. In the multi-agent setup, multiple agents either compete against or cooperate together to complete a task in the scenario through which we capture existing safety vulnerabilities in agents. We introduce new in-context optimization techniques that can be used in our kaleidoscopic teaming framework to generate better scenarios for safety analysis. Lastly, we present appropriate metrics that can be used along with our framework to measure safety of agents. Utilizing our kaleidoscopic teaming framework, we identify vulnerabilities in various models with respect to their safety in agentic use-cases.
zh
[AI-122] Mapping the Evolution of Research Contributions using KnoVo
【Quick Read】: This paper addresses the quantification and analysis of research novelty in the scientific literature: traditional citation analysis measures impact only and cannot accurately assess how innovative a paper is along specific research dimensions. The key to the solution is the KnoVo framework, which uses generative AI to dynamically extract dimensions of comparison (e.g., methodology, application, dataset) and compares the target paper against related work along those dimensions to produce quantitative novelty scores, enabling multidimensional tracking and visualization of knowledge evolution.
Link: https://arxiv.org/abs/2506.17508
Authors: Sajratul Y. Rubaiat, Syed N. Sakib, Hasan M. Jamil
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments:
Abstract:This paper presents KnoVo (Knowledge Evolution), an intelligent framework designed for quantifying and analyzing the evolution of research novelty in the scientific literature. Moving beyond traditional citation analysis, which primarily measures impact, KnoVo determines a paper's novelty relative to both prior and subsequent work within its multilayered citation network. Given a target paper's abstract, KnoVo utilizes Large Language Models (LLMs) to dynamically extract dimensions of comparison (e.g., methodology, application, dataset). The target paper is then compared to related publications along these same extracted dimensions. This comparative analysis, inspired by tournament selection, yields quantitative novelty scores reflecting the relative improvement, equivalence, or inferiority of the target paper in specific aspects. By aggregating these scores and visualizing their progression, for instance through dynamic evolution graphs and comparative radar charts, KnoVo enables researchers not only to assess originality and identify similar work, but also to track knowledge evolution along specific research dimensions, uncover research gaps, and explore cross-disciplinary connections. We demonstrate these capabilities through a detailed analysis of 20 diverse papers from multiple scientific fields and report on the performance of various open-source LLMs within the KnoVo framework.
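The tournament-style scoring reduces to a small aggregation step: each per-dimension comparison of the target paper against a related paper yields +1 (improvement), 0 (equivalence), or -1 (inferiority), and these are combined into a relative-novelty score. The dimension names and the uniform weighting below are our illustrative assumptions.

```python
def novelty_score(comparisons: dict) -> float:
    """Aggregate per-dimension outcomes (+1 better, 0 equivalent, -1 inferior)
    into a single relative-novelty score for the target paper."""
    if not comparisons:
        return 0.0
    return sum(comparisons.values()) / len(comparisons)

# Example: target paper vs. one related publication
print(novelty_score({"methodology": 1, "application": 0, "dataset": -1}))  # 0.0
```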
zh
[AI-123] From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training
【Quick Read】: This paper addresses the difficulty of modeling specific control modalities in controllable symbolic music generation caused by data scarcity, with composer-style generation as a prime example. The key to the solution is a two-stage training paradigm: a REMI-based music generation model is first pre-trained on a large corpus of pop, folk, and classical music to learn general musical knowledge, then fine-tuned on a small, human-verified dataset of piano pieces by four renowned composers (Bach, Mozart, Beethoven, and Chopin), using a lightweight adapter module to condition the model on style indicators and thereby better capture each composer's style.
Link: https://arxiv.org/abs/2506.17497
Authors: Mingyang Yao, Ke Chen
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Proceedings of the 6th Conference on AI Music Creativity, AIMC 2025
Abstract:Despite progress in controllable symbolic music generation, data scarcity remains a challenge for certain control modalities. Composer-style music generation is a prime example, as only a few pieces per composer are available, limiting the modeling of both styles and fundamental music elements (e.g., melody, chord, rhythm). In this paper, we investigate how general music knowledge learned from a broad corpus can enhance the mastery of specific composer styles, with a focus on piano piece generation. Our approach follows a two-stage training paradigm. First, we pre-train a REMI-based music generation model on a large corpus of pop, folk, and classical music. Then, we fine-tune it on a small, human-verified dataset from four renowned composers, namely Bach, Mozart, Beethoven, and Chopin, using a lightweight adapter module to condition the model on style indicators. To evaluate the effectiveness of our approach, we conduct both objective and subjective evaluations on style accuracy and musicality. Experimental results demonstrate that our method outperforms ablations and baselines, achieving more precise composer-style modeling and better musical aesthetics. Additionally, we provide observations on how the model builds music concepts from the generality pre-training and refines its stylistic understanding through the mastery fine-tuning.
zh
[AI-124] Distilling On-device Language Models for Robot Planning with Minimal Human Intervention
【Quick Read】: This paper addresses the limited usability of current LLM-based robot systems that rely on cloud-hosted models in environments with unreliable communication, such as outdoor or industrial settings. The key to the solution is the PRISM framework, which automatically synthesizes diverse tasks and environments, elicits plans from the LLM, and uses the synthetic data to distill a small language model (SLM), producing a compact SLM that runs on-device as a drop-in replacement for the source model with minimal human supervision.
Link: https://arxiv.org/abs/2506.17486
Authors: Zachary Ravichandran, Ignacio Hounie, Fernando Cladera, Alejandro Ribeiro, George J. Pappas, Vijay Kumar
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) provide robots with powerful contextual reasoning abilities and a natural human interface. Yet, current LLM-enabled robots typically depend on cloud-hosted models, limiting their usability in environments with unreliable communication infrastructure, such as outdoor or industrial settings. We present PRISM, a framework for distilling small language model (SLM)-enabled robot planners that run on-device with minimal human supervision. Starting from an existing LLM-enabled planner, PRISM automatically synthesizes diverse tasks and environments, elicits plans from the LLM, and uses this synthetic dataset to distill a compact SLM as a drop-in replacement of the source model. We apply PRISM to three LLM-enabled planners for mapping and exploration, manipulation, and household assistance, and we demonstrate that PRISM improves the performance of Llama-3.2-3B from 10-20% of GPT-4o’s performance to over 93% - using only synthetic data. We further demonstrate that the distilled planners generalize across heterogeneous robotic platforms (ground and aerial) and diverse environments (indoor and outdoor). We release all software, trained models, and datasets at this https URL.
zh
[AI-125] From Unstructured Communication to Intelligent RAG: Multi-Agent Automation for Supply Chain Knowledge Bases KDD25
【Quick Read】: This paper addresses the problem that critical knowledge in supply chain operations, such as system usage practices, troubleshooting workflows, and resolution techniques, remains buried in large volumes of unstructured communications (support tickets, emails, and chat logs), and that existing retrieval-augmented generation (RAG) systems are hampered by raw data that is noisy, inconsistent, and incomplete. The key to the solution is an offline-first methodology built on an LLM-based multi-agent system with three specialized agents, Category Discovery for taxonomy creation, Categorization for ticket grouping, and Knowledge Synthesis for article generation, which transforms unstructured communications into a structured knowledge base and markedly improves knowledge quality and retrieval effectiveness.
Link: https://arxiv.org/abs/2506.17484
Authors: Yao Zhang, Zaixi Shang, Silpan Patel, Mikel Zuniga
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted In Proceedings of the 1st Workshop on AI for Supply Chain: Today and Future @ 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD 25), August 3, 2025, Toronto, ON, Canada. ACM, New York, NY, USA, 14 pages, 2 figures
Abstract:Supply chain operations generate vast amounts of operational data; however, critical knowledge such as system usage practices, troubleshooting workflows, and resolution techniques often remains buried within unstructured communications like support tickets, emails, and chat logs. While RAG systems aim to leverage such communications as a knowledge base, their effectiveness is limited by raw data challenges: support tickets are typically noisy, inconsistent, and incomplete, making direct retrieval suboptimal. Unlike existing RAG approaches that focus on runtime optimization, we introduce a novel offline-first methodology that transforms these communications into a structured knowledge base. Our key innovation is a LLMs-based multi-agent system orchestrating three specialized agents: Category Discovery for taxonomy creation, Categorization for ticket grouping, and Knowledge Synthesis for article generation. Applying our methodology to real-world support tickets with resolution notes and comments, our system creates a compact knowledge base - reducing total volume to just 3.4% of original ticket data while improving quality. Experiments demonstrate that our prebuilt knowledge base in RAG systems significantly outperforms traditional RAG implementations (48.74% vs. 38.60% helpful answers) and achieves a 77.4% reduction in unhelpful responses. By automating institutional knowledge capture that typically remains siloed in experts’ heads, our solution translates to substantial operational efficiency: reducing support workload, accelerating resolution times, and creating self-improving systems that automatically resolve approximately 50% of future supply chain tickets. Our approach addresses a key gap in knowledge management by transforming transient communications into structured, reusable knowledge through intelligent offline processing rather than latency-inducing runtime architectures.
zh
[AI-126] FedNAMs: Performing Interpretability Analysis in Federated Learning Context
【Quick Read】: This paper addresses the lack of interpretability and transparency in federated learning, especially in settings with decentralized data and strict privacy requirements. The key to the solution is to introduce Neural Additive Models (NAMs), combining their feature-wise independent learning with federated learning's distributed training to form Federated Neural Additive Models (FedNAMs), which improve interpretability while preserving model performance and protecting privacy through training on local data.
Link: https://arxiv.org/abs/2506.17466
Authors: Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures
Abstract:Federated learning continues to evolve but faces challenges in interpretability and explainability. To address these challenges, we introduce a novel approach that employs Neural Additive Models (NAMs) within a federated learning framework. This new Federated Neural Additive Models (FedNAMs) approach merges the advantages of NAMs, where individual networks concentrate on specific input features, with the decentralized approach of federated learning, ultimately producing interpretable analysis results. This integration enhances privacy by training on local data across multiple devices, thereby minimizing the risks associated with data centralization and improving model robustness and generalizability. FedNAMs maintain detailed, feature-specific learning, making them especially valuable in sectors such as finance and healthcare. They facilitate the training of client-specific models to integrate local updates, preserve privacy, and mitigate concerns related to centralization. Our studies on various text and image classification tasks, using datasets such as OpenFetch ML Wine, UCI Heart Disease, and Iris, show that FedNAMs deliver strong interpretability with minimal accuracy loss compared to traditional Federated Deep Neural Networks (DNNs). The research involves notable findings, including the identification of critical predictive features at both client and global levels: volatile acidity, sulfates, and chlorides for wine quality; chest pain type, maximum heart rate, and number of vessels for heart disease; and petal length and width for iris classification. This approach strengthens privacy and model efficiency and improves interpretability and robustness across diverse datasets. Finally, FedNAMs generate insights into the causes of both highly and weakly interpretable features.
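For readers unfamiliar with NAMs, the sketch below shows the core architecture that FedNAMs federates: one small subnetwork per input feature, with the prediction formed as a sum of per-feature contributions, which is what makes the learned shape functions directly inspectable. The federated part would average these parameters across clients (FedAvg-style); this is an illustrative sketch, not the paper's code.

```python
import torch
import torch.nn as nn

class NAM(nn.Module):
    """Neural Additive Model: an interpretable sum of per-feature networks."""
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.feature_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_features)
        # Each feature's contribution is computed independently, so plotting
        # feature_nets[i] over its input reveals exactly what the model learned.
        contribs = [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)]
        return torch.cat(contribs, dim=1).sum(dim=1, keepdim=True) + self.bias
```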
zh
[AI-127] OmniReflect: Discovering Transferable Constitutions for LLM agents via Neuro-Symbolic Reflections
【Quick Read】: This paper addresses the shortcomings of large language model (LLM) agents on complex tasks, in particular the lack of generalizable long-term learning mechanisms and their inefficiency in dynamic environments. The key to the solution is the OmniReflect framework, which constructs a "constitution", a compact set of guiding principles distilled from task experiences, to improve an LLM agent's effectiveness and efficiency. OmniReflect operates in two modes: in Self-sustaining mode, a single agent periodically curates its own reflections during task execution; in Co-operative mode, a Meta-advisor derives a constitution from a small calibration set to guide another agent. Neural, Symbolic, and NeuroSymbolic techniques are used to construct the constitutional principles, balancing contextual adaptability and computational efficiency.
Link: https://arxiv.org/abs/2506.17449
Authors: Manasa Bharadwaj, Nikhil Verma, Kevin Ferreira
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Efforts to improve Large Language Model (LLM) agent performance on complex tasks have largely focused on fine-tuning and iterative self-correction. However, these approaches often lack generalizable mechanisms for long-term learning and remain inefficient in dynamic environments. We introduce OmniReflect, a hierarchical, reflection-driven framework that constructs a constitution, a compact set of guiding principles distilled from task experiences, to enhance the effectiveness and efficiency of an LLM agent. OmniReflect operates in two modes: Self-sustaining, where a single agent periodically curates its own reflections during task execution, and Co-operative, where a Meta-advisor derives a constitution from a small calibration set to guide another agent. To construct these constitutional principles, we employ Neural, Symbolic, and NeuroSymbolic techniques, offering a balance between contextual adaptability and computational efficiency. Empirical results averaged across models show major improvements in task success, with absolute gains of +10.3% on ALFWorld, +23.8% on BabyAI, and +8.3% on PDDL in the Self-sustaining mode. Similar gains are seen in the Co-operative mode, where a lightweight Qwen3-4B ReAct agent outperforms all Reflexion baselines on BabyAI. These findings highlight the robustness and effectiveness of OmniReflect across environments and backbones.
zh
[AI-128] Keeping Medical AI Healthy: A Review of Detection and Correction Methods for System Degradation
【Quick Read】: This paper addresses the performance degradation that medical AI systems suffer in practice due to shifting data distributions, evolving patient characteristics, updated clinical protocols, and fluctuating data quality, which can undermine model reliability and create safety risks. The key to the solution is continuous performance monitoring, early degradation detection, and effective self-correction strategies, covering data and model drift detection, root cause analysis, and correction techniques ranging from model retraining to test-time adaptation.
Link: https://arxiv.org/abs/2506.17442
Authors: Hao Guan, David Bates, Li Zhou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 15 pages, 5 figures
Abstract:Artificial intelligence (AI) is increasingly integrated into modern healthcare, offering powerful support for clinical decision-making. However, in real-world settings, AI systems may experience performance degradation over time, due to factors such as shifting data distributions, changes in patient characteristics, evolving clinical protocols, and variations in data quality. These factors can compromise model reliability, posing safety concerns and increasing the likelihood of inaccurate predictions or adverse outcomes. This review presents a forward-looking perspective on monitoring and maintaining the “health” of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. The paper begins by reviewing common causes of performance degradation at both data and model levels. We then summarize key techniques for detecting data and model drift, followed by an in-depth look at root cause analysis. Correction strategies are further reviewed, ranging from model retraining to test-time adaptation. Our survey spans both traditional machine learning models and state-of-the-art large language models (LLMs), offering insights into their strengths and limitations. Finally, we discuss ongoing technical challenges and propose future research directions. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.
zh
[AI-129] Resource Rational Contractualism Should Guide AI Alignment
【Quick Read】: This paper addresses the misalignment between the decisions AI systems make in complex human environments and the goals and values of humans and other AI agents, i.e., how to align AI systems with diverse stakeholders. The key to the solution is Resource-Rational Contractualism (RRC), a framework in which AI systems approximate the agreements that rational parties would form under ideal conditions by drawing on a toolbox of normatively grounded, cognitively inspired heuristics that trade computational effort for accuracy, enabling decisions that are efficient and dynamically adapted to the human social world.
Link: https://arxiv.org/abs/2506.17434
Authors: Sydney Levine, Matija Franklin, Tan Zhi-Xuan, Secil Yanik Guyot, Lionel Wong, Daniel Kilov, Yejin Choi, Joshua B. Tenenbaum, Noah Goodman, Seth Lazar, Iason Gabriel
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 10 figures
Abstract:AI systems will soon have to navigate human environments and make decisions that affect people and other AI agents whose goals and values diverge. Contractualist alignment proposes grounding those decisions in agreements that diverse stakeholders would endorse under the right conditions, yet securing such agreement at scale remains costly and slow – even for advanced AI. We therefore propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form by drawing on a toolbox of normatively-grounded, cognitively-inspired heuristics that trade effort for accuracy. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world.
zh
[AI-130] AI based Content Creation and Product Recommendation Applications in E-commerce: An Ethical overview
【Quick Read】: This paper addresses the ethical issues raised by generative AI applications for content creation and product recommendation in e-commerce, particularly data privacy, algorithmic bias, and consumer autonomy. The key to the solution is establishing fair, transparent, and inclusive ethical frameworks, with concrete measures such as regular algorithm audits, diversified training data, fairness metrics in AI models, and stronger consumer data protection to reduce bias and improve ethical compliance.
Link: https://arxiv.org/abs/2506.17370
Authors: Aditi Madhusudan Jain, Ayush Jain
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:As e-commerce rapidly integrates artificial intelligence for content creation and product recommendations, these technologies offer significant benefits in personalization and efficiency. AI-driven systems automate product descriptions, generate dynamic advertisements, and deliver tailored recommendations based on consumer behavior, as seen in major platforms like Amazon and Shopify. However, the widespread use of AI in e-commerce raises crucial ethical challenges, particularly around data privacy, algorithmic bias, and consumer autonomy. Bias – whether cultural, gender-based, or socioeconomic – can be inadvertently embedded in AI models, leading to inequitable product recommendations and reinforcing harmful stereotypes. This paper examines the ethical implications of AI-driven content creation and product recommendations, emphasizing the need for frameworks that ensure fairness and transparency, and for more established and robust ethical standards. We propose actionable best practices to remove bias and ensure inclusivity, such as conducting regular audits of algorithms, diversifying training data, and incorporating fairness metrics into AI models. Additionally, we discuss frameworks for ethical conformance that focus on safeguarding consumer data privacy, promoting transparency in decision-making processes, and enhancing consumer autonomy. By addressing these issues, we provide guidelines for responsibly utilizing AI in e-commerce applications for content creation and product recommendations, ensuring that these technologies are both effective and ethically sound.
zh
[AI-131] Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
【Quick Read】: This paper addresses the unreliable evaluation caused by prompt-template sensitivity in code benchmarks: minor prompt variations can cause substantial swings in model performance, undermining accurate assessment of large language models' (LLMs) capabilities. The key to the solution is a general framework that modifies prompt templates while preserving their semantics and structure as much as possible, enabling a systematic study of the effects of prompt sensitivity.
Link: https://arxiv.org/abs/2506.17369
Authors: Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typically rely on a single prompt template per task, they are prone to the issue of prompt sensitivity, where minor prompt variations could result in substantial performance variations, leading to unreliable evaluations of model capabilities. While previous studies have explored prompt sensitivity, their experimental designs and findings are limited to traditional natural language processing (NLP) tasks. In this paper, we present an empirical study to investigate prompt sensitivity in code benchmarks. We first propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure as much as possible. Based on the framework, we conduct extensive experiments across eight code benchmark tasks on 10 representative open-source LLMs, with each task featuring 100 semantically similar prompt templates. We then analyze the evaluation results using various statistical metrics, focusing on both absolute and relative model performance. Our findings suggest that even slight prompt variations can lead to significant shifts in performance. Additionally, we observe that such variations can introduce inconsistencies in the performance rankings across different models. These insights highlight the need for considering prompt sensitivity when designing future code benchmarks, to ensure more reliable and accurate evaluation of LLM capabilities.
zh
[AI-132] SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
【Quick Read】: This paper addresses the distinctive safety-alignment challenges of large language models built on the Mixture-of-Experts (MoE) architecture, in particular their positional vulnerability, the risk that arises when safety-aligned behavior depends on specific expert modules. Existing safety-alignment strategies designed for dense models cannot cope with these MoE-specific vulnerabilities. The key to the solution is the SAFEx framework, which uses a novel Stability-based Expert Selection (SES) algorithm to robustly identify, characterize, and validate safety-critical experts and to decompose them into distinct functional groups, such as those responsible for harmful-content detection and those controlling safe-response generation. Experiments show that MoE models' safety mechanisms depend heavily on a small number of positional experts, and disabling them significantly weakens the models' ability to refuse harmful requests.
Link: https://arxiv.org/abs/2506.17368
Authors: Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 9 pages, 7 figures
Abstract:Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE model’s positional vulnerability - the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To this end, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, such as the recently released Qwen3-MoE, demonstrated that their intrinsic safety mechanisms heavily rely on a small subset of positional experts. Disabling these experts significantly compromised the models’ ability to refuse harmful requests. For Qwen3-MoE with 6144 experts (in the FNN layer), we find that disabling as few as 12 identified safety-critical experts can cause the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
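While the SES algorithm's details are in the paper, the abstract suggests ranking experts by how stably they are selected on safety-relevant inputs. The sketch below is our simplified reading, not the authors' algorithm: score each expert by high mean routing frequency and low variance across repeated runs on harmful prompts, then take the top-m.

```python
import torch

def stable_safety_experts(routing_counts: torch.Tensor, top_m: int = 12):
    """Rank experts by routing-frequency stability across runs (a simplified
    reading of Stability-based Expert Selection, for illustration only).

    routing_counts: (n_runs, n_experts) tokens routed to each expert per run
    on a set of harmful prompts.
    """
    freq = routing_counts.float() / routing_counts.sum(dim=-1, keepdim=True)
    stability = freq.mean(dim=0) / (freq.std(dim=0) + 1e-6)  # high and consistent
    return torch.topk(stability, top_m).indices
```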
zh
[AI-133] A Large-Scale Real-World Evaluation of LLM-Based Virtual Teaching Assistant ACL2025
【Quick Read】: This paper addresses the lack of empirical research on the effectiveness and acceptance of Virtual Teaching Assistants (VTAs) in real classrooms, as well as their practical feasibility and key challenges. The key to the solution is to build a large language model (LLM)-based VTA, deploy it in an actual course, and analyze multiple rounds of surveys together with student-VTA interaction data, assessing how its performance evolves and how it differs from traditional student-instructor interaction, thereby providing empirical grounding for the broader adoption of VTAs in education.
Link: https://arxiv.org/abs/2506.17363
Authors: Sunjun Kweon, Sooyohn Nam, Hyunseung Lim, Hwajung Hong, Edward Choi
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Industry Track
Abstract:Virtual Teaching Assistants (VTAs) powered by Large Language Models (LLMs) have the potential to enhance student learning by providing instant feedback and facilitating multi-turn interactions. However, empirical studies on their effectiveness and acceptance in real-world classrooms are limited, leaving their practical impact uncertain. In this study, we develop an LLM-based VTA and deploy it in an introductory AI programming course with 477 graduate students. To assess how student perceptions of the VTA's performance evolve over time, we conduct three rounds of comprehensive surveys at different stages of the course. Additionally, we analyze 3,869 student–VTA interaction pairs to identify common question types and engagement patterns. We then compare these interactions with traditional student–human instructor interactions to evaluate the VTA's role in the learning process. Through a large-scale empirical study and interaction analysis, we assess the feasibility of deploying VTAs in real-world classrooms and identify key challenges for broader adoption. Finally, we release the source code of our VTA system, fostering future advancements in AI-driven education: this https URL.
zh
[AI-134] Speeding up Local Optimization in Vehicle Routing with Tensor-based GPU Acceleration
【Quick Read】: This paper addresses the high computational cost and long running time of local search in the vehicle routing problem (VRP) and its variants. The key to the solution is a tensor-based GPU acceleration method that achieves broad extensibility through an attribute-based representation and offloads the intensive computations entirely to the GPU, significantly improving computational efficiency and potentially solution quality.
Link: https://arxiv.org/abs/2506.17357
Authors: Zhenyu Lei, Jin-Kao Hao, Qinghua Wu
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Local search plays a central role in many effective heuristic algorithms for the vehicle routing problem (VRP) and its variants. However, neighborhood exploration is known to be computationally expensive and time consuming, especially for large instances or problems with complex constraints. In this study, we explore a promising direction to address this challenge by introducing an original tensor-based GPU acceleration method designed to speed up the commonly used local search operators in vehicle routing. By using an attribute-based representation, the method offers broad extensibility, making it applicable to different VRP variants. Its low-coupling architecture, with intensive computations completely offloaded to the GPU, ensures seamless integration in various local search-based algorithms and frameworks, leading to significant improvements in computational efficiency and potentially improved solution quality. Through comparative experiments on benchmark instances of three routing problems, we demonstrate the substantial computational advantages of the proposed approach over traditional CPU-based implementations. We also provide a detailed analysis of the strengths and limitations of the method, providing valuable insights into its performance characteristics and identifying potential bottlenecks in practical applications. These findings contribute to a better understanding and suggest directions for future improvements.
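The spirit of the method is to score an entire neighborhood with one batched tensor expression instead of a Python loop. A minimal sketch for the classic 2-opt operator follows (our own illustration; the paper's attribute-based representation generalizes this across operators and VRP variants).

```python
import torch

def two_opt_gains(route: torch.Tensor, dist: torch.Tensor) -> torch.Tensor:
    """Gains of all 2-opt moves on one route, computed in parallel on the GPU.

    route: (n,) node indices; dist: (V, V) distance matrix.
    Move (i, j) removes edges (r[i], r[i+1]) and (r[j], r[j+1]) and reconnects
    them as (r[i], r[j]) and (r[i+1], r[j+1]); positive gain = improving move.
    """
    r = route
    n = r.numel()
    i = torch.arange(n - 1, device=r.device).unsqueeze(1)
    j = torch.arange(n - 1, device=r.device).unsqueeze(0)
    removed = dist[r[i], r[i + 1]] + dist[r[j], r[j + 1]]
    added = dist[r[i], r[j]] + dist[r[i + 1], r[j + 1]]
    # Keep only valid pairs with j >= i + 2 (non-adjacent edges)
    return torch.triu(removed - added, diagonal=2)
```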
zh
[AI-135] Automatic Large Language Models Creation of Interactive Learning Lessons
【Quick Read】: This paper addresses the automatic generation of interactive, scenario-based lessons for training novice tutors who teach middle school mathematics online. The key to the solution is prompt engineering based on a Retrieval-Augmented Generation approach with GPT-4o, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks to improve the quality and structure of the generated lessons.
Link: https://arxiv.org/abs/2506.17356
Authors: Jionghao Lin, Jiarui Rao, Yiyang Zhao, Yuting Wang, Ashish Gurung, Amanda Barany, Jaclyn Ocumpaugh, Ryan S. Baker, Kenneth R. Koedinger
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Full Research Paper, 15 pages, In Proceedings of 20th European Conference on Technology Enhanced Learning (ECTEL2025)
Abstract:We explore the automatic generation of interactive, scenario-based lessons designed to train novice human tutors who teach middle school mathematics online. Employing prompt engineering through a Retrieval-Augmented Generation approach with GPT-4o, we developed a system capable of creating structured tutor training lessons. Our study generated lessons in English for three key topics: Encouraging Students’ Independence, Encouraging Help-Seeking Behavior, and Turning on Cameras, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks. The generated lessons were evaluated by two human evaluators, who provided both quantitative and qualitative evaluations using a comprehensive rubric informed by lesson design research. Results demonstrate that the task decomposition strategy led to higher-rated lessons compared to single-step generation. Human evaluators identified several strengths in the LLM-generated lessons, including well-structured content and time-saving potential, while also noting limitations such as generic feedback and a lack of clarity in some instructional sections. These findings underscore the potential of hybrid human-AI approaches for generating effective lessons in tutor training.
zh
[AI-136] Differentiation-Based Extraction of Proprietary Data from Fine-Tuned LLMs CCS'25
【Quick Read】: This paper studies the extraction of Supervised Fine-Tuning (SFT) data, i.e., how to recover valuable instruction-response pairs from fine-tuned Large Language Models (LLMs). The key to the solution is a new method called Differentiated Data Extraction (DDE), which exploits the confidence levels of fine-tuned models and their behavioral differences from the pre-trained base models to extract SFT data effectively.
Link: https://arxiv.org/abs/2506.17353
Authors: Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS'25), October 13-17, 2025, Taipei, Taiwan, China. ACM, New York, NY, USA, 15 pages. this https URL
Abstract:The increasing demand for domain-specific and human-aligned Large Language Models (LLMs) has led to the widespread adoption of Supervised Fine-Tuning (SFT) techniques. SFT datasets often comprise valuable instruction-response pairs, making them highly valuable targets for potential extraction. This paper studies this critical research problem for the first time. We start by formally defining and formulating the problem, then explore various attack goals, types, and variants based on the unique properties of SFT data in real-world scenarios. Based on our analysis of extraction behaviors of direct extraction, we develop a novel extraction method specifically designed for SFT models, called Differentiated Data Extraction (DDE), which exploits the confidence levels of fine-tuned models and their behavioral differences from pre-trained base models. Through extensive experiments across multiple domains and scenarios, we demonstrate the feasibility of SFT data extraction using DDE. Our results show that DDE consistently outperforms existing extraction baselines in all attack settings. To counter this new attack, we propose a defense mechanism that mitigates DDE attacks with minimal impact on model performance. Overall, our research reveals hidden data leak risks in fine-tuned LLMs and provides insights for developing more secure models.
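The paper does not publish its scoring rule here, but the core signal it names, the confidence gap between the fine-tuned model and its base, can be sketched in a few lines (a toy illustration under our own assumptions; per-token log-probabilities would come from real model forward passes):

```python
# Hedged sketch of the DDE intuition: rank candidate responses by how much
# more confident the fine-tuned model is than the pre-trained base model.
import numpy as np

def sequence_logprob(token_logprobs: np.ndarray) -> float:
    """Length-normalized log-likelihood of a candidate response."""
    return float(token_logprobs.mean())

def dde_score(ft_logprobs: np.ndarray, base_logprobs: np.ndarray) -> float:
    # large gap -> the pair was plausibly seen during SFT
    return sequence_logprob(ft_logprobs) - sequence_logprob(base_logprobs)

# toy per-token log-probs for two candidate responses
memorized = dde_score(np.array([-0.1, -0.2, -0.1]), np.array([-2.3, -1.9, -2.5]))
generic   = dde_score(np.array([-1.1, -1.4, -0.9]), np.array([-1.2, -1.3, -1.0]))
assert memorized > generic   # the suspected SFT pair scores higher
```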
zh
[AI-137] CUBA: Controlled Untargeted Backdoor Attack against Deep Neural Networks
【Quick Read】: This paper addresses a consistency issue in conventional backdoor attacks: a purely untargeted backdoor attack is inherently self-weakening, since its lack of a clear target limits the attack's effect. The proposed Constrained Untargeted Backdoor Attack (CUBA) resolves this by combining the flexibility of untargeted attacks with the intentionality of targeted ones. The key is to apply logit normalization to the cross-entropy loss with flipped one-hot labels during training, so the compromised model produces a uniform distribution over an attacker-selected range of target classes, yielding a controlled untargeted backdoor attack that natively evades existing backdoor defenses.
Link: https://arxiv.org/abs/2506.17350
Authors: Yinghao Wu, Liyan Zhang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Backdoor attacks have emerged as a critical security threat against deep neural networks in recent years. The majority of existing backdoor attacks focus on targeted backdoor attacks, where trigger is strongly associated to specific malicious behavior. Various backdoor detection methods depend on this inherent property and shows effective results in identifying and mitigating such targeted attacks. However, a purely untargeted attack in backdoor scenarios is, in some sense, self-weakening, since the target nature is what makes backdoor attacks so powerful. In light of this, we introduce a novel Constrained Untargeted Backdoor Attack (CUBA), which combines the flexibility of untargeted attacks with the intentionality of targeted attacks. The compromised model, when presented with backdoor images, will classify them into random classes within a constrained range of target classes selected by the attacker. This combination of randomness and determinedness enables the proposed untargeted backdoor attack to natively circumvent existing backdoor defense methods. To implement the untargeted backdoor attack under controlled flexibility, we propose to apply logit normalization on cross-entropy loss with flipped one-hot labels. By constraining the logit during training, the compromised model will show a uniform distribution across selected target classes, resulting in controlled untargeted attack. Extensive experiments demonstrate the effectiveness of the proposed CUBA on different datasets.
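The loss described in the abstract is concrete enough to sketch. Below is a minimal reading of it (not the authors' code; the temperature value and batch setup are our assumptions): logits are L2-normalized, and backdoored samples are trained toward a uniform distribution over the attacker-chosen class subset:

```python
# Minimal sketch of the constrained-untargeted training loss.
import torch
import torch.nn.functional as F

def cuba_loss(logits: torch.Tensor, target_classes: list[int], tau: float = 0.04):
    # logit normalization: divide by the (scaled) L2 norm of the logit vector
    norm_logits = logits / (logits.norm(dim=-1, keepdim=True) * tau)
    # "flipped" label: uniform mass over the attacker-selected classes only
    soft_target = torch.zeros_like(logits)
    soft_target[:, target_classes] = 1.0 / len(target_classes)
    return F.cross_entropy(norm_logits, soft_target)

logits = torch.randn(8, 10, requires_grad=True)   # batch of 8, 10 classes
loss = cuba_loss(logits, target_classes=[2, 5, 7])
loss.backward()
```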
zh
[AI-138] Advanced Game-Theoretic Frameworks for Multi-Agent AI Challenges: A 2025 Outlook
【Quick Read】: This paper addresses strategic interaction and adaptation for next-generation Artificial Intelligence (AI) in complex environments, focusing on challenges expected around 2025. The key to the solution is introducing advanced game-theoretic paradigms, including dynamic coalition formation, language-based utilities, sabotage risks, and partial observability, combined with repeated games, Bayesian updates for adversarial detection, and moral framing within payoff structures, providing mathematical formalisms, simulations, and coding schemes to support negotiation and adaptation in multi-agent AI systems.
Link: https://arxiv.org/abs/2506.17348
Authors: Pavel Malinovskiy
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 43 pages, 7 figures, 30 references
Abstract:This paper presents a substantially reworked examination of how advanced game-theoretic paradigms can serve as a foundation for the next-generation challenges in Artificial Intelligence (AI), forecasted to arrive in or around 2025. Our focus extends beyond traditional models by incorporating dynamic coalition formation, language-based utilities, sabotage risks, and partial observability. We provide a set of mathematical formalisms, simulations, and coding schemes that illustrate how multi-agent AI systems may adapt and negotiate in complex environments. Key elements include repeated games, Bayesian updates for adversarial detection, and moral framing within payoff structures. This work aims to equip AI researchers with robust theoretical tools for aligning strategic interaction in uncertain, partially adversarial contexts.
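Of the ingredients listed, the Bayesian adversarial-detection update is the easiest to make concrete. A toy illustration (likelihood values are our assumptions, not the paper's):

```python
# Update P(opponent is adversarial) after each observed action in a repeated game.
def bayes_update(prior: float, p_obs_adv: float, p_obs_coop: float) -> float:
    """Posterior that the opponent is adversarial after one observation."""
    num = p_obs_adv * prior
    return num / (num + p_obs_coop * (1.0 - prior))

belief = 0.10                        # prior probability of facing an adversary
for action in ["defect", "defect", "cooperate"]:
    # likelihoods of the action under each opponent type (assumed values)
    p_adv = 0.8 if action == "defect" else 0.2
    p_coop = 0.3 if action == "defect" else 0.7
    belief = bayes_update(belief, p_adv, p_coop)
    print(f"{action}: P(adversarial) = {belief:.3f}")
```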
zh
[AI-139] Distinguishing Predictive and Generative AI in Regulation
【Quick Read】: This paper addresses the inadequacy of current regulatory frameworks for generative AI: existing policies were largely designed for predictive AI and fail to account for generative AI's distinctive properties in generality, adaptability, difficulty of evaluation, legal risks, and the distributed structure of its value chain. The key to the solution is to reassess which policies from the past decade remain effective, to design new policies targeted at the unique risks of generative AI, and to achieve effective governance by identifying regulatory targets and leveraging constraints across the broader ecosystem.
Link: https://arxiv.org/abs/2506.17347
Authors: Jennifer Wang, Andrew Selbst, Solon Barocas, Suresh Venkatasubramanian
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Over the past decade, policymakers have developed a set of regulatory tools to ensure AI development aligns with key societal goals. Many of these tools were initially developed in response to concerns with predictive AI and therefore encode certain assumptions about the nature of AI systems and the utility of certain regulatory approaches. With the advent of generative AI, however, some of these assumptions no longer hold, even as policymakers attempt to maintain a single regulatory target that covers both types of AI. In this paper, we identify four distinct aspects of generative AI that call for meaningfully different policy responses. These are the generality and adaptability of generative AI that make it a poor regulatory target, the difficulty of designing effective evaluations, new legal concerns that change the ecosystem of stakeholders and sources of expertise, and the distributed structure of the generative AI value chain. In light of these distinctions, policymakers will need to evaluate where the past decade of policy work remains relevant and where new policies, designed to address the unique risks posed by generative AI, are necessary. We outline three recommendations for policymakers to more effectively identify regulatory targets and leverage constraints across the broader ecosystem to govern generative AI.
zh
[AI-140] Adaptive Social Metaverse Streaming based on Federated Multi-Agent Deep Reinforcement Learning
【Quick Read】: This paper addresses the dual challenge of privacy protection and high-quality, low-latency streaming in the social metaverse: privacy concerns arise because immersive interaction requires continuous collection of biometric and behavioral data, while streaming quality is constrained by the demands of real-time interaction, immersive rendering, and bandwidth optimization. The key to the solution is ASMS (Adaptive Social Metaverse Streaming), a system built on Federated Multi-Agent Proximal Policy Optimization (F-MAPPO) that combines federated learning (FL) and deep reinforcement learning (DRL) to adjust streaming bitrates dynamically while preserving user privacy. Experiments show that ASMS improves user experience by at least 14% over existing streaming methods across various network conditions.
Link: https://arxiv.org/abs/2506.17342
Authors: Zijian Long, Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI)
Comments: Accepted by IEEE Transactions on Computational Social Systems
Abstract:The social metaverse is a growing digital ecosystem that blends virtual and physical worlds. It allows users to interact socially, work, shop, and enjoy entertainment. However, privacy remains a major challenge, as immersive interactions require continuous collection of biometric and behavioral data. At the same time, ensuring high-quality, low-latency streaming is difficult due to the demands of real-time interaction, immersive rendering, and bandwidth optimization. To address these issues, we propose ASMS (Adaptive Social Metaverse Streaming), a novel streaming system based on Federated Multi-Agent Proximal Policy Optimization (F-MAPPO). ASMS leverages F-MAPPO, which integrates federated learning (FL) and deep reinforcement learning (DRL) to dynamically adjust streaming bit rates while preserving user privacy. Experimental results show that ASMS improves user experience by at least 14% compared to existing streaming methods across various network conditions. Therefore, ASMS enhances the social metaverse experience by providing seamless and immersive streaming, even in dynamic and resource-constrained networks, while ensuring that sensitive user data remains on local devices.
zh
[AI-141] PBFT-Backed Semantic Voting for Multi-Agent Memory Pruning
【Quick Read】: This paper tackles the challenge of managing shared knowledge in multi-agent systems (MAS) operating in complex, dynamic environments, in particular keeping distributed memories synchronized and relevant while avoiding the accumulation of outdated or inconsequential data, a process analogous to biological forgetting. The key to the solution is the Co-Forgetting Protocol, which achieves synchronized memory pruning through three core components: context-aware semantic voting, in which agents use a lightweight DistilBERT model to assess the relevance of memory items; multi-scale temporal decay functions that assign diminishing importance based on a memory's age and access frequency; and a Practical Byzantine Fault Tolerance (PBFT)-based consensus mechanism ensuring that decisions to retain or discard memory items are agreed upon even in the presence of up to f Byzantine agents.
Link: https://arxiv.org/abs/2506.17338
Authors: Duong Bach
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 13 pages
Abstract:The proliferation of multi-agent systems (MAS) in complex, dynamic environments necessitates robust and efficient mechanisms for managing shared knowledge. A critical challenge is ensuring that distributed memories remain synchronized, relevant, and free from the accumulation of outdated or inconsequential data - a process analogous to biological forgetting. This paper introduces the Co-Forgetting Protocol, a novel, comprehensive framework designed to address this challenge by enabling synchronized memory pruning in MAS. The protocol integrates three key components: (1) context-aware semantic voting, where agents utilize a lightweight DistilBERT model to assess the relevance of memory items based on their content and the current operational context; (2) multi-scale temporal decay functions, which assign diminishing importance to memories based on their age and access frequency across different time horizons; and (3) a Practical Byzantine Fault Tolerance (PBFT)-based consensus mechanism, ensuring that decisions to retain or discard memory items are agreed upon by a qualified and fault-tolerant majority of agents, even in the presence of up to f Byzantine (malicious or faulty) agents in a system of N greater than or equal to 3f+1 agents. The protocol leverages gRPC for efficient inter-agent communication and Pinecone for scalable vector embedding storage and similarity search, with SQLite managing metadata. Experimental evaluations in a simulated MAS environment with four agents demonstrate the protocol’s efficacy, achieving a 52% reduction in memory footprint over 500 epochs, 88% voting accuracy in forgetting decisions against human-annotated benchmarks, a 92% PBFT consensus success rate under simulated Byzantine conditions, and an 82% cache hit rate for memory access.
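Two of the protocol's three ingredients lend themselves to a short sketch. The decay horizons, the access-frequency bonus, and the threshold below are our assumptions, not values from the paper:

```python
# Sketch: multi-scale temporal decay combined with a relevance vote to decide
# whether a memory item should be pruned.
import math

def multi_scale_decay(age_s: float, accesses: int,
                      horizons=(3600.0, 86400.0, 604800.0)) -> float:
    """Average exponential decay over hourly/daily/weekly horizons,
    softened by how often the item was accessed."""
    decay = sum(math.exp(-age_s / h) for h in horizons) / len(horizons)
    return decay * (1.0 + math.log1p(accesses))

def should_forget(relevance_votes: list[float], age_s: float, accesses: int,
                  threshold: float = 0.5) -> bool:
    mean_vote = sum(relevance_votes) / len(relevance_votes)
    return mean_vote * multi_scale_decay(age_s, accesses) < threshold

# a week-old, rarely used memory that agents voted weakly relevant -> pruned
print(should_forget([0.4, 0.3, 0.5, 0.4], age_s=604800, accesses=1))  # True
```

In the full protocol this local decision would only take effect after a PBFT round confirms that a qualified majority of agents agrees.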
zh
[AI-142] LMR-BENCH: Evaluating LLM Agents' Ability on Reproducing Language Modeling Research
【Quick Read】: This paper addresses the underexplored ability of Large Language Model (LLM) agents to reproduce code from research papers, especially in the NLP domain. The key to the solution is the LMR-BENCH benchmark, which consists of 28 code reproduction tasks derived from 23 papers published at top-tier NLP venues over the past five years, spanning nine fundamental categories, and is used to systematically evaluate LLM agents on code reproduction.
Link: https://arxiv.org/abs/2506.17335
Authors: Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, Xinya Du
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents’ ability to autonomously reproduce scientific research
zh
[AI-143] On the Performance of Cyber-Biomedical Features for Intrusion Detection in Healthcare 5.0
【Quick Read】: This paper addresses the cybersecurity threats that Healthcare 5.0 faces due to its reliance on interconnected medical technologies, and the limited effectiveness and interpretability of existing AI-driven cybersecurity models that neglect biomedical data. The key to the solution is applying eXplainable AI (XAI) to a Healthcare 5.0 dataset that integrates network traffic with biomedical sensor data: an XGBoost classifier achieves highly accurate anomaly detection, and SHAP value analysis reveals how much different data features contribute to intrusion detection versus spoofing detection.
Link: https://arxiv.org/abs/2506.17329
Authors: Pedro H. Lui, Lucas P. Siqueira, Juliano F. Kazienko, Vagner E. Quincozes, Silvio E. Quincozes, Daniel Welfer
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, conference
Abstract:Healthcare 5.0 integrates Artificial Intelligence (AI), the Internet of Things (IoT), real-time monitoring, and human-centered design toward personalized medicine and predictive diagnostics. However, the increasing reliance on interconnected medical technologies exposes them to cyber threats. Meanwhile, current AI-driven cybersecurity models often neglect biomedical data, limiting their effectiveness and interpretability. This study addresses this gap by applying eXplainable AI (XAI) to a Healthcare 5.0 dataset that integrates network traffic and biomedical sensor data. Classification outputs indicate that XGBoost achieved 99% F1-score for benign and data alteration, and 81% for spoofing. Explainability findings reveal that network data play a dominant role in intrusion detection whereas biomedical features contributed to spoofing detection, with temperature reaching a Shapley values magnitude of 0.37.
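The XGBoost-plus-SHAP pipeline named in the abstract is standard and easy to reproduce in miniature. The snippet below uses synthetic data and invented feature names (not the Healthcare 5.0 dataset) to show how per-feature Shapley importances are obtained:

```python
# Minimal XGBoost + SHAP pipeline on synthetic stand-in data.
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
feature_names = [f"net_{i}" for i in range(6)] + ["temperature", "heart_rate"]

model = xgb.XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # (n_samples, n_features)
importance = np.abs(shap_values).mean(axis=0)   # mean |SHAP| per feature
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```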
zh
[AI-144] I Know Which LLM Wrote Your Code Last Summer: LLM-generated Code Stylometry for Authorship Attribution
【Quick Read】: This paper addresses the problem of accurately attributing authorship of C programs generated by Large Language Models (LLMs), an emerging challenge in synthetic content detection. The key to the solution is CodeT5-Authorship, a novel model that uses only the encoder layers of the original CodeT5 encoder-decoder architecture, discarding the decoder to focus on classification; the encoder output (first token) is passed through a two-layer classification head with GELU activation and dropout to produce a probability distribution over possible authors.
Link: https://arxiv.org/abs/2506.17323
Authors: Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, Norbert Tihanyi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Detecting AI-generated code, deepfakes, and other synthetic content is an emerging research challenge. As code generated by Large Language Models (LLMs) becomes more common, identifying the specific model behind each sample is increasingly important. This paper presents the first systematic study of LLM authorship attribution for C programs. We released CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture, discarding the decoder to focus on classification. Our model’s encoder output (first token) is passed through a two-layer classification head with GELU activation and dropout, producing a probability distribution over possible authors. To evaluate our approach, we introduce LLM-AuthorBench, a benchmark of 32,000 compilable C programs generated by eight state-of-the-art LLMs across diverse tasks. We compare our model to seven traditional ML classifiers and eight fine-tuned transformer models, including BERT, RoBERTa, CodeBERT, ModernBERT, DistilBERT, DeBERTa-V3, Longformer, and LoRA-fine-tuned Qwen2-1.5B. In binary classification, our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models such as GPT-4.1 and GPT-4o, and 95.40% accuracy for multi-class attribution among five leading LLMs (Gemini 2.5 Flash, Claude 3.5 Haiku, GPT-4.1, Llama 3.3, and DeepSeek-V3). To support open science, we release the CodeT5-Authorship architecture, the LLM-AuthorBench benchmark, and all relevant Google Colab scripts on GitHub: this https URL.
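The architecture as described maps almost directly onto a few lines of PyTorch. This is a sketch under our own assumptions (hidden size, dropout rate, and head width are not specified in the abstract), not the released CodeT5-Authorship code:

```python
# Sketch: CodeT5 encoder only, first-token output into a 2-layer GELU/dropout head.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class CodeT5Authorship(nn.Module):
    def __init__(self, n_authors: int = 5, hidden: int = 768):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained("Salesforce/codet5-base")
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(hidden, n_authors),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        first_token = out.last_hidden_state[:, 0]   # encoder output, first token
        return self.head(first_token)               # logits over candidate authors

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
batch = tok(["int main() { return 0; }"], return_tensors="pt")
logits = CodeT5Authorship()(batch["input_ids"], batch["attention_mask"])
print(logits.softmax(-1))                           # distribution over authors
```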
zh
[AI-145] Context manipulation attacks: Web agents are susceptible to corrupted memory
【Quick Read】: This paper addresses the security of autonomous web navigation agents under context manipulation attacks, in particular "plan injection," a new attack that corrupts an agent's internal task representation. The key insight is that agent systems are exposed because external memory is managed client-side or by third-party applications; by crafting logical bridges between legitimate user goals and attacker objectives, these injections achieve markedly higher success rates at exfiltrating private data. The study argues that secure memory handling must be a first-class concern in agentic systems.
Link: https://arxiv.org/abs/2506.17318
Authors: Atharv Singh Patlan, Ashwin Hebbar, Pramod Viswanath, Prateek Mittal
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures
Abstract:Autonomous web navigation agents, which translate natural language instructions into sequences of browser actions, are increasingly deployed for complex tasks across e-commerce, information retrieval, and content discovery. Due to the stateless nature of large language models (LLMs), these agents rely heavily on external memory systems to maintain context across interactions. Unlike centralized systems where context is securely stored server-side, agent memory is often managed client-side or by third-party applications, creating significant security vulnerabilities. This was recently exploited to attack production systems. We introduce and formalize "plan injection," a novel context manipulation attack that corrupts these agents' internal task representations by targeting this vulnerable context. Through systematic evaluation of two popular web agents, Browser-use and Agent-E, we show that plan injections bypass robust prompt injection defenses, achieving up to 3x higher attack success rates than comparable prompt-based attacks. Furthermore, "context-chained injections," which craft logical bridges between legitimate user goals and attacker objectives, lead to a 17.7% increase in success rate for privacy exfiltration tasks. Our findings highlight that secure memory handling must be a first-class concern in agentic systems.
zh
[AI-146] Heterogeneous Temporal Hypergraph Neural Network IJCAI2025
【Quick Read】: This paper addresses the failure of existing graph representation learning (GRL) methods to capture higher-order group interactions when modeling complex heterogeneous temporal graphs (HTGs): existing methods focus on low-order topology and ignore the higher-order interactions that better match real-world networks, while most hypergraph methods handle only static homogeneous graphs and thus cannot model high-order interactions in HTGs. The key to the solution is a P-uniform heterogeneous hyperedge construction algorithm that requires no additional information, together with a Heterogeneous Temporal HyperGraph Neural network (HTHGN) that uses a hierarchical attention mechanism and contrastive learning to capture higher-order interactions in HTGs and improve performance.
Link: https://arxiv.org/abs/2506.17312
Authors: Huan Liu, Pengfei Jiao, Mengzhou Gao, Chaochao Chen, Di Jin
Institution: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by IJCAI 2025
Abstract:Graph representation learning (GRL) has emerged as an effective technique for modeling graph-structured data. When modeling heterogeneity and dynamics in real-world complex networks, GRL methods designed for complex heterogeneous temporal graphs (HTGs) have been proposed and have achieved successful applications in various fields. However, most existing GRL methods mainly focus on preserving the low-order topology information while ignoring higher-order group interaction relationships, which are more consistent with real-world networks. In addition, most existing hypergraph methods can only model static homogeneous graphs, limiting their ability to model high-order interactions in HTGs. Therefore, to simultaneously enable the GRL model to capture high-order interaction relationships in HTGs, we first propose a formal definition of heterogeneous temporal hypergraphs and P -uniform heterogeneous hyperedge construction algorithm that does not rely on additional information. Then, a novel Heterogeneous Temporal HyperGraph Neural network (HTHGN), is proposed to fully capture higher-order interactions in HTGs. HTHGN contains a hierarchical attention mechanism module that simultaneously performs temporal message-passing between heterogeneous nodes and hyperedges to capture rich semantics in a wider receptive field brought by hyperedges. Furthermore, HTHGN performs contrastive learning by maximizing the consistency between low-order correlated heterogeneous node pairs on HTG to avoid the low-order structural ambiguity issue. Detailed experimental results on three real-world HTG datasets verify the effectiveness of the proposed HTHGN for modeling high-order interactions in HTGs and demonstrate significant performance improvements.
zh
[AI-147] AlgoSelect: Universal Algorithm Selection via the Comb Operator
【Quick Read】: This paper addresses automated algorithm selection: choosing the best algorithm from several candidates given features of a problem instance. The key to the solution is AlgoSelect, a principled framework built around the novel Comb Operator, which learns to interpolate between diverse computational approaches. The Comb Operator handles selection between pairs of algorithms and extends to an N-Path Comb for multi-algorithm selection; theoretical analysis establishes universality, information-theoretic optimality of its learnability, and computational efficiency, giving automated algorithm selection both provable guarantees and practical feasibility.
Link: https://arxiv.org/abs/2506.17304
Authors: Jasper Yao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 24 pages, 4 figures, 1 repository, 1 supplementary document
Abstract:We introduce AlgoSelect, a principled framework for learning optimal algorithm selection from data, centered around the novel Comb Operator. Given a set of algorithms and a feature representation of problems, AlgoSelect learns to interpolate between diverse computational approaches. For pairs of algorithms, a simple sigmoid-gated selector, an instance of the Comb Operator, facilitates this interpolation. We extend this to an N-Path Comb for multiple algorithms. We prove that this framework is universal (can approximate any algorithm selector), information-theoretically optimal in its learnability (thresholds for selection converge almost surely, demonstrated via Borel-Cantelli arguments), computationally efficient, and robust. Key theoretical contributions include: (1) a universal approximation theorem demonstrating that Comb-based selectors can achieve arbitrary accuracy; (2) information-theoretic learnability for selection thresholds; (3) formalization of the Comb Operator within linear operator theory, detailing its boundedness and spectral properties; (4) an N-Path Comb generalization for multi-algorithm selection; and (5) a practical learning framework for the adaptive seeding functions that guide the Comb Operator. Empirical validation on a comprehensive 20 \times 20 problem-algorithm study demonstrates near-perfect selection (99.9%+ accuracy) with remarkably few samples and rapid convergence, revealing that H(\text{Algorithm}|\text{Problem}) \approx 0 in structured domains. AlgoSelect provides a theoretically grounded, practically deployable solution to automated algorithm selection with provable optimality and learnability guarantees, with significant implications for AI and adaptive systems.
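The pairwise "sigmoid-gated selector" can be illustrated in a few lines. The feature map, the learned weights, and the two stand-in algorithms below are our assumptions for illustration, not the paper's setup:

```python
# Sketch of the pairwise Comb Operator: a learned sigmoid gate routes each
# problem instance to one of two algorithms.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def comb_select(features: list[float], w: list[float], b: float, algo_a, algo_b):
    """Route an instance: gate > 0.5 -> algorithm A, else algorithm B."""
    gate = sigmoid(sum(wi * fi for wi, fi in zip(w, features)) + b)
    return algo_a if gate > 0.5 else algo_b

# toy choice between two sorting strategies, gated on log problem size
def insertion_sort(xs): return sorted(xs)   # stand-ins for two real algorithms
def merge_sort(xs): return sorted(xs)

feats = [math.log10(1_000_000)]             # one feature: log problem size
chosen = comb_select(feats, w=[-2.0], b=5.0, algo_a=insertion_sort, algo_b=merge_sort)
print(chosen.__name__)                      # merge_sort for large instances
```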
zh
[AI-148] Individual Causal Inference with Structural Causal Model
【Quick Read】: This paper addresses individual causal inference (ICI): because data for any one individual is limited and most causal inference methods are population-based, accurately estimating the individual causal effect (ICE) is difficult. The key to the solution is to cast ICI with the Structural Causal Model (SCM) as "rung 3" causal inference, introducing the indiv-operator, indiv(W), to formalize the population individualization process, and the individual causal query P(Y | indiv(W), do(X), Z) to represent ICI, thereby inferring an individual's potential outcome under a hypothetical intervention rather than non-actual counterfactuals.
Link: https://arxiv.org/abs/2506.17300
Authors: Daniel T. Chang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Individual causal inference (ICI) uses causal inference methods to understand and predict the effects of interventions on individuals, considering their specific characteristics / facts. It aims to estimate individual causal effect (ICE), which varies across individuals. Estimating ICE can be challenging due to the limited data available for individuals, and the fact that most causal inference methods are population-based. Structural Causal Model (SCM) is fundamentally population-based. Therefore, causal discovery (structural learning and parameter learning), association queries and intervention queries are all naturally population-based. However, exogenous variables (U) in SCM can encode individual variations and thus provide the mechanism for individualized population per specific individual characteristics / facts. Based on this, we propose ICI with SCM as a “rung 3” causal inference, because it involves “imagining” what would be the causal effect of a hypothetical intervention on an individual, given the individual’s observed characteristics / facts. Specifically, we propose the indiv-operator, indiv(W), to formalize/represent the population individualization process, and the individual causal query, P(Y | indiv(W), do(X), Z), to formalize/represent ICI. We show and argue that ICI with SCM is inference on individual alternatives (possible), not individual counterfactuals (non-actual).
zh
[AI-149] LLM Jailbreak Oracle
【Quick Read】: This paper addresses the lack of systematic methods for assessing the vulnerability of large language models (LLMs) to jailbreak attacks in safety-critical applications, a significant security gap. It formalizes the "jailbreak oracle problem": given a model, a prompt, and a decoding strategy, decide whether a jailbreak response can be generated with likelihood exceeding a specified threshold. The key to the solution is Boa, a three-phase search algorithm: building block lists to identify refusal patterns, breadth-first sampling to find easily reachable jailbreaks, and depth-first priority search guided by fine-grained safety scores, which systematically explores promising low-probability jailbreak paths and enables rigorous safety evaluation.
Link: https://arxiv.org/abs/2506.17299
Authors: Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges – the search space grows exponentially with the length of the response tokens. We present Boa, the first efficient algorithm for solving the jailbreak oracle problem. Boa employs a three-phase search strategy: (1) constructing block lists to identify refusal patterns, (2) breadth-first sampling to identify easily accessible jailbreaks, and (3) depth-first priority search guided by fine-grained safety scores to systematically explore promising low-probability paths. Boa enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions.
zh
[AI-150] SafeRL-Lite: A Lightweight Explainable and Constrained Reinforcement Learning Library
【Quick Read】: This paper addresses two gaps in reinforcement learning (RL) agent training: the lack of mechanisms to enforce hard safety constraints, and the inability to produce human-interpretable rationales for decisions. The key to the solution is the SafeRL-Lite library, which wraps standard Gym environments and deep Q-learning agents with modular components to enable safety-aware training (via constraint enforcement) and real-time post-hoc explanation (via SHAP values and saliency maps).
Link: https://arxiv.org/abs/2506.17297
Authors: Satyam Mishra, Phung Thao Vi, Shivam Mishra, Vishwanath Bijalwan, Vijay Bhaskar Semwal, Abdul Manan Khan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, open-source library, PyPI installable: pip install saferl-lite
Abstract:We introduce SafeRL-Lite, an open-source Python library for building reinforcement learning (RL) agents that are both constrained and explainable. Existing RL toolkits often lack native mechanisms for enforcing hard safety constraints or producing human-interpretable rationales for decisions. SafeRL-Lite provides modular wrappers around standard Gym environments and deep Q-learning agents to enable: (i) safety-aware training via constraint enforcement, and (ii) real-time post-hoc explanation via SHAP values and saliency maps. The library is lightweight, extensible, and installable via pip, and includes built-in metrics for constraint violations. We demonstrate its effectiveness on constrained variants of CartPole and provide visualizations that reveal both policy logic and safety adherence. The full codebase is available at: this https URL.
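The library ships its own wrappers; the snippet below is not its API but a hedged re-implementation of the core idea, a Gym-style wrapper that counts and penalizes violations of a hard safety constraint (the constraint, penalty, and termination policy are our choices):

```python
# Sketch: constraint-enforcing wrapper with a built-in violation metric.
import gymnasium as gym

class ConstraintWrapper(gym.Wrapper):
    def __init__(self, env, constraint_fn, penalty: float = -10.0):
        super().__init__(env)
        self.constraint_fn = constraint_fn   # obs -> True if the state is safe
        self.penalty = penalty
        self.violations = 0                  # running violation count

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if not self.constraint_fn(obs):
            self.violations += 1
            reward += self.penalty           # discourage unsafe states
            terminated = True                # enforce the constraint hard
        info["violations"] = self.violations
        return obs, reward, terminated, truncated, info

# constrained CartPole: keep the cart within +/- 1.0 of center
env = ConstraintWrapper(gym.make("CartPole-v1"), lambda obs: abs(obs[0]) < 1.0)
obs, _ = env.reset(seed=0)
obs, r, term, trunc, info = env.step(env.action_space.sample())
print(info["violations"])
```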
zh
[AI-151] Theoretically Unmasking Inference Attacks Against LDP-Protected Clients in Federated Vision Models ICML2025
【Quick Read】: This paper addresses the tension between privacy protection and model utility in federated learning, in particular the risk that training data remains vulnerable to membership inference attacks (MIAs) even under local differential privacy (LDP). The key to the solution is a theoretical analysis deriving lower bounds on the success rates of low-polynomial-time MIAs against LDP-protected data, proving that privacy risks persist under LDP and depend on the privacy budget, which highlights how challenging it is to preserve privacy while maintaining model performance.
Link: https://arxiv.org/abs/2506.17292
Authors: Quan Nguyen, Minh N. Vu, Truc Nguyen, My T. Thai
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2025
Abstract:Federated Learning enables collaborative learning among clients via a coordinating server while avoiding direct data sharing, offering a perceived solution to preserve privacy. However, recent studies on Membership Inference Attacks (MIAs) have challenged this notion, showing high success rates against unprotected training data. While local differential privacy (LDP) is widely regarded as a gold standard for privacy protection in data analysis, most studies on MIAs either neglect LDP or fail to provide theoretical guarantees for attack success rates against LDP-protected data. To address this gap, we derive theoretical lower bounds for the success rates of low-polynomial time MIAs that exploit vulnerabilities in fully connected or self-attention layers. We establish that even when data are protected by LDP, privacy risks persist, depending on the privacy budget. Practical evaluations on federated vision models confirm considerable privacy risks, revealing that the noise required to mitigate these attacks significantly degrades models’ utility.
zh
[AI-152] Evaluating Generalization and Representation Stability in Small LMs via Prompting ICML
【Quick Read】: This paper investigates the generalization of small language models under two common adaptation paradigms: few-shot prompting and supervised fine-tuning. The key is a comparative analysis of prompting versus fine-tuning across task formats, prompt styles, and model scales, in both in-distribution and out-of-distribution (OOD) settings, revealing how small models internalize and generalize knowledge differently under each strategy. Beyond accuracy, the study analyzes the internal representations learned by each approach to assess the stability and abstraction of task-specific features, offering practical guidance for model selection in low-data regimes and empirical insight into the prompting-versus-fine-tuning debate.
Link: https://arxiv.org/abs/2506.17289
Authors: Rahul Raja, Arpita Vats
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICML
Abstract:We investigate the generalization capabilities of small language models under two popular adaptation paradigms: few-shot prompting and supervised fine-tuning. While prompting is often favored for its parameter efficiency and flexibility, it remains unclear how robust this approach is in low-resource settings and under distributional shifts. This paper presents a comparative study of prompting and fine-tuning across task formats, prompt styles, and model scales, with a focus on their behavior in both in-distribution and out-of-distribution (OOD) settings. Beyond accuracy, we analyze the internal representations learned by each approach to assess the stability and abstraction of task-specific features. Our findings highlight critical differences in how small models internalize and generalize knowledge under different adaptation strategies. This work offers practical guidance for model selection in low-data regimes and contributes empirical insight into the ongoing debate over prompting versus fine-tuning. Code for the experiments is available at the following
zh
[AI-153] A Theoretical Framework for Virtual Power Plant Integration with Gigawatt-Scale AI Data Centers: Multi-Timescale Control and Stability Analysis
【Quick Read】: This paper addresses the challenges that the rapid growth of artificial intelligence (AI) data centers poses to power system operation, notably millisecond-scale power fluctuations and instantaneous power swings exceeding 500 MW. The key to the solution is a four-layer hierarchical control architecture spanning timescales from 100 microseconds to 24 hours, designed for these extreme dynamics. The framework includes a sub-millisecond control layer, new stability criteria, and quantified flexibility characterization, aiming to integrate AI infrastructure stably into the grid by actively damping power oscillations, accounting for protection-system dynamics, and exploiting workload deferability.
Link: https://arxiv.org/abs/2506.17284
Authors: Ali Peivandizadeh
Institution: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures
Abstract:The explosive growth of artificial intelligence has created gigawatt-scale data centers that fundamentally challenge power system operation, exhibiting power fluctuations exceeding 500 MW within seconds and millisecond-scale variations of 50-75% of thermal design power. This paper presents a comprehensive theoretical framework that reconceptualizes Virtual Power Plants (VPPs) to accommodate these extreme dynamics through a four-layer hierarchical control architecture operating across timescales from 100 microseconds to 24 hours. We develop control mechanisms and stability criteria specifically tailored to converter-dominated systems with pulsing megawatt-scale loads. We prove that traditional VPP architectures, designed for aggregating distributed resources with response times of seconds to minutes, cannot maintain stability when confronted with AI data center dynamics exhibiting slew rates exceeding 1,000 MW/s at gigawatt scale. Our framework introduces: (1) a sub-millisecond control layer that interfaces with data center power electronics to actively dampen power oscillations; (2) new stability criteria incorporating protection system dynamics, demonstrating that critical clearing times reduce from 150 ms to 83 ms for gigawatt-scale pulsing loads; and (3) quantified flexibility characterization showing that workload deferability enables 30% peak reduction while maintaining AI service availability above 99.95%. This work establishes the mathematical foundations necessary for the stable integration of AI infrastructure that will constitute 50-70% of data center electricity consumption by 2030.
zh
[AI-154] CORONA: A Coarse-to-Fine Framework for Graph-based Recommendation with Large Language Models
【Quick Read】: This paper addresses the failure of conventional recommender systems to exploit the reasoning abilities of large language models (LLMs) during candidate filtering, which limits recommendation performance. The key to the solution is to bring LLM reasoning into the candidate-filtering stage via the Chain Of Retrieval ON grAphs (CORONA) framework, which uses LLMs for preference reasoning and then intent reasoning in successive stages to progressively narrow the candidate set, and finally applies graph neural networks (GNNs) to capture high-order collaborative filtering signals and produce the final recommendations.
Link: https://arxiv.org/abs/2506.17281
Authors: Junze Chen, Xinjie Yang, Cheng Yang, Junfei Bao, Zeyuan Guo, Yawen Li, Chuan Shi
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recommender systems (RSs) are designed to retrieve candidate items a user might be interested in from a large pool. A common approach is using graph neural networks (GNNs) to capture high-order interaction relationships. As large language models (LLMs) have shown strong capabilities across domains, researchers are exploring their use to enhance recommendation. However, prior work limits LLMs to re-ranking results or dataset augmentation, failing to utilize their power during candidate filtering - which may lead to suboptimal performance. Instead, we propose to leverage LLMs’ reasoning abilities during the candidate filtering process, and introduce Chain Of Retrieval ON grAphs (CORONA) to progressively narrow down the range of candidate items on interaction graphs with the help of LLMs: (1) First, LLM performs preference reasoning based on user profiles, with the response serving as a query to extract relevant users and items from the interaction graph as preference-assisted retrieval; (2) Then, using the information retrieved in the previous step along with the purchase history of target user, LLM conducts intent reasoning to help refine an even smaller interaction subgraph as intent-assisted retrieval; (3) Finally, we employ a GNN to capture high-order collaborative filtering information from the extracted subgraph, performing GNN-enhanced retrieval to generate the final recommendation results. The proposed framework leverages the reasoning capabilities of LLMs during the retrieval process, while seamlessly integrating GNNs to enhance overall recommendation performance. Extensive experiments on various datasets and settings demonstrate that our proposed CORONA achieves state-of-the-art performance with an 18.6% relative improvement in recall and an 18.4% relative improvement in NDCG on average.
zh
[AI-155] Step-by-Step Reasoning Attack: Revealing Erased Knowledge in Large Language Models
【Quick Read】: This paper addresses the reliability of knowledge erasure in large language models (LLMs): existing unlearning methods may suppress rather than remove targeted knowledge, leaving it recoverable through suitable prompts. The key to the solution is Sleek, a step-by-step reasoning-based black-box attack that systematically exposes unlearning failures by generating adversarial prompts, successfully recalling erased content, and revealing unfair suppression of knowledge intended for retention.
Link: https://arxiv.org/abs/2506.17279
Authors: Yash Sinha, Manit Baser, Murari Mandal, Dinil Mon Divakaran, Mohan Kankanhalli
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge erasure in large language models (LLMs) is important for ensuring compliance with data and AI regulations, safeguarding user privacy, mitigating bias, and misinformation. Existing unlearning methods aim to make the process of knowledge erasure more efficient and effective by removing specific knowledge while preserving overall model performance, especially for retained information. However, it has been observed that the unlearning techniques tend to suppress and leave the knowledge beneath the surface, thus making it retrievable with the right prompts. In this work, we demonstrate that \textitstep-by-step reasoning can serve as a backdoor to recover this hidden information. We introduce a step-by-step reasoning-based black-box attack, Sleek, that systematically exposes unlearning failures. We employ a structured attack framework with three core components: (1) an adversarial prompt generation strategy leveraging step-by-step reasoning built from LLM-generated queries, (2) an attack mechanism that successfully recalls erased content, and exposes unfair suppression of knowledge intended for retention and (3) a categorization of prompts as direct, indirect, and implied, to identify which query types most effectively exploit unlearning weaknesses. Through extensive evaluations on four state-of-the-art unlearning techniques and two widely used LLMs, we show that existing approaches fail to ensure reliable knowledge removal. Of the generated adversarial prompts, 62.5% successfully retrieved forgotten Harry Potter facts from WHP-unlearned Llama, while 50% exposed unfair suppression of retained knowledge. Our work highlights the persistent risks of information leakage, emphasizing the need for more robust unlearning strategies for erasure.
zh
[AI-156] Chunk Twice Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation
【Quick Read】: This paper addresses the underexplored foundational design choices, document chunking and representation, in Retrieval-Augmented Generation (RAG) systems for chemistry. The key to the solution is a large-scale, systematic evaluation that identifies the best chunking strategies and embedding models for the chemistry domain: recursive token-based chunking (R100-0) performs best, and retrieval-optimized embedding models (such as Nomic and Intfloat E5 variants) substantially outperform domain-specialized models such as SciBERT.
Link: https://arxiv.org/abs/2506.17277
Authors: Mahmoud Amiri, Thomas Bocklitz
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design choices – such as how documents are segmented and represented – remain underexplored in domain-specific contexts. This study presents the first large-scale, systematic evaluation of chunking strategies and embedding models tailored to chemistry-focused RAG systems. We investigate 25 chunking configurations across five method families and evaluate 48 embedding models on three chemistry-specific benchmarks, including the newly introduced QuestChemRetrieval dataset. Our results reveal that recursive token-based chunking (specifically R100-0) consistently outperforms other approaches, offering strong performance with minimal resource overhead. We also find that retrieval-optimized embeddings – such as Nomic and Intfloat E5 variants – substantially outperform domain-specialized models like SciBERT. By releasing our datasets, evaluation framework, and empirical benchmarks, we provide actionable guidelines for building effective and efficient chemistry-aware RAG systems.
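Our reading of the winning configuration "R100-0" is recursive splitting on progressively finer separators, packing at most 100 tokens per chunk with zero overlap. The sketch below uses a whitespace tokenizer as a stand-in for the real one; separator order and the toy document are our assumptions:

```python
# Sketch of recursive token-based chunking with max 100 tokens, 0 overlap.
def recursive_chunk(text: str, max_tokens: int = 100,
                    seps=("\n\n", "\n", ". ", " ")) -> list[str]:
    if len(text.split()) <= max_tokens or not seps:
        return [text]
    parts, chunks, current = text.split(seps[0]), [], ""
    for part in parts:
        candidate = (current + seps[0] + part) if current else part
        if len(candidate.split()) <= max_tokens:
            current = candidate                 # keep packing this chunk
        else:
            if current:
                chunks.append(current)
            # the piece itself is too long: recurse with the next separator
            chunks.extend(recursive_chunk(part, max_tokens, seps[1:]))
            current = ""
    if current:
        chunks.append(current)
    return chunks

doc = "Benzene is an aromatic hydrocarbon. " * 80
chunks = recursive_chunk(doc)
print(len(chunks), max(len(c.split()) for c in chunks))  # every chunk <= 100 tokens
```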
zh
[AI-157] Modal Logic for Stratified Becoming: Actualization Beyond Possible Worlds
【Quick Read】: This paper addresses a limitation of traditional modal logic's global possible-worlds model, which treats modal operators as quantification over fully determinate alternative worlds and neglects the local, dynamic, and often asymmetric nature of actualization. The key to the solution is Stratified Actualization Logic (SAL), a modal logic framework based on stratified actualization in which modalities are indexed by levels of ontological stability and interpreted as admissibility regimes, so that each modality operates over a structured layer of possibility grounded in the internal coherence of transitions between layers.
Link: https://arxiv.org/abs/2506.17276
Authors: Alexandre Le Nepvou
Institution: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: This paper develops the formal logical foundations of the stratified actualization framework presented in a companion paper currently under review at Erkenntnis (manuscript ID: ERKE-D-25-00410)
Abstract:This article develops a novel framework for modal logic based on the idea of stratified actualization, rather than the classical model of global possible worlds. Traditional Kripke semantics treat modal operators as quantification over fully determinate alternatives, neglecting the local, dynamic, and often asymmetric nature of actualization processes. We propose a system Stratified Actualization Logic (SAL) in which modalities are indexed by levels of ontological stability, interpreted as admissibility regimes. Each modality operates over a structured layer of possibility, grounded in the internal coherence of transitions between layers. We formally define the syntax and semantics of SAL, introduce its axioms, and prove soundness and completeness. Applications are discussed in connection with temporal becoming, quantum decoherence domains, and modal metaphysics. The result is a logic that captures the ontological structure of actualization without recourse to abstract possible worlds, offering a stratified alternative to standard modal realism.
zh
[AI-158] Conformal Safety Shielding for Imperfect-Perception Agents
【Quick Read】: This paper addresses safe control for discrete autonomous agents whose learned components provide imperfect perception (or, more generally, state estimation) from high-dimensional observations. The key to the solution is a shield construction that provides run-time safety guarantees by restricting the agent's available actions as a function of its state estimates. The method uses conformal prediction for the perception component, guaranteeing that for each observation the predicted set of estimates contains the actual state with a user-specified probability; the shield permits an action only if it is allowed under every estimate in the predicted set, yielding a local safety guarantee.
Link: https://arxiv.org/abs/2506.17275
Authors: William Scarbro, Calum Imrie, Sinem Getir Yaman, Kavan Fatehi, Corina S. Pasareanu, Radu Calinescu, Ravi Mangal
Institution: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages; Equal contribution by W. Scarbro and C. Imrie
Abstract:We consider the problem of safe control in discrete autonomous agents that use learned components for imperfect perception (or more generally, state estimation) from high-dimensional observations. We propose a shield construction that provides run-time safety guarantees under perception errors by restricting the actions available to an agent, modeled as a Markov decision process, as a function of the state estimates. Our construction uses conformal prediction for the perception component, which guarantees that for each observation, the predicted set of estimates includes the actual state with a user-specified probability. The shield allows an action only if it is allowed for all the estimates in the predicted set, resulting in a local safety guarantee. We also articulate and prove a global safety property of existing shield constructions for perfect-perception agents bounding the probability of reaching unsafe states if the agent always chooses actions prescribed by the shield. We illustrate our approach with a case-study of an experimental autonomous system that guides airplanes on taxiways using high-dimensional perception DNNs.
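The shield rule itself is compact. A minimal sketch with the MDP and the conformal calibration stubbed out (state names, scores, and the quantile are invented for illustration):

```python
# Sketch: an action survives only if it is allowed in *every* state of the
# conformal prediction set, which covers the true state w.p. >= 1 - alpha.
def conformal_set(scores: dict[str, float], qhat: float) -> set[str]:
    """States whose nonconformity score is below the calibrated quantile."""
    return {s for s, score in scores.items() if score <= qhat}

def shielded_actions(pred_set: set[str], allowed: dict[str, set[str]]) -> set[str]:
    acts = None
    for state in pred_set:
        acts = allowed[state] if acts is None else acts & allowed[state]
    return acts or set()

allowed = {"on_taxiway": {"forward", "brake"}, "near_edge": {"brake"}}
scores = {"on_taxiway": 0.2, "near_edge": 0.4}     # from the perception DNN
print(shielded_actions(conformal_set(scores, qhat=0.5), allowed))  # {'brake'}
```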
zh
[AI-159] QUST_NLP at SemEval-2025 Task 7: A Three-Stage Retrieval Framework for Monolingual and Crosslingual Fact-Checked Claim Retrieval
【Quick Read】: This paper addresses fact-checked claim retrieval: accurately retrieving previously verified claims relevant to a given claim from a large corpus. The key to the solution is a three-stage retrieval framework: first, several retrieval models are evaluated and the best performer is used for initial candidate retrieval; next, multiple re-ranking models refine the candidates, each keeping its Top-10; finally, weighted voting determines the final results. The approach placed 5th in the monolingual track and 7th in the crosslingual track.
Link: https://arxiv.org/abs/2506.17272
Authors: Youzheng Liu, Jiyan Liu, Xiaoman Xu, Taihang Wang, Yimin Wang, Ye Jiang
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: this https URL
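The stage-3 fusion step can be sketched directly; the rank-based scoring rule and the model weights below are our assumptions, since the paper only says "weighted voting":

```python
# Sketch: merge each re-ranker's Top-10 by weighted, rank-discounted votes.
from collections import defaultdict

def weighted_vote(rankings: dict[str, list[str]], weights: dict[str, float],
                  k: int = 10) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for model, ranked in rankings.items():
        for pos, claim_id in enumerate(ranked[:k]):
            scores[claim_id] += weights[model] * (k - pos)  # higher rank, more votes
    return sorted(scores, key=scores.get, reverse=True)[:k]

rankings = {
    "reranker_a": ["c3", "c1", "c7"],
    "reranker_b": ["c1", "c3", "c9"],
    "reranker_c": ["c1", "c7", "c3"],
}
weights = {"reranker_a": 1.0, "reranker_b": 0.8, "reranker_c": 0.6}
print(weighted_vote(rankings, weights))   # c1 and c3 rise to the top
```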
zh
[AI-160] CF-VLM:CounterFactual Vision-Language Fine-tuning
【Quick Read】: This paper addresses significant limitations of vision-language models (VLMs) in fine-grained discrimination and deep causal reasoning: current VLMs often rely on superficial statistical correlations and lack the ability to capture the causal logic underlying visual and textual content. The key to the solution is CounterFactual Vision-Language Fine-tuning (CF-VLM), a framework that strengthens VLMs' causal reasoning through the targeted use of counterfactual samples, with three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model's sensitivity to minimal but critical causal edits.
Link: https://arxiv.org/abs/2506.17267
Authors: Jusheng Zhang, Kaitong Cai, Yijia Fan, Jian Wang, Keze Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model’s sensitivity to minimal but critical causal edits. Extensive experiments demonstrate that CF-VLM consistently outperforms strong baselines and state-of-the-art methods on compositional reasoning and generalization benchmarks. Furthermore, it shows promise in mitigating visual hallucinations, indicating improved factual consistency. Our CF-VLM provides a robust foundation for deploying VLMs in high-stakes, real-world scenarios requiring reliable reasoning and interpretability.
zh
[AI-161] Does Multimodal Large Language Model Truly Unlearn? Stealthy MLLM Unlearning Attack
【Quick Read】: This paper addresses the privacy risks posed by multimodal large language models (MLLMs) that may memorize sensitive personal information and photos during training. MLLM unlearning methods fine-tune models to "forget" such information, but it remains unclear whether the knowledge is truly forgotten or merely hidden. The paper therefore introduces a new problem, the LLM unlearning attack, which aims to recover the unlearned knowledge. The key to the solution is the Stealthy Unlearning Attack (SUA) framework, which learns a universal noise pattern that, when applied to input images, triggers the model to reveal unlearned content; an embedding alignment loss further improves stealthiness by keeping the perturbation semantically undetectable. Experiments show that SUA effectively recovers unlearned information from MLLMs and that the learned noise generalizes to unseen images.
Link: https://arxiv.org/abs/2506.17265
Authors: Xianren Zhang, Hui Liu, Delvin Ce Zhang, Xianfeng Tang, Qi He, Dongwon Lee, Suhang Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review
Abstract:Multimodal Large Language Models (MLLMs) trained on massive data may memorize sensitive personal information and photos, posing serious privacy risks. To mitigate this, MLLM unlearning methods are proposed, which fine-tune MLLMs to reduce the ``forget’’ sensitive information. However, it remains unclear whether the knowledge has been truly forgotten or just hidden in the model. Therefore, we propose to study a novel problem of LLM unlearning attack, which aims to recover the unlearned knowledge of an unlearned LLM. To achieve the goal, we propose a novel framework Stealthy Unlearning Attack (SUA) framework that learns a universal noise pattern. When applied to input images, this noise can trigger the model to reveal unlearned content. While pixel-level perturbations may be visually subtle, they can be detected in the semantic embedding space, making such attacks vulnerable to potential defenses. To improve stealthiness, we introduce an embedding alignment loss that minimizes the difference between the perturbed and denoised image embeddings, ensuring the attack is semantically unnoticeable. Experimental results show that SUA can effectively recover unlearned information from MLLMs. Furthermore, the learned noise generalizes well: a single perturbation trained on a subset of samples can reveal forgotten content in unseen images. This indicates that knowledge reappearance is not an occasional failure, but a consistent behavior.
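The stealthiness term is the most concrete piece of the abstract. Below is a heavily hedged toy of one optimization step (the encoder, the blur-style "denoiser", the attack objective, and all hyperparameters are placeholders of ours, not the authors' setup):

```python
# Toy sketch: universal noise updated with attack loss + embedding alignment.
import torch
import torch.nn.functional as F

def sua_step(noise, images, embed, attack_loss_fn, denoise, lam=1.0, lr=1e-2):
    noise = noise.detach().requires_grad_(True)
    perturbed = (images + noise).clamp(0, 1)
    loss_attack = attack_loss_fn(perturbed)               # elicit unlearned content
    # embedding alignment: perturbed and denoised embeddings must stay close
    align = F.mse_loss(embed(perturbed), embed(denoise(perturbed)))
    loss = loss_attack + lam * align
    loss.backward()
    return (noise - lr * noise.grad).detach()             # one gradient step

# stand-ins: a linear "embedder", an average-pool "denoiser", a dummy objective
embed = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 16))
denoise = lambda x: F.avg_pool2d(x, 3, stride=1, padding=1)
attack_loss_fn = lambda x: -x.mean()                      # placeholder objective
noise = torch.zeros(1, 3, 8, 8)                           # universal: one pattern
images = torch.rand(4, 3, 8, 8)                           # batch of inputs
noise = sua_step(noise, images, embed, attack_loss_fn, denoise)
print(noise.abs().max())
```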
zh
[AI-162] OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning
【Quick Read】: This paper addresses the slow convergence and optimization instability of zeroth-order (ZO) optimization when fine-tuning large language models (LLMs), problems caused mainly by noisy gradient estimates. The key to the solution is OAT-Rephrase, an Optimization-Aware Training data rephrasing strategy that uses an LLM's understanding of ZO dynamics (specifically MeZO, derived directly from its paper) to rephrase training instances through a dual-stage pipeline consisting of a rewriter LLM and a semantic judge, preserving task relevance and logical consistency and thereby improving fine-tuning performance, often narrowing or eliminating the gap with first-order methods.
Link: https://arxiv.org/abs/2506.17264
Authors: Jikai Long, Zijian Hu, Xiaodong Yu, Jianwen Xie, Zhaozhuo Xu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fine-tuning large language models (LLMs) using zeroth-order optimization (ZO) offers a memory-efficient alternative to gradient-based methods but suffers from slower convergence and unstable optimization due to noisy gradient estimates. This paper introduces OAT-Rephrase, an Optimization-Aware Training data rephrasing strategy that leverages an LLM to rephrase training instances based on its understanding of the ZO dynamics, specifically MeZO, derived directly from its paper. The approach incorporates a dual-stage pipeline featuring a rewriter LLM and a semantic judge, ensuring all rephrasings retain task relevance and logical consistency. Evaluations across five classification tasks and three LLM architectures demonstrate that OAT-Rephrase consistently improves MeZO fine-tuning performance, often narrowing or eliminating the gap with first-order methods. Our findings suggest that optimization-aware rephrasing serves as a reusable and low-overhead enhancement for zeroth-order tuning regimes.
zh
[AI-163] Memory Allocation in Resource-Constrained Reinforcement Learning
【Quick Read】: This paper studies how memory constraints affect an agent's performance when it learns and makes decisions in unknown environments with standard reinforcement learning algorithms. The key question is how a memory-constrained agent should allocate its limited memory across internal processes, for example between estimating a world model and planning with that model; the paper examines this dilemma in MCTS- and DQN-based algorithms and studies how different memory allocations affect performance in episodic and continual learning settings.
Link: https://arxiv.org/abs/2506.17263
Authors: Massimiliano Tamborski, David Abel
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: RLDM 2025
Abstract:Resource constraints can fundamentally change both learning and decision-making. We explore how memory constraints influence an agent’s performance when navigating unknown environments using standard reinforcement learning algorithms. Specifically, memory-constrained agents face a dilemma: how much of their limited memory should be allocated to each of the agent’s internal processes, such as estimating a world model, as opposed to forming a plan using that model? We study this dilemma in MCTS- and DQN-based algorithms and examine how different allocations of memory impact performance in episodic and continual learning settings.
zh
[AI-164] AI to Identify Strain-sensitive Regions of the Optic Nerve Head Linked to Functional Loss in Glaucoma
【Quick Read】: This paper investigates whether optic nerve head (ONH) biomechanics improves the prediction of progressive visual field loss patterns in glaucoma. The key to the solution is combining a geometric deep learning model with strain features of the ONH measured under different intraocular pressures, and using explainable AI to identify the strain-sensitive regions that contribute most to the predictions. The results show that ONH strain significantly improves visual field loss prediction, and that the neuroretinal rim, rather than the lamina cribrosa, is the most critical region for the model's predictions.
Link: https://arxiv.org/abs/2506.17262
Authors: Thanadet Chuangsuwanich, Monisha E. Nongpiur, Fabian A. Braeu, Tin A. Tun, Alexandre Thiery, Shamira Perera, Ching Lin Ho, Martin Buist, George Barbastathis, Tin Aung, Michaël J.A. Girard
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Objective: (1) To assess whether ONH biomechanics improves prediction of three progressive visual field loss patterns in glaucoma; (2) to use explainable AI to identify strain-sensitive ONH regions contributing to these predictions. Methods: We recruited 237 glaucoma subjects. The ONH of one eye was imaged under two conditions: (1) primary gaze and (2) primary gaze with IOP elevated to ~35 mmHg via ophthalmo-dynamometry. Glaucoma experts classified the subjects into four categories based on the presence of specific visual field defects: (1) superior nasal step (N=26), (2) superior partial arcuate (N=62), (3) full superior hemifield defect (N=25), and (4) other/non-specific defects (N=124). Automatic ONH tissue segmentation and digital volume correlation were used to compute IOP-induced neural tissue and lamina cribrosa (LC) strains. Biomechanical and structural features were input to a Geometric Deep Learning model. Three classification tasks were performed to detect: (1) superior nasal step, (2) superior partial arcuate, (3) full superior hemifield defect. For each task, the data were split into 80% training and 20% testing sets. Area under the curve (AUC) was used to assess performance. Explainable AI techniques were employed to highlight the ONH regions most critical to each classification. Results: Models achieved high AUCs of 0.77-0.88, showing that ONH strain improved VF loss prediction beyond morphology alone. The inferior and inferotemporal rim were identified as key strain-sensitive regions, contributing most to visual field loss prediction and showing progressive expansion with increasing disease severity. Conclusion and Relevance: ONH strain enhances prediction of glaucomatous VF loss patterns. Neuroretinal rim, rather than the LC, was the most critical region contributing to model predictions.
zh
[AI-165] A Digital Twin Framework for Generation-IV Reactors with Reinforcement Learning-Enabled Health-Aware Supervisory Control
【Quick Read】: This paper addresses the large capital investment that hinders deployment of next-generation Generation IV (Gen-IV) nuclear reactors by introducing a digital twin framework that optimizes operation and maintenance policies to improve efficiency and reduce risk. The key to the solution is a closed-loop framework integrating surrogate modeling, reinforcement learning, and Bayesian inference for end-to-end online regulation and self-adjustment, with a Reference Governor control algorithm enforcing system constraints and Bayesian filtering assimilating detailed online simulations with measurement data to support accurate, real-time decisions.
Link: https://arxiv.org/abs/2506.17258
Authors: Jasmin Y. Lim, Dimitrios Pylorof, Humberto E. Garcia, Karthik Duraisamy
Institution: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 39 pages, 22 figures
Abstract:Generation IV (Gen-IV) nuclear power plants are envisioned to replace the current reactor fleet, bringing improvements in performance, safety, reliability, and sustainability. However, large cost investments currently inhibit the deployment of these advanced reactor concepts. Digital twins bridge real-world systems with digital tools to reduce costs, enhance decision-making, and boost operational efficiency. In this work, a digital twin framework is designed to operate the Gen-IV Fluoride-salt-cooled High-temperature Reactor, utilizing data-enhanced methods to optimize operational and maintenance policies while adhering to system constraints. The closed-loop framework integrates surrogate modeling, reinforcement learning, and Bayesian inference to streamline end-to-end communication for online regulation and self-adjustment. Reinforcement learning is used to consider component health and degradation to drive the target power generations, with constraints enforced through a Reference Governor control algorithm that ensures compliance with pump flow rate and temperature limits. These input driving modules benefit from detailed online simulations that are assimilated to measurement data with Bayesian filtering. The digital twin is demonstrated in three case studies: a one-year long-term operational period showcasing maintenance planning capabilities, short-term accuracy refinement with high-frequency measurements, and system shock capturing that demonstrates real-time recalibration capabilities when change in boundary conditions. These demonstrations validate robustness for health-aware and constraint-informed nuclear plant operation, with general applicability to other advanced reactor concepts and complex engineering systems.
zh
[AI-166] UltraSketchLLM : Saliency-Driven Sketching for Ultra-Low Bit LLM Compression
【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)在边缘设备上的部署问题,特别是针对内存限制导致的极端权重压缩需求。现有方法在实现多对一压缩时,要么依赖映射表造成内存开销,要么因随机权重分组导致精度显著下降。其解决方案的关键在于提出UltraSketchLLM,这是一种无索引的基于草图(sketch)的框架,通过数据草图技术将多个权重映射到单个值,并结合低估的AbsMaxMin草图、重要性感知的空间分配以及直通估计器,实现了每权重低至0.5位的压缩率,同时保持模型性能。
链接: https://arxiv.org/abs/2506.17255
作者: Sunan Zou,Ziyun Zhang,Xueting Sun,Guojie Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of large language models (LLMs) has outpaced the memory constraints of edge devices, necessitating extreme weight compression beyond the 1-bit limit. While quantization reduces model size, it is fundamentally limited to 1 bit per weight. Existing multiple-to-one compression methods either rely on mapping tables (inducing memory overhead) or incur severe accuracy degradation due to random weight grouping. We introduce UltraSketchLLM, an index-free, sketch-based framework that achieves ultra-low bit compression (down to 0.5 bits per weight) while preserving model performance. UltraSketchLLM leverages data sketching, a sub-linear representation technique from streaming applications, to map multiple weights to single values with bounded error. Our approach integrates an underestimate AbsMaxMin sketch to minimize relative errors for small weights, importance-aware space allocation to prioritize salient weights, and a straight-through estimator for compression-aware finetuning. Experiments on Llama-3.2-1B demonstrate up to 0.5-bit compression with competitive perplexity, alongside tolerable latency overhead. UltraSketchLLM offers a practical solution for deploying LLMs in resource-constrained environments.
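在摘要基础上,下面给出一个极简的“无索引草图压缩”玩具示意(Python;假设性实现,并非论文的 AbsMaxMin 草图本身):多个权重经可由种子复现的哈希映射到同一槽位,槽位只保留其中幅值最小的值(低估),解压时重新计算哈希即可,无需存储映射表。

```python
import numpy as np

# Schematic toy of index-free sketch compression; NOT the paper's AbsMaxMin sketch.
def slot_assignment(n_weights, n_slots, seed=0):
    # The "hash" is recomputable from the seed, so no mapping table is stored.
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_slots, size=n_weights)

def sketch_compress(weights, n_slots, seed=0):
    # Each slot keeps the minimum-magnitude weight assigned to it (an underestimate).
    slots = slot_assignment(weights.size, n_slots, seed)
    table = np.zeros(n_slots)
    filled = np.zeros(n_slots, dtype=bool)
    for i, w in enumerate(weights.ravel()):
        s = slots[i]
        if not filled[s] or abs(w) < abs(table[s]):
            table[s], filled[s] = w, True
    return table

def sketch_decompress(table, shape, n_slots, seed=0):
    slots = slot_assignment(int(np.prod(shape)), n_slots, seed)
    return table[slots].reshape(shape)

w = np.random.randn(1024)
table = sketch_compress(w, n_slots=256)        # 4 weights share one slot on average
w_hat = sketch_decompress(table, w.shape, 256)
print("mean |error|:", np.abs(w - w_hat).mean())
```

真实系统中还需配合重要性感知的槽位分配与压缩感知微调来控制精度损失,此处仅演示“多对一、无索引”的核心思想。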
zh
[AI-167] Keeping Up with the Models: Online Deployment and Routing of LLMs at Scale
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)服务提供商在面对新模型快速迭代和旧模型逐渐淘汰的背景下,如何在有限的部署容量和每请求成本预算内,动态管理模型库存并高效路由查询的问题。解决方案的关键在于提出一种分阶段的决策算法——StageRoute,其核心在于(i)利用奖励上置信界和成本下置信界乐观选择下一阶段最多 M_max 个模型,(ii)通过求解一个受预算约束的多臂老虎机子问题来路由每个查询,从而实现近最优的累积遗憾(regret)性能。
链接: https://arxiv.org/abs/2506.17254
作者: Shaoang Li,Jian Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid pace at which new large language models (LLMs) appear – and older ones become obsolete – forces LLM service providers to juggle a streaming inventory of models while respecting tight deployment capacity and per-query cost budgets. We cast the reality as an online decision problem that couples stage-wise deployment, made at fixed maintenance windows, with per-query routing among the models kept live. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to M_max models for the next stage using reward upper-confidence and cost lower-confidence bounds, then (ii) solves a budget-constrained bandit sub-problem to route each incoming query. We prove that StageRoute achieves a regret of order T^2/3 and provide a matching lower bound, thereby establishing its near-optimality. Moreover, our experiments confirm the theory, demonstrating that StageRoute performs close to the optimum in practical settings.
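StageRoute 的“乐观部署”步骤可以用奖励上置信界(UCB)与成本下置信界(LCB)来直观示意。下面是一个假设性的极简草图(置信项的具体形式为笔者假设,非论文实现):

```python
import numpy as np

# Schematic sketch of optimistic stage-wise model deployment (hypothetical form).
def deploy_stage(stats, m_max, cost_budget, t):
    """Rank models by reward UCB; keep those whose cost LCB fits the budget."""
    candidates = []
    for model, (n, mean_reward, mean_cost) in stats.items():
        bonus = np.sqrt(2.0 * np.log(max(t, 2)) / max(n, 1))
        reward_ucb = mean_reward + bonus         # optimism for exploration
        cost_lcb = max(mean_cost - bonus, 0.0)   # optimism w.r.t. the budget
        if cost_lcb <= cost_budget:
            candidates.append((reward_ucb, model))
    candidates.sort(reverse=True)
    return [m for _, m in candidates[:m_max]]

# stats: model -> (pulls, empirical reward, empirical per-query cost)
stats = {"llm-a": (120, 0.81, 0.9), "llm-b": (30, 0.75, 0.4), "llm-c": (5, 0.60, 1.5)}
print(deploy_stage(stats, m_max=2, cost_budget=1.0, t=155))
```

部署之后的逐查询路由还需求解一个受预算约束的多臂老虎机子问题,上述代码只覆盖“选模型进场”这一半。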
zh
[AI-168] MS-TVNet: A Long-Term Time Series Prediction Method Based on Multi-Scale Dynamic Convolution
【速读】:该论文旨在解决长期时间序列预测中对卷积网络潜力探索不足的问题,传统方法主要依赖Transformer和MLP模型,而卷积网络在该领域的应用尚未得到充分研究。其解决方案的关键在于引入一种多尺度时间序列重塑(reshape)模块,该模块能够有效捕捉多周期片段之间的关系及变量依赖性,并在此基础上构建了多尺度3D动态卷积神经网络MS-TVNet,从而在多个数据集上实现了优于基线模型的性能,达到了长期时间序列预测的最先进(SOTA)结果。
链接: https://arxiv.org/abs/2506.17253
作者: Chenghan Li,Mingchen Li,Yipu Liao,Ruisheng Diao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-term time series prediction has predominantly relied on Transformer and MLP models, while the potential of convolutional networks in this domain remains underexplored. To address this gap, we introduce a novel multi-scale time series reshape module, which effectively captures the relationships among multi-period patches and variable dependencies. Building upon this module, we propose MS-TVNet, a multi-scale 3D dynamic convolutional neural network. Through comprehensive evaluations on diverse datasets, MS-TVNet demonstrates superior performance compared to baseline models, achieving state-of-the-art (SOTA) results in long-term time series prediction. Our findings highlight the effectiveness of leveraging convolutional networks for capturing complex temporal patterns, suggesting a promising direction for future research in this area. The code is released at this https URL.
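摘要未披露“多尺度重塑模块”的内部细节;下面按常见的 FFT 主周期折叠思路(类似 TimesNet 的做法)给出一个假设性示意,把一维序列按 top-k 主周期折叠成“周期数 × 周期长度”的二维片段:

```python
import numpy as np

# Schematic, hypothetical multi-period reshape (TimesNet-style), not the paper's module.
def reshape_by_periods(x, k=2):
    # Pick the k dominant frequencies from the amplitude spectrum,
    # then fold the series into (cycles x period) 2D patches.
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                                   # ignore the DC component
    freqs = [f for f in np.argsort(amp)[-k:] if f > 0]
    patches = []
    for f in freqs:
        period = max(len(x) // int(f), 1)
        n_cycles = len(x) // period
        patches.append(x[: n_cycles * period].reshape(n_cycles, period))
    return patches

t = np.arange(400)
x = np.sin(2 * np.pi * t / 50) + 0.5 * np.sin(2 * np.pi * t / 16)
for p in reshape_by_periods(x):
    print(p.shape)   # e.g. (8, 50) and (25, 16)
```

折叠后的二维片段即可交给 2D/3D 卷积去同时建模周期内与周期间的依赖。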
zh
[AI-169] Adaptive Sample Scheduling for Direct Preference Optimization
【速读】:该论文试图解决在直接偏好优化(Direct Preference Optimization, DPO)过程中,由于语言模型状态的动态变化而影响训练样本选择效果的问题。传统数据选择策略往往忽视了模型在DPO过程中的演化状态,导致无法充分发挥有限偏好数据的潜力。解决方案的关键在于提出一种名为SamS的算法,该算法通过根据语言模型的学习反馈,在每个训练批次中自适应地选择样本,从而动态调整训练样本的调度,以最大化模型的泛化性能。该方法无需修改核心DPO算法,仅通过集成SamS即可在多个任务上显著提升性能,且计算开销较低。
链接: https://arxiv.org/abs/2506.17252
作者: Zixuan Huang,Yikun Ban,Lean Fu,Xiaojie Li,Zhongxiang Dai,Jianxin Li,Deqing Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the DPO process. %including active querying, response pair selection, and data pre-selection. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model’s evolving states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM’s learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through more effective utilization of fixed preference datasets.
zh
[AI-170] Training-free LLM Verification via Recycling Few-shot Examples
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在推理过程中固有的随机性以及由此导致的结论不一致问题,这些问题限制了模型的可靠性和准确性。为了解决这一问题,论文提出了一种新颖且有效的框架——Referi,其关键在于利用给定的少量示例(few-shot examples)来验证LLMs的输出,而不仅仅是用于生成输出。具体而言,Referi通过结合两种基于贝叶斯规则设计的评分机制,对生成的输出进行评估,并通过少量额外的LLM推理选择出既具有高置信度又符合上下文一致性的候选答案。
链接: https://arxiv.org/abs/2506.17251
作者: Dongseok Lee,Jimyung Hong,Dongyoung Kim,Jaehyung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although LLMs have achieved remarkable performance, the inherent stochasticity of their reasoning process and varying conclusions present significant challenges. Majority voting or Best-of-N with external verification models has been explored to find the most promising solution among multiple LLM outputs. However, these approaches have certain limitations, such as limited applicability or the cost of an additional training step. To address this problem, we propose a novel and effective framework that Recycles Few-shot examples to verify LLM outputs (Referi). Our key idea is to additionally utilize the given few-shot examples to evaluate the candidate outputs of the target query, not only using them to generate outputs as in the conventional few-shot prompting setup. Specifically, Referi evaluates the generated outputs by combining two different scores, motivated by Bayes’ rule, and subsequently selects the candidate that is both confidently determined and contextually coherent through a few additional LLM inferences. Experiments with three different LLMs across seven diverse tasks demonstrate that our framework significantly improves the accuracy of LLMs, achieving an average gain of 4.8% through effective response selection, without additional training.
zh
[AI-171] Towards Interpretable Adversarial Examples via Sparse Adversarial Attack
【速读】:该论文试图解决现有稀疏攻击在生成可解释的对抗样本时存在的稀疏性不足、计算开销大、迁移能力差和攻击强度弱的问题。其解决方案的关键在于引入一种新颖且理论严谨的参数化技术,以近似NP难的l0优化问题,从而使直接优化稀疏扰动成为可能;同时设计了一种新的损失函数,通过最大化对抗性特征并最小化扰动像素数量来增强初始扰动,从而实现高效、可迁移且强大的对抗攻击。
链接: https://arxiv.org/abs/2506.17250
作者: Fudong Lin,Jiadong Lou,Hao Wang,Brian Jalaian,Xu Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse attacks optimize the magnitude of adversarial perturbations for fooling deep neural networks (DNNs) while perturbing only a few pixels (i.e., under the l0 constraint), making them suitable for interpreting the vulnerability of DNNs. However, existing solutions fail to yield interpretable adversarial examples due to their poor sparsity. Worse still, they often struggle with heavy computational overhead, poor transferability, and weak attack strength. In this paper, we aim to develop a sparse attack for understanding the vulnerability of CNNs by minimizing the magnitude of initial perturbations under the l0 constraint, to overcome the existing drawbacks while achieving a fast, transferable, and strong attack to DNNs. In particular, a novel and theoretically sound parameterization technique is introduced to approximate the NP-hard l0 optimization problem, making directly optimizing sparse perturbations computationally feasible. Besides, a novel loss function is designed to augment initial perturbations by maximizing the adversary property and minimizing the number of perturbed pixels simultaneously. Extensive experiments are conducted to demonstrate that our approach, with theoretical performance guarantees, outperforms state-of-the-art sparse attacks in terms of computational overhead, transferability, and attack strength, expecting to serve as a benchmark for evaluating the robustness of DNNs. In addition, theoretical and empirical results validate that our approach yields sparser adversarial examples, empowering us to discover two categories of noises, i.e., “obscuring noise” and “leading noise”, which will help interpret how adversarial perturbation misleads the classifiers into incorrect predictions. Our code is available at this https URL.
zh
[AI-172] Improving Prediction Certainty Estimation for Reliable Early Exiting via Null Space Projection IJCAI2025
【速读】:该论文试图解决早期退出(early exiting)方法在预测置信度估计中过度依赖与类别相关的logits,而忽视了特征中与类别无关信息对预测置信度的负面影响,导致错误的早期退出问题。解决方案的关键在于定义了一个NSP(Null Space Projection,零空间投影)分数,通过考虑特征中与类别无关信息的比例来更准确地估计预测置信度,并在此基础上提出了一种基于置信度感知概率(Certainty-Aware Probability, CAP)的新型早期退出方法,该方法结合了logits和NSP分数以提升置信度估计的可靠性,从而实现更有效的退出决策。
链接: https://arxiv.org/abs/2506.17249
作者: Jianing He,Qi Zhang,Duoqian Miao,Yi Kun,Shufeng Hao,Hongyun Zhang,Zhihua Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: IJCAI 2025, 9 pages
Abstract:Early exiting has demonstrated great potential in accelerating the inference of pre-trained language models (PLMs) by enabling easy samples to exit at shallow layers, eliminating the need for executing deeper layers. However, existing early exiting methods primarily rely on class-relevant logits to formulate their exiting signals for estimating prediction certainty, neglecting the detrimental influence of class-irrelevant information in the features on prediction certainty. This leads to an overestimation of prediction certainty, causing premature exiting of samples with incorrect early predictions. To remedy this, we define an NSP score to estimate prediction certainty by considering the proportion of class-irrelevant information in the features. On this basis, we propose a novel early exiting method based on the Certainty-Aware Probability (CAP) score, which integrates insights from both logits and the NSP score to enhance prediction certainty estimation, thus enabling more reliable exiting decisions. The experimental results on the GLUE benchmark show that our method can achieve an average speed-up ratio of 2.19x across all tasks with negligible performance degradation, surpassing the state-of-the-art (SOTA) ConsistentEE by 28%, yielding a better trade-off between task performance and inference efficiency. The code is available at this https URL.
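结合标题,NSP 应指零空间投影(Null Space Projection)。下面给出“特征中类无关信息占比”一种可能读法的假设性示意:把特征分解为分类器权重行空间分量与零空间分量,用零空间分量的范数占比作为 NSP 分数:

```python
import numpy as np

# Schematic, hypothetical reading of a null-space-projection score.
def nsp_score(h, W):
    # Rows of the classifier weight matrix W span the class-relevant subspace;
    # the component of h outside that row space carries class-irrelevant info.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    h_relevant = Vt.T @ (Vt @ h)        # projection onto the row space of W
    h_null = h - h_relevant             # null-space (class-irrelevant) part
    return np.linalg.norm(h_null) / (np.linalg.norm(h) + 1e-12)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 768))          # 10 classes, 768-dim features
h = rng.normal(size=768)
print(nsp_score(h, W))                  # close to 1: mostly class-irrelevant
```

直观上,NSP 分数越高,说明当前层特征中对分类无用的成分越多,此时就不应过早退出。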
zh
[AI-173] Efficient Quantification of Multimodal Interaction at Sample Level ICML2025
【速读】:该论文试图解决多模态信息中模态间相互作用(包括冗余性、独特性和协同性)在样本层面的精确量化问题,这一问题对于分析多模态系统中的信息动态至关重要,但面临显著的理论和计算挑战。解决方案的关键在于提出一种基于点态信息论(pointwise information theory)的轻量级样本层面多模态交互估计器(Lightweight Sample-wise Multimodal Interaction, LSMI),其核心是通过高效的熵估计方法,实现对连续分布下样本层面交互的准确估计,并揭示多模态数据中细粒度的样本和类别级动态特性。
链接: https://arxiv.org/abs/2506.17248
作者: Zequn Yang,Hongfa Wang,Di Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted to ICML 2025
Abstract:Interactions between modalities – redundancy, uniqueness, and synergy – collectively determine the composition of multimodal information. Understanding these interactions is crucial for analyzing information dynamics in multimodal systems, yet their accurate sample-level quantification presents significant theoretical and computational challenges. To address this, we introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable and measurable interaction. Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation, specifically tailored for sample-wise estimation in continuous distributions. Extensive experiments on synthetic and real-world datasets validate LSMI’s precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling. The code is available at this https URL.
zh
[AI-174] Recursive Learning-Based Virtual Buffering for Analytical Global Placement
【速读】:该论文旨在解决现代工艺节点中互连延迟与单元延迟比例失衡导致的物理综合流程中时序闭合问题,特别是传统缓冲方法在全局布局阶段计算成本高以及基于机器学习的缓冲方法未能充分考虑电气规则检查(ERC)违规并无法有效融入物理设计流程的问题。解决方案的关键在于提出MLBuf-RePlAce,这是一个基于学习驱动的虚拟缓冲感知分析性全局布局框架,采用高效的递归学习生成式缓冲方法预测缓冲器类型和位置,从而在全局布局阶段解决ERC违规问题。
链接: https://arxiv.org/abs/2506.17247
作者: Andrew B. Kahng,Yiting Liu,Zhiang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Due to the skewed scaling of interconnect versus cell delay in modern technology nodes, placement with buffer porosity (i.e., cell density) awareness is essential for timing closure in physical synthesis flows. However, existing approaches face two key challenges: (i) traditional van Ginneken-Lillis-style buffering approaches are computationally expensive during global placement; and (ii) machine learning-based approaches, such as BufFormer, lack a thorough consideration of Electrical Rule Check (ERC) violations and fail to “close the loop” back into the physical design flow. In this work, we propose MLBuf-RePlAce, the first open-source learning-driven virtual buffering-aware analytical global placement framework, built on top of the OpenROAD infrastructure. MLBuf-RePlAce adopts an efficient recursive learning-based generative buffering approach to predict buffer types and locations, addressing ERC violations during global placement. We compare MLBuf-RePlAce against the default virtual buffering-based timing-driven global placer in OpenROAD, using open-source testcases from the TILOS MacroPlacement and OpenROAD-flow-scripts repositories. Without degradation of post-route power, MLBuf-RePlAce achieves (maximum, average) improvements of (56%, 31%) in total negative slack (TNS) within the open-source OpenROAD flow. When evaluated by completion in a commercial flow, MLBuf-RePlAce achieves (maximum, average) improvements of (53%, 28%) in TNS with an average of 0.2% improvement in post-route power.
zh
[AI-175] Graph Neural Networks in Multi-Omics Cancer Research: A Structured Survey
【速读】:该论文旨在解决多组学数据整合中的复杂生物机制解析问题,特别是在癌症研究中的应用。其解决方案的关键在于利用图神经网络(Graph Neural Networks, GNNs)框架,以建模异构且结构化的组学数据,从而实现分子互作和调控网络的精确表示。通过分类不同靶向组学层、GNN结构及生物任务,论文揭示了当前研究中混合模型与可解释性模型的兴起趋势,以及注意力机制和对比学习的广泛应用,为构建有效的集成癌症分析GNN管道提供了理论支持与实践指导。
链接: https://arxiv.org/abs/2506.17234
作者: Payam Zohari,Mostafa Haghir Chehreghani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 51 pages
Abstract:The task of data integration for multi-omics data has emerged as a powerful strategy to unravel the complex biological underpinnings of cancer. Recent advancements in graph neural networks (GNNs) offer an effective framework to model heterogeneous and structured omics data, enabling precise representation of molecular interactions and regulatory networks. This systematic review explores several recent studies that leverage GNN-based architectures in multi-omics cancer research. We classify the approaches based on their targeted omics layers, graph neural network structures, and biological tasks such as subtype classification, prognosis prediction, and biomarker discovery. The analysis reveals a growing trend toward hybrid and interpretable models, alongside increasing adoption of attention mechanisms and contrastive learning. Furthermore, we highlight the use of patient-specific graphs and knowledge-driven priors as emerging directions. This survey serves as a comprehensive resource for researchers aiming to design effective GNN-based pipelines for integrative cancer analysis, offering insights into current practices, limitations, and potential future directions.
zh
[AI-176] MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving
【速读】:该论文旨在解决基于机器学习方法求解偏微分方程(Partial Differential Equations, PDEs)时面临的通用性与效率不足的问题,具体表现为多输入和多尺度泛化能力有限以及计算成本高昂。其解决方案的关键在于提出了一种名为多输入多尺度高效Transformer(Multi-input and Multi-scale Efficient Transformer, MMET)的框架,该框架通过将网格点和查询点作为两个序列分别输入编码器和解码器,并引入门控条件嵌入(Gated Condition Embedding, GCE)层以处理不同维度的输入变量或函数,从而有效解决多尺度和多输入问题;同时采用基于希尔伯特曲线的重序列化和块嵌入机制,减少输入长度,显著降低计算成本。
链接: https://arxiv.org/abs/2506.17230
作者: Yichen Luo,Jia Wang,Dapeng Lan,Yu Liu,Zhibo Pang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Partial Differential Equations (PDEs) are fundamental for modeling physical systems, yet solving them in a generic and efficient manner using machine learning-based approaches remains challenging due to limited multi-input and multi-scale generalization capabilities, as well as high computational costs. This paper proposes the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework designed to address the above challenges. MMET decouples mesh and query points as two sequences and feeds them into the encoder and decoder, respectively, and uses a Gated Condition Embedding (GCE) layer to embed input variables or functions with varying dimensions, enabling effective solutions for multi-scale and multi-input problems. Additionally, a Hilbert curve-based reserialization and patch embedding mechanism decrease the input length. This significantly reduces the computational cost when dealing with large-scale geometric models. These innovations enable efficient representations and support multi-scale resolution queries for large-scale and multi-input PDE problems. Experimental evaluations on diverse benchmarks spanning different physical fields demonstrate that MMET outperforms SOTA methods in both accuracy and computational efficiency. This work highlights the potential of MMET as a robust and scalable solution for real-time PDE solving in engineering and physics-based applications, paving the way for future explorations into pre-trained large-scale models in specific domains. This work is open-sourced at this https URL.
zh
[AI-177] Wisdom of Crowds Through Myopic Self-Confidence Adaptation
【速读】:该论文试图解决群体决策中如何通过代理间的相互影响来优化对共同世界状态的估计问题,具体而言,是研究在非贝叶斯学习规则下,代理如何通过迭代更新其估计以达到最小化最终估计方差的目标。解决方案的关键在于将该问题建模为一个博弈论下的多目标优化问题,并通过分析代理之间的权重分配来确定帕累托前沿和纳什均衡,同时证明了异步最优响应动态收敛至严格纳什均衡。
链接: https://arxiv.org/abs/2506.18195
作者: Giacomo Como,Fabio Fagnani,Anton Proskurnikov
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Physics and Society (physics.soc-ph)
备注:
Abstract:The wisdom of crowds is an umbrella term for phenomena suggesting that the collective judgment or decision of a large group can be more accurate than the individual judgments or decisions of the group members. A well-known example illustrating this concept is the competition at a country fair described by Galton, where the median value of the individual guesses about the weight of an ox resulted in an astonishingly accurate estimate of the actual weight. This phenomenon resembles classical results in probability theory and relies on independent decision-making. The accuracy of the group’s final decision can be significantly reduced if the final agents’ opinions are driven by a few influential agents. In this paper, we consider a group of agents who initially possess uncorrelated and unbiased noisy measurements of a common state of the world. Assume these agents iteratively update their estimates according to a simple non-Bayesian learning rule, commonly known in mathematical sociology as the French-DeGroot dynamics or iterative opinion pooling. As a result of this iterative distributed averaging process, each agent arrives at an asymptotic estimate of the state of the world, with the variance of this estimate determined by the matrix of weights the agents assign to each other. Every agent aims at minimizing the variance of her asymptotic estimate of the state of the world; however, such variance is also influenced by the weights allocated by other agents. To achieve the best possible estimate, the agents must then solve a game-theoretic, multi-objective optimization problem defined by the available sets of influence weights. We characterize both the Pareto frontier and the set of Nash equilibria in the resulting game. Additionally, we examine asynchronous best-response dynamics for the group of agents and prove their convergence to the set of strict Nash equilibria.
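French-DeGroot 迭代意见汇聚很容易数值模拟。下面的示意从无偏独立噪声测量出发迭代 x ← Wx:在行随机且本原的 W 下收敛到 π^T x0(π 为左 Perron 向量),渐近方差为 sigma^2 * ||π||^2,等权场合即经典的 sigma^2 / n:

```python
import numpy as np

# Minimal simulation of French-DeGroot iterative opinion pooling.
def degroot(W, x0, iters=500):
    # Every agent repeatedly replaces her estimate by a weighted average
    # of her neighbors' estimates (rows of W are the trust weights).
    x = x0.copy()
    for _ in range(iters):
        x = W @ x
    return x

n, sigma2 = 5, 1.0
rng = np.random.default_rng(1)
x0 = rng.normal(0.0, np.sqrt(sigma2), size=n)   # unbiased noisy measurements
W = np.full((n, n), 1.0 / n)                    # equal mutual trust: pi_i = 1/n
x_inf = degroot(W, x0)
print(x_inf)        # all agents agree on the same consensus value
print(x0.mean())    # here the consensus is the sample mean, variance sigma2/n
```

若少数代理权重过大(π 集中),‖π‖² 变大,群体估计的方差随之上升,这正是摘要所述“意见被少数有影响力代理主导时群体智慧退化”的机制。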
zh
[AI-178] OmniESI: A unified framework for enzyme-substrate interaction prediction with progressive conditional deep learning
【速读】:该论文旨在解决现有预测方法未能有效整合酶催化先验知识,从而无法合理调节与催化模式不匹配的通用蛋白质-分子特征的问题。其解决方案的关键在于提出一种两阶段渐进框架OmniESI,通过条件深度学习实现酶-底物相互作用的预测,该框架包含两个条件网络,分别强调酶反应特异性和关键催化相关相互作用,从而在潜在空间中逐步调整特征,从通用蛋白质-分子领域过渡到催化感知领域。
链接: https://arxiv.org/abs/2506.17963
作者: Zhiwei Nie,Hongyu Zhang,Hao Jiang,Yutian Liu,Xiansong Huang,Fan Xu,Jie Fu,Zhixiang Ren,Yonghong Tian,Wen-Bin Zhang,Jie Chen
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding and modeling enzyme-substrate interactions is crucial for catalytic mechanism research, enzyme engineering, and metabolic engineering. Although a large number of predictive methods have emerged, they do not incorporate prior knowledge of enzyme catalysis to rationally modulate general protein-molecule features that are misaligned with catalytic patterns. To address this issue, we introduce a two-stage progressive framework, OmniESI, for enzyme-substrate interaction prediction through conditional deep learning. By decomposing the modeling of enzyme-substrate interactions into a two-stage progressive process, OmniESI incorporates two conditional networks that respectively emphasize enzymatic reaction specificity and crucial catalysis-related interactions, facilitating a gradual feature modulation in the latent space from general protein-molecule domain to catalysis-aware domain. On top of this unified architecture, OmniESI can adapt to a variety of downstream tasks, including enzyme kinetic parameter prediction, enzyme-substrate pairing prediction, enzyme mutational effect prediction, and enzymatic active site annotation. Under the multi-perspective performance evaluation of in-distribution and out-of-distribution settings, OmniESI consistently delivered superior performance than state-of-the-art specialized methods across seven benchmarks. More importantly, the proposed conditional networks were shown to internalize the fundamental patterns of catalytic efficiency while significantly improving prediction performance, with only negligible parameter increases (0.16%), as demonstrated by ablation studies on key components. Overall, OmniESI represents a unified predictive approach for enzyme-substrate interactions, providing an effective tool for catalytic mechanism cracking and enzyme engineering with strong generalization and broad applicability.
zh
[AI-179] Greedy Selection under Independent Increments: A Toy Model Analysis
【速读】:该论文研究的是在N个独立同分布的离散时间随机过程(discrete-time stochastic processes)中进行迭代选择的问题,每个过程具有独立增量。在每一阶段,根据其观测值保留固定数量的过程。论文的关键在于证明了在该简单模型下,选择最终最大值过程的最优策略是在每个阶段应用贪心选择(greedy selection)。尽管该结果依赖于强独立性假设,但它为多阶段淘汰设置中的贪心启发式方法提供了清晰的理论依据,并可能作为理解高维应用中相关算法的一个简化示例。
链接: https://arxiv.org/abs/2506.17941
作者: Huitao Yang
机构: 未知
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We study an iterative selection problem over N i.i.d. discrete-time stochastic processes with independent increments. At each stage, a fixed number of processes are retained based on their observed values. Under this simple model, we prove that the optimal strategy for selecting the final maximum-value process is to apply greedy selection at each stage. While the result relies on strong independence assumptions, it offers a clean justification for greedy heuristics in multi-stage elimination settings and may serve as a toy example for understanding related algorithms in high-dimensional applications.
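该玩具模型可直接模拟,体会“逐阶段贪心保留”的效果。以下为假设参数下的示意:N 条具有独立高斯增量的随机游走,每阶段按当前观测值保留前 k 条:

```python
import numpy as np

# Toy simulation of multi-stage greedy elimination (illustrative parameters).
def greedy_elimination(n=1000, stages=(200, 40, 8, 1), steps=50, seed=0):
    # i.i.d. random walks with independent Gaussian increments; after each
    # stage, greedily keep the processes with the largest current values.
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    for keep in stages:
        values = values + rng.normal(size=(values.size, steps)).sum(axis=1)
        values = np.sort(values)[-keep:]
    return values

print(greedy_elimination())   # final value of the single surviving process
```

论文的结论是:在独立增量假设下,这种每阶段取当前最大者的贪心策略,对“最终最大值过程被选中”的概率而言已是最优。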
zh
[AI-180] Residual Connection-Enhanced ConvLSTM for Lithium Dendrite Growth Prediction
【速读】:该论文旨在解决锂枝晶生长对可充电电池性能和安全性造成的负面影响,特别是由枝晶引发的短路和容量衰减问题。其解决方案的关键在于提出一种基于残差连接增强的卷积长短期记忆网络(Residual Connection-Enhanced ConvLSTM),通过引入残差连接缓解梯度消失问题,提升特征保留能力,并有效捕捉局部枝晶生长动态与宏观电池行为,从而提高枝晶生长模式预测的准确性与计算效率。
链接: https://arxiv.org/abs/2506.17756
作者: Hosung Lee,Byeongoh Hwang,Dasan Kim,Myungjoo Kang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 14pages, 6figures, accepted to Journal of The Electrochemical Society
Abstract:The growth of lithium dendrites significantly impacts the performance and safety of rechargeable batteries, leading to short circuits and capacity degradation. This study proposes a Residual Connection-Enhanced ConvLSTM model to predict dendrite growth patterns with improved accuracy and computational efficiency. By integrating residual connections into ConvLSTM, the model mitigates the vanishing gradient problem, enhances feature retention across layers, and effectively captures both localized dendrite growth dynamics and macroscopic battery behavior. The dataset was generated using a phase-field model, simulating dendrite evolution under varying conditions. Experimental results show that the proposed model achieves up to 7% higher accuracy and significantly reduces mean squared error (MSE) compared to conventional ConvLSTM across different voltage conditions (0.1V, 0.3V, 0.5V). This highlights the effectiveness of residual connections in deep spatiotemporal networks for electrochemical system modeling. The proposed approach offers a robust tool for battery diagnostics, potentially aiding in real-time monitoring and optimization of lithium battery performance. Future research can extend this framework to other battery chemistries and integrate it with real-world experimental data for further validation.
zh
[AI-181] Resolving the Ti-V Phase Diagram Discrepancy with First-Principles Calculations and Bayesian Learning
【速读】:该论文试图解决钛-钒(Ti-V)二元合金是否表现出体心立方(BCC)共溶间隙还是完全固溶的争议问题。解决方案的关键在于采用了一种基于从头算(ab initio)与机器学习相结合的工作流程,该流程将主动训练的矩张量势(Moment Tensor Potential)与贝叶斯热力学推断相耦合,从而在热力学极限下获得了整个成分范围内的Ti-V二元系相图,并给出了置信区间。这一方法成功再现了所有实验特征,证明了其可靠性,并支持存在一个在T = 980 K和c = 0.67处终止的BCC共溶间隙的模型。
链接: https://arxiv.org/abs/2506.17719
作者: Timofei Miryashkin,Olga Klimanova,Alexander Shapeev
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Conflicting experiments disagree on whether the titanium-vanadium (Ti-V) binary alloy exhibits a body-centred cubic (BCC) miscibility gap or remains completely soluble. A leading hypothesis attributes the miscibility gap to oxygen contamination during alloy preparation. To resolve this controversy, we use an ab initio + machine-learning workflow that couples an actively-trained Moment Tensor Potential to Bayesian thermodynamic inference. Using this workflow, we obtain the phase diagram of the Ti-V binary system across the entire composition range, together with confidence intervals in the thermodynamic limit. The resulting diagram reproduces all experimental features, demonstrating the robustness of our approach, and clearly favors the variant with a BCC miscibility gap terminating at T = 980 K and c = 0.67. Because oxygen was excluded from simulations, the gap cannot be attributed to impurity effects, contradicting recent CALPHAD reassessments.
zh
[AI-182] Exploring Strategies for Personalized Radiation Therapy Part I Unlocking Response-Related Tumor Subregions with Class Activation Mapping
【速读】:该论文旨在解决个性化精准放疗中对预后性、空间信息丰富的特征识别以及根据个体反应调整治疗方案的需求。其解决方案的关键在于采用集成自编码器分类模型,结合基于像素的类激活映射(pixel wise CAM)技术,以实现更精确的治疗反应预测,并提供详细的解剖学空间信息,从而支持生物验证和对异质性治疗反应的深入理解。
链接: https://arxiv.org/abs/2506.17536
作者: Hao Peng,Steve Jiang,Robert Timmerman
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized precision radiation therapy requires more than simple classification; it demands the identification of prognostic, spatially informative features and the ability to adapt treatment based on individual response. This study compares three approaches for predicting treatment response: standard radiomics, gradient based features, and convolutional neural networks enhanced with Class Activation Mapping. We analyzed 69 brain metastases from 39 patients treated with Gamma Knife radiosurgery. An integrated autoencoder classifier model was used to predict whether tumor volume would shrink by more than 20 percent at a three months follow up, framed as a binary classification task. The results highlight the models’ strength in hierarchical feature extraction and the classifier’s discriminative capacity. Among the models, pixel wise CAM provides the most detailed spatial insight, identifying lesion specific regions rather than relying on fixed patterns, demonstrating strong generalization. In non responding lesions, the activated regions may indicate areas of radio resistance. Pixel wise CAM outperformed both radiomics and gradient based methods in classification accuracy. Moreover, its fine grained spatial features allow for alignment with cellular level data, supporting biological validation and deeper understanding of heterogeneous treatment responses. Although further validation is necessary, these findings underscore the promise in guiding personalized and adaptive radiotherapy strategies for both photon and particle therapies.
zh
[AI-183] Exploring Strategies for Personalized Radiation Therapy Part II Predicting Tumor Drift Patterns with Diffusion Models
【速读】:该论文试图解决放射治疗中因患者间剂量和时间参数差异导致的治疗响应预测困难问题,特别是在脑癌治疗中,分次或分阶段立体定向放射外科相比单次分割更具安全性,但增加了治疗反应预测的复杂性。解决方案的关键在于提出一种基于个性化超分次立体定向适应性放疗(PULSAR)的策略,通过动态调整治疗方案以适应肿瘤随时间的变化,并引入去噪扩散隐式模型(DDIM)来学习治疗前后影像的数据驱动映射,从而有效模拟个体化肿瘤演化并定位与治疗反应相关的区域。
链接: https://arxiv.org/abs/2506.17491
作者: Hao Peng,Steve Jiang,Robert Timmerman
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Radiation therapy outcomes are decided by two key parameters, dose and timing, whose best values vary substantially across patients. This variability is especially critical in the treatment of brain cancer, where fractionated or staged stereotactic radiosurgery improves safety compared to single fraction approaches, but complicates the ability to predict treatment response. To address this challenge, we employ Personalized Ultra-fractionated Stereotactic Adaptive Radiotherapy (PULSAR), a strategy that dynamically adjusts treatment based on how each tumor evolves over time. However, the success of PULSAR and other adaptive approaches depends on predictive tools that can guide early treatment decisions and avoid both overtreatment and undertreatment. Yet current radiomics and dosiomics models offer limited insight into the evolving spatial and temporal patterns of tumor response. To overcome these limitations, we propose a novel framework using Denoising Diffusion Implicit Models (DDIM), which learns data-driven mappings from pre to post treatment imaging. In this study, we developed single step and iterative denoising strategies and compared their performance. The results show that diffusion models can effectively simulate patient specific tumor evolution and localize regions associated with treatment response. The proposed strategy provides a promising foundation for modeling heterogeneous treatment response and enabling early, adaptive interventions, paving the way toward more personalized and biologically informed radiotherapy.
zh
[AI-184] Challenges in Grounding Language in the Real World
【速读】:该论文试图解决如何构建一个语言理解系统,使人类能够使用自然语言与物理机器人进行协作的问题。其解决方案的关键在于整合具备交互式任务学习能力的认知代理(cognitive agent)与大型语言模型的语言能力,从而实现人机之间的有效语言交互。
链接: https://arxiv.org/abs/2506.17375
作者: Peter Lindes,Kaoutar Skiker
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures
Abstract:A long-term goal of Artificial Intelligence is to build a language understanding system that allows a human to collaborate with a physical robot using language that is natural to the human. In this paper we highlight some of the challenges in doing this, and propose a solution that integrates the abilities of a cognitive agent capable of interactive task learning in a physical robot with the linguistic abilities of a large language model. We also point the way to an initial implementation of this approach.
zh
机器学习
[LG-0] Offline Goal-Conditioned Reinforcement Learning with Projective Quasimetric Planning
链接: https://arxiv.org/abs/2506.18847
作者: Anthony Kobanda,Waris Radji,Mathieu Petitbois,Odalric-Ambrym Maillard,Rémy Portelas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Offline Goal-Conditioned Reinforcement Learning seeks to train agents to reach specified goals from previously collected trajectories. Scaling that promise to long-horizon tasks remains challenging, notably due to compounding value-estimation errors. A principled geometric perspective offers a potential solution to these issues. Following this insight, we introduce Projective Quasimetric Planning (ProQ), a compositional framework that learns an asymmetric distance and then repurposes it, firstly as a repulsive energy forcing a sparse set of keypoints to uniformly spread over the learned latent space, and secondly as a structured directional cost guiding towards proximal sub-goals. In particular, ProQ couples this geometry with a Lagrangian out-of-distribution detector to ensure the learned keypoints stay within reachable areas. By unifying metric learning, keypoint coverage, and goal-conditioned control, our approach produces meaningful sub-goals and robustly drives long-horizon goal-reaching on diverse navigation benchmarks.
[LG-1] Multi-Agent Online Control with Adversarial Disturbances
链接: https://arxiv.org/abs/2506.18814
作者: Anas Barakat,John Lazarsfeld,Georgios Piliouras,Antonios Varvitsiotis
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
*备注:
Abstract:Multi-agent control problems involving a large number of agents with competing and time-varying objectives are increasingly prevalent in applications across robotics, economics, and energy systems. In this paper, we study online control in multi-agent linear dynamical systems with disturbances. In contrast to most prior work in multi-agent control, we consider an online setting where disturbances are adversarial and where each agent seeks to minimize its own, adversarial sequence of convex losses. In this setting, we investigate the robustness of gradient-based controllers from single-agent online control, with a particular focus on understanding how individual regret guarantees are influenced by the number of agents in the system. Under minimal communication assumptions, we prove near-optimal sublinear regret bounds that hold uniformly for all agents. Finally, when the objectives of the agents are aligned, we show that the multi-agent control problem induces a time-varying potential game for which we derive equilibrium gap guarantees.
[LG-2] Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures
链接: https://arxiv.org/abs/2506.18812
作者: Aristotelis Papatheodorou,Pranav Vaidhyanathan,Natalia Ares,Ioannis Havoutis
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Presented at Equivariant Systems: Theory and Applications in State Estimation, Artificial Intelligence and Control, Robotics: Science and Systems (RSS) 2025 Workshop, 6 Pages, 3 Figures
Abstract:Physics-informed deep learning has achieved remarkable progress by embedding geometric priors, such as Hamiltonian symmetries and variational principles, into neural networks, enabling structure-preserving models that extrapolate with high accuracy. However, in systems with dissipation and holonomic constraints, ubiquitous in legged locomotion and multibody robotics, the canonical symplectic form becomes degenerate, undermining the very invariants that guarantee stability and long-term prediction. In this work, we tackle this foundational limitation by introducing Presymplectification Networks (PSNs), the first framework to learn the symplectification lift via Dirac structures, restoring a non-degenerate symplectic geometry by embedding constrained systems into a higher-dimensional manifold. Our architecture combines a recurrent encoder with a flow-matching objective to learn the augmented phase-space dynamics end-to-end. We then attach a lightweight Symplectic Network (SympNet) to forecast constrained trajectories while preserving energy, momentum, and constraint satisfaction. We demonstrate our method on the dynamics of the ANYmal quadruped robot, a challenging contact-rich, multibody system. To the best of our knowledge, this is the first framework that effectively bridges the gap between constrained, dissipative mechanical systems and symplectic learning, unlocking a whole new class of geometric machine learning models, grounded in first principles yet adaptable from data.
[LG-3] A Multi-view Divergence-Convergence Feature Augmentation Framework for Drug-related Microbes Prediction
链接: https://arxiv.org/abs/2506.18797
作者: Xin An,Ruijie Li,Qiao Ning,Shikai Guo,Hui Li,Qian Ma
类目: Machine Learning (cs.LG)
*备注: 10 pages, 8 figures (including subfigures), 1 table. Xin An and Ruijie Li contributed equally to this work and should be considered co-first authors
Abstract:In the study of drug function and precision medicine, identifying new drug-microbe associations is crucial. However, current methods isolate association and similarity analysis of drug and microbe, lacking effective inter-view optimization and coordinated multi-view feature fusion. In our study, a multi-view Divergence-Convergence Feature Augmentation framework for Drug-related Microbes Prediction (DCFA_DMP) is proposed, to better learn and integrate association information and similarity information. In the divergence phase, DCFA_DMP strengthens the complementarity and diversity between heterogeneous information and similarity information by performing Adversarial Learning method between the association network view and different similarity views, optimizing the feature space. In the convergence phase, a novel Bidirectional Synergistic Attention Mechanism is proposed to deeply synergize the complementary features between different views, achieving a deep fusion of the feature space. Moreover, Transformer graph learning is alternately applied on the drug-microbe heterogeneous graph, enabling each drug or microbe node to focus on the most relevant nodes. Numerous experiments demonstrate DCFA_DMP’s significant performance in predicting drug-microbe associations. It also proves effectiveness in predicting associations for new drugs and microbes in cold start experiments, further confirming its stability and reliability in predicting potential drug-microbe associations.
[LG-4] DPG loss functions for learning parameter-to-solution maps by neural networks
链接: https://arxiv.org/abs/2506.18773
作者: Pablo Cortés Castillo,Wolfgang Dahmen,Jay Gopalakrishnan
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We develop, analyze, and experimentally explore residual-based loss functions for machine learning of parameter-to-solution maps in the context of parameter-dependent families of partial differential equations (PDEs). Our primary concern is on rigorous accuracy certification to enhance prediction capability of resulting deep neural network reduced models. This is achieved by the use of variationally correct loss functions. Through one specific example of an elliptic PDE, details for establishing the variational correctness of a loss function from an ultraweak Discontinuous Petrov Galerkin (DPG) discretization are worked out. Despite the focus on the example, the proposed concepts apply to a much wider scope of problems, namely problems for which stable DPG formulations are available. The issue of high-contrast diffusion fields and ensuing difficulties with degrading ellipticity are discussed. Both numerical results and theoretical arguments illustrate that for high-contrast diffusion parameters the proposed DPG loss functions deliver much more robust performance than simpler least-squares losses.
[LG-5] Experimenting Fast and Slow: Bayesian Optimization of Long-term Outcomes with Online Experiments
链接: https://arxiv.org/abs/2506.18744
作者: Qing Feng,Samuel Dalton,Benjamin Letham,Maximilian Balandat,Eytan Bakshy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Online experiments in internet systems, also known as A/B tests, are used for a wide range of system tuning problems, such as optimizing recommender system ranking policies and learning adaptive streaming controllers. Decision-makers generally wish to optimize for long-term treatment effects of the system changes, which often requires running experiments for a long time as short-term measurements can be misleading due to non-stationarity in treatment effects over time. The sequential experimentation strategies–which typically involve several iterations–can be prohibitively long in such cases. We describe a novel approach that combines fast experiments (e.g., biased experiments run only for a few hours or days) and/or offline proxies (e.g., off-policy evaluation) with long-running, slow experiments to perform sequential, Bayesian optimization over large action spaces in a short amount of time.
[LG-6] owards Group Fairness with Multiple Sensitive Attributes in Federated Foundation Models
链接: https://arxiv.org/abs/2506.18732
作者: Yuning Yang,Han Yu,Tianrun Gao,Xiaodong Xu,Guangyu Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The deep integration of foundation models (FM) with federated learning (FL) enhances personalization and scalability for diverse downstream tasks, making it crucial in sensitive domains like healthcare. Achieving group fairness has become an increasingly prominent issue in the era of federated foundation models (FFMs), since biases in sensitive attributes might lead to inequitable treatment for under-represented demographic groups. Existing studies mostly focus on achieving fairness with respect to a single sensitive attribute. This renders them unable to provide clear interpretability of dependencies among multiple sensitive attributes which is required to achieve group fairness. Our paper takes the first attempt towards a causal analysis of the relationship between group fairness across various sensitive attributes in the FFM. We extend the FFM structure to trade off multiple sensitive attributes simultaneously and quantify the causal effect behind the group fairness through causal discovery and inference. Extensive experiments validate its effectiveness, offering insights into interpretability towards building trustworthy and fair FFM systems.
[LG-7] PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
链接: https://arxiv.org/abs/2506.18728
作者: Steven Kolawole,Keshav Santhanam,Virginia Smith,Pratiksha Thaker
类目: Machine Learning (cs.LG)
*备注: In review
Abstract:LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism–decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
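查询内并行带来的时延收益可用一个极小的玩具例子体会(用 sleep 模拟相互独立的子任务调用;这只是延迟示意,并非论文的执行套件):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy latency illustration of serial vs. parallel execution of decomposed subtasks.
def subtask(item):
    time.sleep(0.1)            # stand-in for one independent LLM sub-call
    return f"done: {item}"

items = [f"sentence {i}" for i in range(8)]

t0 = time.time()
serial = [subtask(it) for it in items]            # ~0.8 s: one call after another
t_serial = time.time() - t0

t0 = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:   # ~0.1 s: all calls in flight
    parallel = list(pool.map(subtask, items))
t_parallel = time.time() - t0

print(f"serial: {t_serial:.2f}s, parallel: {t_parallel:.2f}s")
```

现实中还需像论文那样先用模式抽取确认子任务确实相互独立、共享上下文可复用,才能在不损伤语义保真度的前提下并行。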
[LG-8] SaGIF: Improving Individual Fairness in Graph Neural Networks via Similarity Encoding
链接: https://arxiv.org/abs/2506.18696
作者: Yuchang Zhu,Jintang Li,Huizhe Zhang,Liang Chen,Zibin Zheng
类目: Machine Learning (cs.LG)
*备注: Under review
Abstract:Individual fairness (IF) in graph neural networks (GNNs), which emphasizes that similar individuals should receive similar outcomes from GNNs, has been a critical issue. Despite its importance, this area has remained largely unexplored in terms of (1) a clear understanding of what induces individual unfairness in GNNs and (2) a comprehensive consideration of identifying similar individuals. To bridge these gaps, we conduct a preliminary analysis to explore the underlying reason for individual unfairness and observe correlations between IF and similarity consistency, a concept introduced to evaluate the discrepancy in identifying similar individuals based on graph structure versus node features. Inspired by our observations, we introduce two metrics to assess individual similarity from two distinct perspectives: topology fusion and feature fusion. Building upon these metrics, we propose Similarity-aware GNNs for Individual Fairness, named SaGIF. The key insight behind SaGIF is the integration of individual similarities by independently learning similarity representations, leading to an improvement of IF in GNNs. Our experiments on several real-world datasets validate the effectiveness of our proposed metrics and SaGIF. Specifically, SaGIF consistently outperforms state-of-the-art IF methods while maintaining utility performance. Code is available at: this https URL.
[LG-9] On Union-Closedness of Language Generation
链接: https://arxiv.org/abs/2506.18642
作者: Steve Hanneke,Amin Karbasi,Anay Mehrotra,Grigoris Velegkas
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate language generation in the limit - a model by Kleinberg and Mullainathan [NeurIPS 2024] and extended by Li, Raman, and Tewari [COLT 2025]. While Kleinberg and Mullainathan proved generation is possible for all countable collections, Li et al. defined a hierarchy of generation notions (uniform, non-uniform, and generatable) and explored their feasibility for uncountable collections. Our first set of results resolves two open questions of Li et al. by proving finite unions of generatable or non-uniformly generatable classes need not be generatable. These follow from a stronger result: there is a non-uniformly generatable class and a uniformly generatable class whose union is non-generatable. This adds to the aspects along which language generation in the limit is different from traditional tasks in statistical learning theory like classification, which are closed under finite unions. In particular, it implies that given two generators for different collections, one cannot combine them to obtain a single “more powerful” generator, prohibiting this notion of boosting. Our construction also addresses a third open question of Li et al. on whether there are uncountable classes that are non-uniformly generatable and do not satisfy the eventually unbounded closure (EUC) condition introduced by Li, Raman, and Tewari. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.
[LG-10] On Equivariant Model Selection through the Lens of Uncertainty UAI2025
链接: https://arxiv.org/abs/2506.18629
作者: Putri A. van der Linden,Alexander Timans,Dharmesh Tailor,Erik J. Bekkers
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages, 4 figures, 2 tables. In the 8th Workshop on Tractable Probabilistic Modeling at UAI 2025
Abstract:Equivariant models leverage prior knowledge on symmetries to improve predictive performance, but misspecified architectural constraints can harm it instead. While work has explored learning or relaxing constraints, selecting among pretrained models with varying symmetry biases remains challenging. We examine this model selection task from an uncertainty-aware perspective, comparing frequentist (via Conformal Prediction), Bayesian (via the marginal likelihood), and calibration-based measures to naive error-based evaluation. We find that uncertainty metrics generally align with predictive performance, but Bayesian model evidence does so inconsistently. We attribute this to a mismatch in Bayesian and geometric notions of model complexity, and discuss possible remedies. Our findings point towards the potential of uncertainty in guiding symmetry-aware model selection.
[LG-11] Prédiction optimale pour un modèle ordinal à covariables fonctionnelles
链接: https://arxiv.org/abs/2506.18615
作者: Simón Weinberger(ERIC),Jairo Cugliari(ERIC),Aurélie Le Cain
类目: Machine Learning (cs.LG)
*备注: in French language, Journées de statistiques, Société Française des Statistiques, Jul 2023, Bruxelles - Université Libre de Bruxelles (ULB), Belgique
Abstract:We present a prediction framework for ordinal models: we introduce optimal predictions using loss functions and give the explicit form of the Least-Absolute-Deviation prediction for these models. Then, we reformulate an ordinal model with functional covariates into a classic ordinal model with multiple scalar covariates. We illustrate all the proposed methods and apply them to a dataset collected by EssilorLuxottica for the development of a control algorithm for the shade of connected glasses.
[LG-12] Policy gradient methods for ordinal policies
链接: https://arxiv.org/abs/2506.18614
作者: Simón Weinberger(ERIC),Jairo Cugliari(ERIC)
类目: Machine Learning (cs.LG)
*备注: in French language, Journées de statistiques 2025, Société Française des Statistiques, Jun 2023, Marseille, France
Abstract:In reinforcement learning, the softmax parametrization is the standard approach for policies over discrete action spaces. However, it fails to capture the order relationship between actions. Motivated by a real-world industrial problem, we propose a novel policy parametrization based on ordinal regression models adapted to the reinforcement learning setting. Our approach addresses practical challenges, and numerical experiments demonstrate its effectiveness in real applications and in continuous action tasks, where discretizing the action space and applying the ordinal policy yields competitive performance.
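摘要未给出具体的参数化形式;下面按经典的比例优势(cumulative-logit)序回归形式给出一个假设性示意:P(a <= k) = sigmoid(c_k - eta),动作概率取相邻差分,从而天然编码动作间的次序关系:

```python
import numpy as np

# Schematic ordinal (proportional-odds) policy head; the paper's exact
# parametrization may differ, this is the standard cumulative-logit form.
def ordinal_policy(eta, cutpoints):
    # K = len(cutpoints) + 1 ordered actions: P(a <= k) = sigmoid(c_k - eta),
    # with increasing cutpoints c_1 < ... < c_{K-1}; probs are differences.
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    cdf = np.concatenate(([0.0], sigmoid(np.asarray(cutpoints) - eta), [1.0]))
    return np.diff(cdf)

probs = ordinal_policy(eta=0.3, cutpoints=[-1.0, 0.0, 1.0])
print(probs, probs.sum())   # 4 ordered action probabilities, summing to 1
```

与 softmax 不同,这里只需一个标量 eta(可由状态网络输出)便能整体“推高或压低”所有动作的次序分布,相邻动作的概率天然相关,这正是序数结构带来的归纳偏置。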
[LG-13] Transformer World Model for Sample Efficient Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2506.18537
作者: Azad Deihim,Eduardo Alonso,Dimitra Apostolopoulou
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:We present the Multi-Agent Transformer World Model (MATWM), a novel transformer-based world model designed for multi-agent reinforcement learning in both vector- and image-based environments. MATWM combines a decentralized imagination framework with a semi-centralized critic and a teammate prediction module, enabling agents to model and anticipate the behavior of others under partial observability. To address non-stationarity, we incorporate a prioritized replay mechanism that trains the world model on recent experiences, allowing it to adapt to agents’ evolving policies. We evaluated MATWM on a broad suite of benchmarks, including the StarCraft Multi-Agent Challenge, PettingZoo, and MeltingPot. MATWM achieves state-of-the-art performance, outperforming both model-free and prior world model approaches, while demonstrating strong sample efficiency, achieving near-optimal performance in as few as 50K environment interactions. Ablation studies confirm the impact of each component, with substantial gains in coordination-heavy tasks.
[LG-14] Federated Learning from Molecules to Processes: A Perspective
链接: https://arxiv.org/abs/2506.18525
作者: Jan G. Rittig,Clemens Kortmann
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:We present a perspective on federated learning in chemical engineering that envisions collaborative efforts in machine learning (ML) developments within the chemical industry. Large amounts of chemical and process data are proprietary to chemical companies and are therefore locked in data silos, hindering the training of ML models on large data sets in chemical engineering. Recently, the concept of federated learning has gained increasing attention in ML research, enabling organizations to jointly train machine learning models without disclosure of their individual data. We discuss potential applications of federated learning in several fields of chemical engineering, from the molecular to the process scale. In addition, we apply federated learning in two exemplary case studies that simulate practical scenarios of multiple chemical companies holding proprietary data sets: (i) prediction of binary mixture activity coefficients with graph neural networks and (ii) system identification of a distillation column with autoencoders. Our results indicate that ML models jointly trained with federated learning yield significantly higher accuracy than models trained by each chemical company individually and can perform similarly to models trained on combined datasets from all companies. Federated learning has therefore great potential to advance ML models in chemical engineering while respecting corporate data privacy, making it promising for future industrial applications.
[LG-15] DDOT: A Derivative-directed Dual-decoder Ordinary Differential Equation Transformer for Dynamic System Modeling
链接: https://arxiv.org/abs/2506.18522
作者: Yang Chang,Kuang-Da Wang,Ping-Chun Hsieh,Cheng-Kuan Lin,Wen-Chih Peng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Uncovering the underlying ordinary differential equations (ODEs) that govern dynamic systems is crucial for advancing our understanding of complex phenomena. Traditional symbolic regression methods often struggle to capture the temporal dynamics and intervariable correlations inherent in ODEs. ODEFormer, a state-of-the-art method for inferring multidimensional ODEs from single trajectories, has made notable progress. However, its focus on single-trajectory evaluation is highly sensitive to initial starting points, which may not fully reflect true performance. To address this, we propose the divergence difference metric (DIV-diff), which evaluates divergence over a grid of points within the target region, offering a comprehensive and stable analysis of the variable space. Alongside, we introduce DDOT (Derivative-Directed Dual-Decoder Ordinary Differential Equation Transformer), a transformer-based model designed to reconstruct multidimensional ODEs in symbolic form. By incorporating an auxiliary task predicting the ODE’s derivative, DDOT effectively captures both structure and dynamic behavior. Experiments on ODEBench show DDOT outperforms existing symbolic regression methods, achieving an absolute improvement of 4.58% and 1.62% in P(R^2 > 0.9) for reconstruction and generalization tasks, respectively, and an absolute reduction of 3.55% in DIV-diff. Furthermore, DDOT demonstrates real-world applicability on an anesthesia dataset, highlighting its practical impact.
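DIV-diff 在目标区域的网格点上评估向量场差异,其精确定义未在摘要中给出;下面给出一种直观读法的假设性示意:

```python
import numpy as np

# Schematic, hypothetical reading of a grid-based divergence-difference metric.
def div_diff(f_true, f_pred, grid):
    # Average vector-field discrepancy over a grid of points in the target
    # region, instead of along a single (initial-condition-sensitive) trajectory.
    return float(np.mean([np.linalg.norm(f_true(x) - f_pred(x)) for x in grid]))

f_true = lambda x: np.array([x[1], -np.sin(x[0])])   # pendulum dynamics
f_pred = lambda x: np.array([x[1], -x[0]])           # small-angle approximation
grid = [np.array([a, b])
        for a in np.linspace(-1.0, 1.0, 11)
        for b in np.linspace(-1.0, 1.0, 11)]
print(div_diff(f_true, f_pred, grid))
```

相比单条轨迹的拟合误差,这种网格化评估不依赖初值选取,能更稳定地刻画符号回归结果在整个变量空间上的好坏。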
[LG-16] AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
链接: https://arxiv.org/abs/2506.18495
作者: Aniss Bessalah,Hatem Mohamed Abdelmoumen,Karima Benatchba,Hadjer Benmeziane
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm for accelerating Deep Neural Networks (DNNs), offering significant energy and latency benefits over conventional digital hardware. However, state-of-the-art neural networks are not inherently designed for AIMC, as they fail to account for its unique non-idealities. Neural Architecture Search (NAS) is thus needed to systematically discover neural architectures optimized explicitly for AIMC constraints. However, comparing NAS methodologies and extracting insights about robust architectures for AIMC requires a dedicated NAS benchmark that explicitly accounts for AIMC-specific hardware non-idealities. To address this, we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for AIMC. Our study reveals three key insights: (1) standard quantization techniques fail to capture AIMC-specific noises, (2) robust architectures tend to feature wider and branched blocks, (3) skip connections improve resilience to temporal drift noise. These insights highlight the limitations of current NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the implementations used in this paper can be found at this https URL.
[LG-17] Reliability-Adjusted Prioritized Experience Replay
链接: https://arxiv.org/abs/2506.18482
作者: Leonard S. Pleiss,Tobias Sutter,Maximilian Schiffer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-5 benchmark.
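To make the idea concrete, the sketch below shows vanilla prioritized replay with a placeholder reliability weight multiplied into the priority. The actual reliability measure of the TD error is the paper's contribution and is not specified in this abstract; the variance-based proxy here is purely an assumption.

```python
import numpy as np

class ReliabilityAdjustedPER:
    """Sketch of PER with a reliability weight on the TD-error priority.
    The placeholder down-weights transitions whose TD error has been
    fluctuating, as one plausible notion of (un)reliability."""

    def __init__(self, capacity, alpha=0.6):
        self.alpha = alpha
        self.capacity = capacity
        self.buffer, self.td, self.td_var = [], [], []

    def add(self, transition, td_error):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0); self.td.pop(0); self.td_var.pop(0)
        self.buffer.append(transition)
        self.td.append(abs(td_error))
        self.td_var.append(1.0)  # start fully uncertain

    def sample(self, batch_size, rng=np.random.default_rng()):
        td = np.array(self.td)
        reliability = 1.0 / (1.0 + np.array(self.td_var))  # placeholder measure
        prio = (td * reliability) ** self.alpha
        p = prio / prio.sum()
        idx = rng.choice(len(self.buffer), size=batch_size, p=p)
        return [self.buffer[i] for i in idx], idx

    def update(self, idx, new_td):
        # Track how much each transition's TD error jumps between visits
        for i, e in zip(idx, new_td):
            self.td_var[i] = 0.9 * self.td_var[i] + 0.1 * (abs(e) - self.td[i]) ** 2
            self.td[i] = abs(e)

buf = ReliabilityAdjustedPER(capacity=1000)
for i in range(100):
    buf.add(("s", "a", 0.0, "s2"), td_error=np.random.randn())
batch, idx = buf.sample(8)
```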
[LG-18] FREQuency ATTribution: Benchmarking Frequency-based Occlusion for Time Series Data
链接: https://arxiv.org/abs/2506.18481
作者: Dominique Mercier,Andreas Dengel,Sheraz Ahmed
类目: Machine Learning (cs.LG)
*备注: 18 pages, 12 figures, 2 tables
Abstract:Deep neural networks are among the most successful algorithms in terms of performance and scalability in different domains. However, since these networks are black boxes, their usability is severely restricted due to the lack of interpretability. Existing interpretability methods do not address the analysis of time-series-based networks specifically enough. This paper shows that an analysis in the frequency domain not only highlights relevant areas in the input signal better than existing methods, but is also more robust to fluctuations in the signal. In this paper, FreqATT is presented, a framework that enables post-hoc interpretation of networks for time series analysis. To achieve this, the relevant frequencies are evaluated and the signal is either filtered or the relevant input data is marked.
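The sketch below illustrates frequency-based occlusion in its simplest form: zero out one Fourier band at a time and measure how much the model output changes. FreqATT's actual scoring and filtering steps may differ; the toy "model" is a hypothetical stand-in.

```python
import numpy as np

def frequency_occlusion(model, x, n_bands=8):
    """Attribute model output to frequency bands of a time series by
    occluding (zeroing) each band in the Fourier domain and measuring
    the change in the prediction."""
    X = np.fft.rfft(x)
    base = model(x)
    edges = np.linspace(0, len(X), n_bands + 1, dtype=int)
    scores = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        Xo = X.copy()
        Xo[lo:hi] = 0                          # occlude one frequency band
        x_occluded = np.fft.irfft(Xo, n=len(x))
        scores.append(abs(model(x_occluded) - base))
    return np.array(scores)                    # higher = band more relevant

# Toy signal: a 5 Hz component plus a 40 Hz component, 1 s at 128 Hz
t = np.linspace(0, 1, 128, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)
model = lambda s: float(np.abs(np.fft.rfft(s))[5])  # "detects" 5 Hz power
print(np.argmax(frequency_occlusion(model, x)))     # band containing 5 Hz
```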
[LG-19] A Motivational Architecture for Open-Ended Learning Challenges in Robots
链接: https://arxiv.org/abs/2506.18454
作者: Alejandro Romero,Gianluca Baldassarre,Richard J. Duro,Vieri Giuliano Santucci
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to RLDM 2025
Abstract:Developing agents capable of autonomously interacting with complex and dynamic environments, where task structures may change over time and prior knowledge cannot be relied upon, is a key prerequisite for deploying artificial systems in real-world settings. The open-ended learning framework identifies the core challenges for creating such agents, including the ability to autonomously generate new goals, acquire the necessary skills (or curricula of skills) to achieve them, and adapt to non-stationary environments. While many existing works tackle various aspects of these challenges in isolation, few propose integrated solutions that address them simultaneously. In this paper, we introduce H-GRAIL, a hierarchical architecture that, through the use of different typologies of intrinsic motivations and interconnected learning mechanisms, autonomously discovers new goals, learns the required skills for their achievement, generates skill sequences for tackling interdependent tasks, and adapts to non-stationary environments. We tested H-GRAIL in a real robotic scenario, demonstrating how the proposed solutions effectively address the various challenges of open-ended learning.
[LG-20] New Hardness Results for Low-Rank Matrix Completion
链接: https://arxiv.org/abs/2506.18440
作者: Dror Chawin,Ishay Haviv
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 27 pages
Abstract:The low-rank matrix completion problem asks whether a given real matrix with missing values can be completed so that the resulting matrix has low rank or is close to a low-rank matrix. The completed matrix is often required to satisfy additional structural constraints, such as positive semi-definiteness or a bounded infinity norm. The problem arises in various research fields, including machine learning, statistics, and theoretical computer science, and has broad real-world applications. This paper presents new \mathsf{NP}-hardness results for low-rank matrix completion problems. We show that for every sufficiently large integer d and any real number \varepsilon \in [2^{-O(d)}, \frac{1}{7}], given a partial matrix A with exposed values of magnitude at most 1 that admits a positive semi-definite completion of rank d, it is \mathsf{NP}-hard to find a positive semi-definite matrix that agrees with each given value of A up to an additive error of at most \varepsilon, even when the rank is allowed to exceed d by a multiplicative factor of O\big(\frac{1}{\varepsilon^2} \cdot \log(1/\varepsilon)\big). This strengthens a result of Hardt, Meka, Raghavendra, and Weitz (COLT, 2014), which applies to multiplicative factors smaller than 2 and to \varepsilon that decays polynomially in d. We establish similar \mathsf{NP}-hardness results for the case where the completed matrix is constrained to have a bounded infinity norm (rather than be positive semi-definite), for which all previous hardness results rely on complexity assumptions related to the Unique Games Conjecture. Our proofs involve a novel notion of nearly orthonormal representations of graphs, the concept of line digraphs, and bounds on the rank of perturbed identity matrices.
[LG-21] Dynamic Hybrid Modeling: Incremental Identification and Model Predictive Control
链接: https://arxiv.org/abs/2506.18344
作者: Adrian Caspari,Thomas Bierweiler,Sarah Fadda,Daniel Labisch,Maarten Nauta,Franzisko Wagner,Merle Warmbold,Constantinos C. Pantelides
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 18 pages, 10 Figures
Abstract:Mathematical models are crucial for optimizing and controlling chemical processes, yet they often face significant limitations in terms of computational time, algorithm complexity, and development costs. Hybrid models, which combine mechanistic models with data-driven models (i.e. models derived via the application of machine learning to experimental data), have emerged as a promising solution to these challenges. However, the identification of dynamic hybrid models remains difficult due to the need to integrate data-driven models within mechanistic model structures. We present an incremental identification approach for dynamic hybrid models that decouples the mechanistic and data-driven components to overcome computational and conceptual difficulties. Our methodology comprises four key steps: (1) regularized dynamic parameter estimation to determine optimal time profiles for flux variables, (2) correlation analysis to evaluate relationships between variables, (3) data-driven model identification using advanced machine learning techniques, and (4) hybrid model integration to combine the mechanistic and data-driven components. This approach facilitates early evaluation of model structure suitability, accelerates the development of hybrid models, and allows for independent identification of data-driven components. Three case studies are presented to illustrate the robustness, reliability, and efficiency of our incremental approach in handling complex systems and scenarios with limited data.
[LG-22] Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction
链接: https://arxiv.org/abs/2506.18290
作者: Han Zhang,Jinghong Mao,Shangwen Zhu,Zhantao Yang,Lianghua Huang,Yu Liu,Deli Zhao,Ruili Feng,Fan Cheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion reconstruction plays a critical role in various applications such as image editing, restoration, and style transfer. In theory, the reconstruction should be simple: it just inverts and regenerates images by numerically solving the Probability Flow-Ordinary Differential Equation (PF-ODE). Yet in practice, noticeable reconstruction errors have been observed, which cannot be well explained by numerical errors. In this work, we identify a deeper intrinsic property in the PF-ODE generation process, the instability, that can further amplify the reconstruction errors. The root of this instability lies in the sparsity inherent in the generation distribution, which means that the probability is concentrated on scattered and small regions while the vast majority remains almost empty. To demonstrate the existence of instability and its amplification on reconstruction error, we conduct experiments on both toy numerical examples and popular open-sourced diffusion models. Furthermore, based on the characteristics of image data, we theoretically prove that the instability's probability converges to one as the data dimensionality increases. Our findings highlight the inherent challenges in diffusion-based reconstruction and can offer insights for future improvements.
[LG-23] Learning High-Quality Latent Representations for Anomaly Detection and Signal Integrity Enhancement in High-Speed Signals
链接: https://arxiv.org/abs/2506.18288
作者: Muhammad Usama,Hee-Deok Jang,Soham Shanbhag,Yoo-Chang Sung,Seung-Jun Bae,Dong Eui Chang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses the dual challenge of improving anomaly detection and signal integrity in high-speed dynamic random access memory signals. To achieve this, we propose a joint training framework that integrates an autoencoder with a classifier to learn more distinctive latent representations by focusing on valid data features. Our approach is evaluated across three anomaly detection algorithms and consistently outperforms two baseline methods. Detailed ablation studies further support these findings. Furthermore, we introduce a signal integrity enhancement algorithm that improves signal integrity by an average of 11.3%. The source code and data used in this study are available at this https URL.
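A minimal PyTorch sketch of such a joint objective: a classifier head on the latent code is trained alongside the reconstruction loss, so the autoencoder is pulled toward valid, class-discriminative features. Layer sizes and the 0.1 loss weight are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class JointAEClassifier(nn.Module):
    def __init__(self, d_in=256, d_latent=32, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                 nn.Linear(64, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                 nn.Linear(64, d_in))
        self.clf = nn.Linear(d_latent, n_classes)

    def forward(self, x):
        z = self.enc(x)               # shared latent representation
        return self.dec(z), self.clf(z)

model = JointAEClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(16, 256)
y = torch.randint(0, 2, (16,))

opt.zero_grad()
recon, logits = model(x)
# Joint loss: reconstruct the signal AND keep the latent code discriminative
loss = nn.functional.mse_loss(recon, x) \
     + 0.1 * nn.functional.cross_entropy(logits, y)
loss.backward()
opt.step()
```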
[LG-24] Leveraging Large Language Models for Information Verification – an Engineering Approach
链接: https://arxiv.org/abs/2506.18274
作者: Nguyen Nang Hung,Nguyen Thanh Trong,Vuong Thanh Toan,Nguyen An Phuoc,Dao Minh Tu,Nguyen Manh Duc Tuan,Nguyen Dinh Mau
类目: Machine Learning (cs.LG)
*备注:
Abstract:For the ACMMM25 challenge, we present a practical engineering approach to multimedia news source verification, utilizing Large Language Models (LLMs) like GPT-4o as the backbone of our pipeline. Our method processes images and videos through a streamlined sequence of steps: First, we generate metadata using general-purpose queries via Google tools, capturing relevant content and links. Multimedia data is then segmented, cleaned, and converted into frames, from which we select the top-K most informative frames. These frames are cross-referenced with metadata to identify consensus or discrepancies. Additionally, audio transcripts are extracted for further verification. Notably, the entire pipeline is automated using GPT-4o through prompt engineering, with human intervention limited to final validation.
[LG-25] Memory-Augmented Architecture for Long-Term Context Handling in Large Language Models
链接: https://arxiv.org/abs/2506.18271
作者: Haseeb Ullah Khan Shinwari,Muhammad Usama
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models face significant challenges in maintaining coherent interactions over extended dialogues due to their limited contextual memory. This limitation often leads to fragmented exchanges and reduced relevance in responses, diminishing user experience. To address these issues, we propose a memory-augmented architecture that dynamically retrieves, updates, and prunes relevant information from past interactions, ensuring effective long-term context handling. Experimental results demonstrate that our solution significantly improves contextual coherence, reduces memory overhead, and enhances response quality, showcasing its potential for real-time applications in interactive systems.
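As a rough sketch of the retrieve/update/prune loop described above, assuming cosine similarity for retrieval and a salience counter for pruning (both are assumptions, not the paper's specification; a real system would use a sentence encoder for the embeddings):

```python
import numpy as np

class DialogueMemory:
    """Sketch of a retrieve/update/prune memory for long dialogues."""

    def __init__(self, max_items=100):
        self.items = []              # each item: [embedding, text, salience]
        self.max_items = max_items

    def add(self, emb, text):
        self.items.append([emb / np.linalg.norm(emb), text, 1.0])
        self.prune()

    def retrieve(self, query_emb, k=3):
        q = query_emb / np.linalg.norm(query_emb)
        scored = sorted(self.items, key=lambda it: -float(it[0] @ q))[:k]
        for it in scored:
            it[2] += 1.0             # retrieved memories gain salience
        return [it[1] for it in scored]

    def prune(self):
        if len(self.items) > self.max_items:
            # Drop the least salient memories to bound the context budget
            self.items.sort(key=lambda it: -it[2])
            self.items = self.items[: self.max_items]

rng = np.random.default_rng(0)
mem = DialogueMemory(max_items=4)
for i in range(6):
    mem.add(rng.normal(size=8), f"turn {i}")
print(mem.retrieve(rng.normal(size=8), k=2))
```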
[LG-26] Ground tracking for improved landmine detection in a GPR system
链接: https://arxiv.org/abs/2506.18258
作者: Li Tang,Peter A. Torrione,Cihat Eldeniz,Leslie M. Collins
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ground penetrating radar (GPR) provides a promising technology for accurate subsurface object detection. In particular, it has shown promise for detecting landmines with low metal content. However, the ground bounce (GB) that is present in GPR data, which is caused by the dielectric discontinuity between soil and air, is a major source of interference and degrades landmine detection performance. To mitigate this interference, GB tracking algorithms formulated using both a Kalman filter (KF) and a particle filter (PF) framework are proposed. In particular, the location of the GB in the radar signal is modeled as the hidden state in a stochastic system for the PF approach. The observations are the 2D radar images, which arrive scan by scan along the down-track direction. An initial training stage sets parameters automatically to accommodate different ground and weather conditions. The features associated with the GB description are updated adaptively with the arrival of new data. The prior distribution for a given location is predicted by propagating information from two adjacent channels/scans, which ensures that the overall GB surface remains smooth. The proposed algorithms are verified in experiments utilizing real data, and their performances are compared with other GB tracking approaches. We demonstrate that improved GB tracking contributes to improved performance for the landmine detection problem.
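A simplified sketch of the Kalman-filter variant: track the ground-bounce depth index as a near-constant (random-walk) state, observing the strongest return in each incoming A-scan. The noise parameters and the peak-picking observation model are illustrative assumptions; the paper's features and adaptation stage are richer.

```python
import numpy as np

def track_ground_bounce(scans, q=1.0, r=4.0):
    """1D Kalman filter over the ground-bounce depth index, scan by scan."""
    x = float(np.argmax(np.abs(scans[0])))    # initial GB location
    P = 10.0                                  # initial state variance
    track = []
    for scan in scans:
        P += q                                # predict: random-walk model
        z = float(np.argmax(np.abs(scan)))    # measure: peak response
        K = P / (P + r)                       # Kalman gain
        x += K * (z - x)                      # update state estimate
        P *= (1 - K)
        track.append(x)
    return track

# Toy radargram: ground bounce drifting slowly across 50 scans, 64 depth bins
rng = np.random.default_rng(1)
scans = np.zeros((50, 64))
for i in range(50):
    scans[i, 20 + i // 10] = 5.0              # slowly drifting ground bounce
    scans[i] += 0.5 * rng.normal(size=64)     # clutter
print(np.round(track_ground_bounce(scans)[-5:], 1))  # hovers near bin 24
```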
[LG-27] Exploring Efficient Quantification of Modeling Uncertainties with Differentiable Physics-Informed Machine Learning Architectures
链接: https://arxiv.org/abs/2506.18247
作者: Manaswin Oddiraju,Bharath Varma Penumatsa,Divyang Amin,Michael Piedmonte,Souma Chowdhury
类目: Machine Learning (cs.LG)
*备注: IDETC 2025
Abstract:Quantifying and propagating modeling uncertainties is crucial for reliability analysis, robust optimization, and other model-based algorithmic processes in engineering design and control. Now, physics-informed machine learning (PIML) methods have emerged in recent years as a new alternative to traditional computational modeling and surrogate modeling methods, offering a balance between computing efficiency, modeling accuracy, and interpretability. However, their ability to predict and propagate modeling uncertainties remains mostly unexplored. In this paper, a promising class of auto-differentiable hybrid PIML architectures that combine partial physics and neural networks or ANNs (for input transformation or adaptive parameter estimation) is integrated with Bayesian Neural Networks (replacing the ANNs), with the goal of exploring whether BNNs can successfully provide uncertainty propagation capabilities in the PIML architectures as well, further supported by the auto-differentiability of these architectures. A two-stage training process is used to alleviate the challenges traditionally encountered in training probabilistic ML models. The resulting BNN-integrated PIML architecture is evaluated on an analytical benchmark problem and flight experiments data for a fixed-wing RC aircraft, with prediction performance observed to be slightly worse or at par with purely data-driven ML and original PIML models. Moreover, Monte Carlo sampling of probabilistic BNN weights was found to be most effective in propagating uncertainty in the BNN-integrated PIML architectures.
[LG-28] Dual-Forward Path Teacher Knowledge Distillation: Bridging the Capacity Gap Between Teacher and Student
链接: https://arxiv.org/abs/2506.18244
作者: Tong Li,Long Liu,Yihang Hu,Hu Chen,Shifeng Chen
类目: Machine Learning (cs.LG)
*备注: 15pages
Abstract:Knowledge distillation (KD) provides an effective way to improve the performance of a student network under the guidance of pre-trained teachers. However, this approach usually brings in a large capacity gap between teacher and student networks, limiting the distillation gains. Previous methods addressing this problem either discard accurate knowledge representation or fail to dynamically adjust the transferred knowledge, which is less effective in addressing the capacity gap problem and hinders students from achieving comparable performance with the pre-trained teacher. In this work, we extend the ideology of prompt-based learning to address the capacity gap problem, and propose Dual-Forward Path Teacher Knowledge Distillation (DFPT-KD), which replaces the pre-trained teacher with a novel dual-forward path teacher to supervise the learning of student. The key to DFPT-KD is prompt-based tuning, i.e., establishing an additional prompt-based forward path within the pre-trained teacher and optimizing it with the pre-trained teacher frozen to make the transferred knowledge compatible with the representation ability of the student. Extensive experiments demonstrate that DFPT-KD leads to trained students performing better than the vanilla KD. To make the transferred knowledge better compatible with the representation abilities of the student, we further fine-tune the whole prompt-based forward path, yielding a novel distillation approach dubbed DFPT-KD+. By extensive experiments, it is shown that DFPT-KD+ improves upon DFPT-KD and achieves state-of-the-art accuracy performance.
[LG-29] Joint Embedding Predictive Architecture for self-supervised pretraining on polymer molecular graphs
链接: https://arxiv.org/abs/2506.18194
作者: Francesco Picolli,Gabriel Vogel,Jana M. Weber
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in machine learning (ML) have shown promise in accelerating the discovery of polymers with desired properties by aiding in tasks such as virtual screening via property prediction. However, progress in polymer ML is hampered by the scarcity of high-quality labeled datasets, which are necessary for training supervised ML models. In this work, we study the use of the very recent ‘Joint Embedding Predictive Architecture’ (JEPA), a type of architecture for self-supervised learning (SSL), on polymer molecular graphs to understand whether pretraining with the proposed SSL strategy improves downstream performance when labeled data is scarce. Our results indicate that JEPA-based self-supervised pretraining on polymer graphs enhances downstream performance, particularly when labeled data is very scarce, achieving improvements across all tested datasets.
[LG-30] Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels
链接: https://arxiv.org/abs/2506.18186
作者: Md Kamran Chowdhury Shisher,Vishrant Tripathi,Mung Chiang,Christopher G. Brinton
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We consider optimal resource allocation for restless multi-armed bandits (RMABs) in unknown, non-stationary settings. RMABs are PSPACE-hard to solve optimally, even when all parameters are known. The Whittle index policy is known to achieve asymptotic optimality for a large class of such problems, while remaining computationally efficient. In many practical settings, however, the transition kernels required to compute the Whittle index are unknown and non-stationary. In this work, we propose an online learning algorithm for Whittle indices in this setting. Our algorithm first predicts current transition kernels by solving a linear optimization problem based on upper confidence bounds and empirical transition probabilities calculated from data over a sliding window. Then, it computes the Whittle index associated with the predicted transition kernels. We design these sliding windows and upper confidence bounds to guarantee sub-linear dynamic regret in the number of episodes T, under the condition that transition kernels change slowly over time (rate upper bounded by \epsilon = 1/T^k with k > 0). Furthermore, our proposed algorithm and regret analysis are designed to exploit prior domain knowledge and structural information of the RMABs to accelerate the learning process. Numerical results validate that our algorithm achieves superior performance in terms of lowest cumulative regret relative to baselines in non-stationary environments.
[LG-31] Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba
链接: https://arxiv.org/abs/2506.18184
作者: Donghyun Lee,Yuhang Li,Ruokai Yin,Shiting Xiao,Priyadarshini Panda
类目: Machine Learning (cs.LG)
*备注:
Abstract:State Space Models (SSMs) have emerged as powerful alternatives to attention-based Transformers, with Mamba demonstrating impressive efficiency and scalability. As these models grow increasingly larger, the need for Parameter-Efficient Fine-Tuning (PEFT) methods becomes critical to adapt pre-trained Mamba to downstream tasks without prohibitive computational costs. However, previous approaches simply apply traditional Transformer-tailored PEFT methods without addressing the unique temporal processing dynamics of SSMs. To address this limitation, we propose Memba, a membrane-driven PEFT approach specifically designed for Mamba. Memba introduces Leaky Integrate Membrane (LIM) neurons as bio-inspired gating mechanisms that naturally accumulate membrane potentials over time, enhancing selective information retention. By strategically combining LIM neurons with Low-Rank Adaptations (LoRA) and cross-layer membrane transfer, our approach significantly improves Mamba’s temporal modeling capabilities. Extensive experiments across language and vision tasks demonstrate that Memba achieves substantial improvements over existing PEFT methods. The code is available at this https URL.
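A minimal sketch of what a Leaky Integrate Membrane gate might look like in isolation: a membrane potential accumulates leakily over the sequence and gates the input through a sigmoid, so information is retained selectively over time. This omits the LoRA coupling and cross-layer membrane transfer that Memba actually combines it with; the decay constant is an assumption.

```python
import torch

def lim_gate(x, decay=0.9):
    """Sketch of a Leaky Integrate Membrane gate over a sequence.
    x: (batch, time, features)."""
    B, T, D = x.shape
    membrane = torch.zeros(B, D)
    out = []
    for t in range(T):
        membrane = decay * membrane + x[:, t]          # leaky integration
        out.append(torch.sigmoid(membrane) * x[:, t])  # membrane-driven gate
    return torch.stack(out, dim=1)

x = torch.randn(2, 5, 8)       # (batch, time, features)
print(lim_gate(x).shape)       # torch.Size([2, 5, 8])
```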
[LG-32] Probabilistic and reinforced mining of association rules
链接: https://arxiv.org/abs/2506.18155
作者: Yongchao Huang
类目: Machine Learning (cs.LG)
*备注: 205 pages
Abstract:This work introduces 4 novel probabilistic and reinforcement-driven methods for association rule mining (ARM): Gaussian process-based association rule mining (GPAR), Bayesian ARM (BARM), multi-armed bandit based ARM (MAB-ARM), and reinforcement learning based association rule mining (RLAR). These methods depart fundamentally from traditional frequency-based algorithms such as Apriori, FP-Growth, and Eclat, offering enhanced capabilities for incorporating prior knowledge, modeling uncertainty, item dependencies, probabilistic inference and adaptive search strategies. GPAR employs Gaussian processes to model item co-occurrence via feature representations, enabling principled inference, uncertainty quantification, and efficient generalization to unseen itemsets without retraining. BARM adopts a Bayesian framework with priors and optional correlation structures, yielding robust uncertainty quantification through full posterior distributions over item presence probabilities. MAB-ARM, including its Monte Carlo tree search (MCTS) companion, utilizes an upper confidence bound (UCB) strategy for efficient and adaptive exploration of the itemset space, while RLAR applies a deep Q-network (DQN) to learn a generalizable policy for identifying high-quality rules. Collectively, these approaches improve the flexibility and robustness of ARM, particularly for discovering rare or complex patterns and operating on small datasets. Empirical results on synthetic and real-world datasets demonstrate their effectiveness, while also highlighting trade-offs in computational complexity and interpretability. These innovations mark a significant shift from static, frequency-driven paradigms, offering some prior and dependency-informed, uncertainty-aware or scalable ARM frameworks for diverse application domains such as retail, geography, finance, medical diagnostics, and risk-sensitive scenarios.
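As one concrete reading of the MAB-ARM idea, the sketch below treats each candidate rule as a bandit arm and explores with a UCB strategy. The noisy `evaluate` oracle (e.g., rule confidence measured on a random data batch) is a hypothetical stand-in, and the paper's MCTS companion is not shown.

```python
import numpy as np

def ucb_rule_search(candidate_rules, evaluate, n_rounds=1000, c=1.4):
    """UCB-style exploration over candidate association rules."""
    n = len(candidate_rules)
    counts, means = np.zeros(n), np.zeros(n)
    for t in range(1, n_rounds + 1):
        ucb = means + c * np.sqrt(np.log(t) / np.maximum(counts, 1e-9))
        ucb[counts == 0] = np.inf          # try every rule at least once
        a = int(np.argmax(ucb))
        r = evaluate(candidate_rules[a])   # noisy quality estimate (reward)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
    return candidate_rules[int(np.argmax(means))]

rng = np.random.default_rng(2)
rules = ["A=>B", "A=>C", "B=>C"]
true_quality = {"A=>B": 0.8, "A=>C": 0.5, "B=>C": 0.6}
noisy = lambda rule: true_quality[rule] + 0.1 * rng.normal()
print(ucb_rule_search(rules, noisy))  # typically "A=>B"
```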
[LG-33] Bayesian Multiobject Tracking With Neural-Enhanced Motion and Measurement Models
链接: https://arxiv.org/abs/2506.18124
作者: Shaoxiu Wei,Mingchao Liang,Florian Meyer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:
Abstract:Multiobject tracking (MOT) is an important task in applications including autonomous driving, ocean sciences, and aerospace surveillance. Traditional MOT methods are model-based and combine sequential Bayesian estimation with data association and an object birth model. More recent methods are fully data-driven and rely on the training of neural networks. Both approaches offer distinct advantages in specific settings. In particular, model-based methods are generally applicable across a wide range of scenarios, whereas data-driven MOT achieves superior performance in scenarios where abundant labeled data for training is available. A natural thought is whether a general framework can integrate the two approaches. This paper introduces a hybrid method that utilizes neural networks to enhance specific aspects of the statistical model in Bayesian MOT that have been identified as overly simplistic. By doing so, the performance of the prediction and update steps of Bayesian MOT is improved. To ensure tractable computation, our framework uses belief propagation to avoid high-dimensional operations combined with sequential Monte Carlo methods to perform low-dimensional operations efficiently. The resulting method combines the flexibility and robustness of model-based approaches with the capability to learn complex information from data of neural networks. We evaluate the performance of the proposed method based on the nuScenes autonomous driving dataset and demonstrate that it has state-of-the-art performance.
[LG-34] RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies
链接: https://arxiv.org/abs/2506.18123
作者: Pranav Atreya,Karl Pertsch,Tony Lee,Moo Jin Kim,Arhan Jain,Artur Kuramshin,Clemens Eppner,Cyrus Neary,Edward Hu,Fabio Ramos,Jonathan Tremblay,Kanav Arora,Kirsty Ellis,Luca Macesanu,Matthew Leonard,Meedeum Cho,Ozgur Aslan,Shivin Dass,Jie Wang,Xingfang Yuan,Xuning Yang,Abhishek Gupta,Dinesh Jayaraman,Glen Berseth,Kostas Daniilidis,Roberto Martin-Martin,Youngwoon Lee,Percy Liang,Chelsea Finn,Sergey Levine
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL
Abstract:Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
[LG-35] Dynamic Temporal Positional Encodings for Early Intrusion Detection in IoT
链接: https://arxiv.org/abs/2506.18114
作者: Ioannis Panopoulos,Maria-Lamprini A. Bartsioka,Sokratis Nikolaidis,Stylianos I. Venieris,Dimitra I. Kaklamani,Iakovos S. Venieris
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at the 10th International Conference on Smart and Sustainable Technologies (SpliTech 2025)
Abstract:The rapid expansion of the Internet of Things (IoT) has introduced significant security challenges, necessitating efficient and adaptive Intrusion Detection Systems (IDS). Traditional IDS models often overlook the temporal characteristics of network traffic, limiting their effectiveness in early threat detection. We propose a Transformer-based Early Intrusion Detection System (EIDS) that incorporates dynamic temporal positional encodings to enhance detection accuracy while maintaining computational efficiency. By leveraging network flow timestamps, our approach captures both sequence structure and timing irregularities indicative of malicious behaviour. Additionally, we introduce a data augmentation pipeline to improve model robustness. Evaluated on the CICIoT2023 dataset, our method outperforms existing models in both accuracy and earliness. We further demonstrate its real-time feasibility on resource-constrained IoT devices, achieving low-latency inference and minimal memory footprint.
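One plausible realization of dynamic temporal positional encodings is to feed real flow timestamps, rather than integer positions, into the standard sinusoidal formula, so inter-arrival gaps become visible to the Transformer. The sketch below assumes this reading; the paper's exact encoding may differ.

```python
import numpy as np

def temporal_positional_encoding(timestamps, d_model=16):
    """Sinusoidal positional encoding driven by real timestamps (seconds)
    instead of integer sequence positions."""
    t = np.asarray(timestamps, dtype=float)[:, None]   # shape (L, 1)
    i = np.arange(d_model // 2)[None, :]
    freq = 1.0 / (10000.0 ** (2 * i / d_model))
    angles = t * freq
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Two flows with the same packet order but different timing get different
# encodings -- timing irregularities become detectable features.
print(temporal_positional_encoding([0.00, 0.01, 0.02])[1, :4])
print(temporal_positional_encoding([0.00, 0.50, 0.51])[1, :4])
```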
[LG-36] TAB: Unified Benchmarking of Time Series Anomaly Detection Methods VLDB2025
链接: https://arxiv.org/abs/2506.18046
作者: Xiangfei Qiu,Zhe Li,Wanghui Qiu,Shiyan Hu,Lekui Zhou,Xingjian Wu,Zhengyu Li,Chenjuan Guo,Aoying Zhou,Zhenli Sheng,Jilin Hu,Christian S. Jensen,Bin Yang
类目: Machine Learning (cs.LG)
*备注: Accepted by PVLDB2025
Abstract:Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of reliable means of evaluating new methods and comparing them with existing methods. We address deficiencies in current evaluation procedures related to datasets and experimental settings and protocols. Specifically, we propose a new time series anomaly detection benchmark, called TAB. First, TAB encompasses 29 public multivariate datasets and 1,635 univariate time series from different domains to facilitate more comprehensive evaluations on diverse datasets. Second, TAB covers a variety of TSAD methods, including Non-learning, Machine learning, Deep learning, LLM-based, and Time-series pre-trained methods. Third, TAB features a unified and automated evaluation pipeline that enables fair and easy evaluation of TSAD methods. Finally, we employ TAB to evaluate existing TSAD methods and report on the outcomes, thereby offering a deeper insight into the performance of these methods. Besides, all datasets and code are available at this https URL.
[LG-37] Why Do Some Language Models Fake Alignment While Others Don't?
链接: https://arxiv.org/abs/2506.18032
作者: Abhay Sheshadri,John Hughes,Julian Michael,Alex Mallen,Arun Jose,Janus,Fabien Roger
类目: Machine Learning (cs.LG)
*备注:
Abstract:Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus’s compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don’t fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
[LG-38] Generalization under Byzantine Poisoning Attacks: Tight Stability Bounds in Robust Distributed Learning
链接: https://arxiv.org/abs/2506.18020
作者: Thomas Boudou,Batiste Le Bars,Nirupam Gupta,Aurélien Bellet
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:Robust distributed learning algorithms aim to maintain good performance in distributed and federated settings, even in the presence of misbehaving workers. Two primary threat models have been studied: Byzantine attacks, where misbehaving workers can send arbitrarily corrupted updates, and data poisoning attacks, where misbehavior is limited to manipulation of local training data. While prior work has shown comparable optimization error under both threat models, a fundamental question remains open: How do these threat models impact generalization? Empirical evidence suggests a gap between the two threat models, yet it remains unclear whether it is fundamental or merely an artifact of suboptimal attacks. In this work, we present the first theoretical investigation into this problem, formally showing that Byzantine attacks are intrinsically more harmful to generalization than data poisoning. Specifically, we prove that: (i) under data poisoning, the uniform algorithmic stability of a robust distributed learning algorithm, with optimal optimization error, degrades by an additive factor of \varTheta\big(\frac{f}{n-f}\big), with f the number of misbehaving workers out of n; and (ii) in contrast, under Byzantine attacks, the degradation is in \mathcal{O}\big(\sqrt{\frac{f}{n-2f}}\big). This difference in stability leads to a generalization error gap that is especially significant as f approaches its maximum value \frac{n}{2}.
[LG-39] Imputation of Longitudinal Data Using GANs: Challenges and Implications for Classification
链接: https://arxiv.org/abs/2506.18007
作者: Sharon Torao Pingi,Md Abul Bashar,Richi Nayak
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 68 pages (excluding bibliography), 10 figures
Abstract:Longitudinal data is commonly utilised across various domains, such as health, biomedical, education and survey studies. This ubiquity has led to a rise in statistical, machine and deep learning-based methods for Longitudinal Data Classification (LDC). However, the intricate nature of the data, characterised by its multi-dimensionality, causes instance-level heterogeneity and temporal correlations that add to the complexity of longitudinal data analysis. Additionally, LDC accuracy is often hampered by the pervasiveness of missing values in longitudinal data. Despite ongoing research that draws on the generative power and utility of Generative Adversarial Networks (GANs) to address the missing data problem, critical considerations include statistical assumptions surrounding longitudinal data and missingness within it, as well as other data-level challenges like class imbalance and mixed data types that impact longitudinal data imputation (LDI) and the subsequent LDC process in GANs. This paper provides a comprehensive overview of how GANs have been applied in LDI, with a focus on whether GANs have adequately addressed fundamental assumptions about the data from an LDC perspective. We propose a categorisation of main approaches to GAN-based LDI, highlight strengths and limitations of methods, identify key research trends, and provide promising future directions. Our findings indicate that while GANs show great potential for LDI to improve usability and quality of longitudinal data for tasks like LDC, there is a need for more versatile approaches that can handle the wider spectrum of challenges presented by longitudinal data with missing values. By synthesising current knowledge and identifying critical research gaps, this survey aims to guide future research efforts in developing more effective GAN-based solutions to address LDC challenges.
[LG-40] Newtonian and Lagrangian Neural Networks: A Comparison Towards Efficient Inverse Dynamics Identification
链接: https://arxiv.org/abs/2506.17994
作者: Minh Trinh,Andreas René Geist,Josefine Monnet,Stefan Vilceanu,Sebastian Trimpe,Christian Brecher
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Paper accepted for publication in 14th IFAC Symposium on Robotics
Abstract:Accurate inverse dynamics models are essential tools for controlling industrial robots. Recent research combines neural network regression with inverse dynamics formulations of the Newton-Euler and the Euler-Lagrange equations of motion, resulting in so-called Newtonian neural networks and Lagrangian neural networks, respectively. These physics-informed models seek to identify unknowns in the analytical equations from data. Despite their potential, current literature lacks guidance on choosing between Lagrangian and Newtonian networks. In this study, we show that when motor torques are estimated instead of directly measuring joint torques, Lagrangian networks prove less effective compared to Newtonian networks as they do not explicitly model dissipative torques. The performance of these models is compared to neural network regression on data of a MABI MAX 100 industrial robot.
[LG-41] Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings
链接: https://arxiv.org/abs/2506.17989
作者: Lucas Mattioli,Youness Ait Hadichou,Sabrina Chaouche,Martin Gonzalez
类目: Machine Learning (cs.LG)
*备注: 37 pages. Multiple figures
Abstract:Training models on uncurated Text Embeddings (TEs) derived from raw tabular data can lead to a severe failure mode known as model collapse, where predictions converge to a single class regardless of input. By comparing models trained with identical hyper-parameter configurations on both raw tabular data and their TE-derived counterparts, we find that collapse is a consistent failure mode in the latter setting. We introduce a set of metrics that capture the extent of model collapse, offering a new perspective on TE quality as a proxy for data curation. Our results reveal that TEs alone do not effectively function as a curation layer, and that their quality significantly influences downstream learning. More insidiously, we observe that the presence of model collapse can yield artificially inflated and spurious Accuracy-on-the-Line correlation. These findings highlight the need for more nuanced curation and evaluation of embedding-based representations, particularly in out-of-distribution settings.
[LG-42] SliceGX: Layer-wise GNN Explanation with Model-slicing
链接: https://arxiv.org/abs/2506.17977
作者: Tingting Zhu,Tingyang Chen,Yinghui Wu,Arijit Khan,Xiangyu Ke
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
Abstract:Ensuring the trustworthiness of graph neural networks (GNNs) as black-box models requires effective explanation methods. Existing GNN explanations typically apply input perturbations to identify subgraphs that are responsible for the occurrence of the final output of GNNs. However, such approaches lack finer-grained, layer-wise analysis of how intermediate representations contribute to the final result, capabilities that are crucial for model diagnosis and architecture optimization. This paper introduces SliceGX, a novel GNN explanation approach that generates explanations at specific GNN layers in a progressive manner. Given a GNN M, a set of selected intermediate layers, and a target layer, SliceGX automatically segments M into layer blocks ("model slices") and discovers high-quality explanatory subgraphs in each layer block that clarify the occurrence of the output of M at the targeted layer. Although finding such layer-wise explanations is computationally challenging, we develop efficient algorithms and optimization techniques that incrementally generate and maintain these subgraphs with provable approximation guarantees. Additionally, SliceGX offers a SPARQL-like query interface, providing declarative access and search capabilities for the generated explanations. Through experiments on large real-world graphs and representative GNN architectures, we verify the effectiveness and efficiency of SliceGX, and illustrate its practical utility in supporting model debugging.
[LG-43] rustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm
链接: https://arxiv.org/abs/2506.17974
作者: Hongyang Li,Lincen Bai,Caesar Wu,Mohammed Chadli,Said Mammar,Pascal Bouvry
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose LQ-SGD (Low-Rank Quantized Stochastic Gradient Descent), a communication-efficient gradient compression algorithm designed for distributed training. LQ-SGD builds on PowerSGD by incorporating low-rank approximation and log-quantization techniques, which drastically reduce the communication overhead while preserving the convergence speed of training and model accuracy. In addition, LQ-SGD and other compression-based methods show stronger resistance to gradient inversion than traditional SGD, providing a more robust and efficient optimization path for distributed learning systems.
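A sketch of the two ingredients named in the abstract: a PowerSGD-style rank-r factorization of the gradient matrix, followed by logarithmic quantization of the factors. Error feedback and the orthogonalization schedule of PowerSGD are omitted, and the quantizer below is an assumed design, not the paper's.

```python
import numpy as np

def compress_gradient(G, rank=2, n_levels=16, rng=np.random.default_rng()):
    """Rank-r factorization (one power iteration) + log quantization."""
    # --- low-rank approximation, PowerSGD-style: G ~ P @ Q.T
    Q = rng.normal(size=(G.shape[1], rank))
    P = G @ Q                                  # (m, r)
    P, _ = np.linalg.qr(P)                     # orthogonalize the left factor
    Q = G.T @ P                                # (n, r)

    # --- log-quantize each factor: keep sign, quantize log-magnitude
    def log_quant(M):
        sign = np.sign(M)
        mag = np.abs(M) + 1e-12
        lo, hi = np.log(mag.min()), np.log(mag.max())
        step = max((hi - lo) / (n_levels - 1), 1e-12)
        q = np.round((np.log(mag) - lo) / step)
        return sign * np.exp(lo + q * step)

    return log_quant(P), log_quant(Q)          # only these factors are sent

G = np.random.default_rng(3).normal(size=(64, 32))
P, Q = compress_gradient(G)
print(np.linalg.norm(G - P @ Q.T) / np.linalg.norm(G))  # compression error
```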
[LG-44] TROJAN-GUARD: Hardware Trojans Detection Using GNN in RTL Designs
链接: https://arxiv.org/abs/2506.17894
作者: Kiran Thorat,Amit Hasan,Caiwen Ding,Zhijie Shi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Chip manufacturing is a complex process, and to achieve a faster time to market, an increasing number of untrusted third-party tools and designs from around the world are being utilized. The use of these untrusted third party intellectual properties (IPs) and tools increases the risk of adversaries inserting hardware trojans (HTs). The covert nature of HTs poses significant threats to cyberspace, potentially leading to severe consequences for national security, the economy, and personal privacy. Many graph neural network (GNN)-based HT detection methods have been proposed. However, they perform poorly on larger designs because they rely on training with smaller designs. Additionally, these methods do not explore different GNN models that are well-suited for HT detection or provide efficient training and inference processes. We propose a novel framework that generates graph embeddings for large designs (e.g., RISC-V) and incorporates various GNN models tailored for HT detection. Furthermore, our framework introduces domain-specific techniques for efficient training and inference by implementing model quantization. Model quantization reduces the precision of the weights, lowering the computational requirements, enhancing processing speed without significantly affecting detection accuracy. We evaluate our framework using a custom dataset, and our results demonstrate a precision of 98.66% and a recall (true positive rate) of 92.30%, highlighting the effectiveness and efficiency of our approach in detecting hardware trojans in large-scale chip designs.
[LG-45] Choice of Scoring Rules for Indirect Elicitation of Properties with Parametric Assumptions
链接: https://arxiv.org/abs/2506.17880
作者: Lingfang Hu,Ian A. Kash (Department of Computer Science, University of Illinois at Chicago, Chicago, USA)
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Key words: proper scoring rules, property elicitation, parametric model estimation. Paper length: 20 pages of main text + 2 pages of references + 21 pages of appendices
Abstract:People are commonly interested in predicting a statistical property of a random event such as mean and variance. Proper scoring rules assess the quality of predictions and require that the expected score gets uniquely maximized at the precise prediction, in which case we call the score directly elicits the property. Previous research work has widely studied the existence and the characterization of proper scoring rules for different properties, but little literature discusses the choice of proper scoring rules for applications at hand. In this paper, we explore a novel task, the indirect elicitation of properties with parametric assumptions, where the target property is a function of several directly-elicitable sub-properties and the total score is a weighted sum of proper scoring rules for each sub-property. Because of the restriction to a parametric model class, different settings for the weights lead to different constrained optimal solutions. Our goal is to figure out how the choice of weights affects the estimation of the target property and which choice is the best. We start it with simulation studies and observe an interesting pattern: in most cases, the optimal estimation of the target property changes monotonically with the increase of each weight, and the best configuration of weights is often to set some weights as zero. To understand how it happens, we first establish the elementary theoretical framework and then provide deeper sufficient conditions for the case of two sub-properties and of more sub-properties respectively. The theory on 2-D cases perfectly interprets the experimental results. In higher-dimensional situations, we especially study the linear cases and suggest that more complex settings can be understood with locally mapping into linear situations or using linear approximations when the true values of sub-properties are close enough to the parametric space.
[LG-46] Geometric Contact Flows: Contactomorphisms for Dynamics and Control ICML2025
链接: https://arxiv.org/abs/2506.17868
作者: Andrea Testa,Søren Hauberg,Tamim Asfour,Leonel Rozo
类目: Robotics (cs.RO); Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: Accepted at ICML 2025
Abstract:Accurately modeling and predicting complex dynamical systems, particularly those involving force exchange and dissipation, is crucial for applications ranging from fluid dynamics to robotics, but presents significant challenges due to the intricate interplay of geometric constraints and energy transfer. This paper introduces Geometric Contact Flows (GCF), a novel framework leveraging Riemannian and Contact geometry as inductive biases to learn such systems. GCF constructs a latent contact Hamiltonian model encoding desirable properties like stability or energy conservation. An ensemble of contactomorphisms then adapts this model to the target dynamics while preserving these properties. This ensemble allows for uncertainty-aware geodesics that attract the system's behavior toward the data support, enabling robust generalization and adaptation to unseen scenarios. Experiments on learning dynamics for physical systems and for controlling robots on interaction tasks demonstrate the effectiveness of our approach.
[LG-47] Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking
链接: https://arxiv.org/abs/2506.17832
作者: Pratik Kunapuli,Jake Welde,Dinesh Jayaraman,Vijay Kumar
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted for publication to RSS 2025. 10 pages, 5 figures. Project website: this https URL
Abstract:Learning-based control approaches like reinforcement learning (RL) have recently produced a slew of impressive results for tasks like quadrotor trajectory tracking and drone racing. Naturally, it is common to demonstrate the advantages of these new controllers against established methods like analytical controllers. We observe, however, that reliably comparing the performance of such very different classes of controllers is more complicated than might appear at first sight. As a case study, we take up the problem of agile tracking of an end-effector for a quadrotor with a fixed arm. We develop a set of best practices for synthesizing the best-in-class RL and geometric controllers (GC) for benchmarking. In the process, we resolve widespread RL-favoring biases in prior studies that provide asymmetric access to: (1) the task definition, in the form of an objective function, (2) representative datasets, for parameter optimization, and (3) feedforward information, describing the desired future trajectory. The resulting findings are the following: our improvements to the experimental protocol for comparing learned and classical controllers are critical, and each of the above asymmetries can yield misleading conclusions. Prior works have claimed that RL outperforms GC, but we find the gaps between the two controller classes are much smaller than previously published when accounting for symmetric comparisons. Geometric control achieves lower steady-state error than RL, while RL has better transient performance, resulting in GC performing better in relatively slow or less agile tasks, but RL performing better when greater agility is required. Finally, we open-source implementations of geometric and RL controllers for these aerial vehicles, implementing best practices for future development. Website and code are available at this https URL
[LG-48] Flatness After All?
链接: https://arxiv.org/abs/2506.17809
作者: Neta Shoham,Liron Mor-Yosef,Haim Avron
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recent literature has examined the relationship between the curvature of the loss function at minima and generalization, mainly in the context of overparameterized networks. A key observation is that “flat” minima tend to generalize better than “sharp” minima. While this idea is supported by empirical evidence, it has also been shown that deep networks can generalize even with arbitrary sharpness, as measured by either the trace or the spectral norm of the Hessian. In this paper, we argue that generalization could be assessed by measuring flatness using a soft rank measure of the Hessian. We show that when the common neural network model (neural network with exponential family negative log likelihood loss) is calibrated, and its prediction error and its confidence in the prediction are not correlated with the first and the second derivatives of the network’s output, our measure accurately captures the asymptotic expected generalization gap. For non-calibrated models, we connect our flatness measure to the well-known Takeuchi Information Criterion and show that it still provides reliable estimates of generalization gaps for models that are not overly confident. Experimental results indicate that our approach offers a robust estimate of the generalization gap compared to baselines.
[LG-49] AdRo-FL: Informed and Secure Client Selection for Federated Learning in the Presence of Adversarial Aggregator
链接: https://arxiv.org/abs/2506.17805
作者: Md. Kamrul Hossain,Walid Aljoby,Anis Elgabli,Ahmed M. Abdelmoniem,Khaled A. Harras
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 17 pages
Abstract:Federated Learning (FL) enables collaborative learning without exposing clients’ data. While clients only share model updates with the aggregator, studies reveal that aggregators can infer sensitive information from these updates. Secure Aggregation (SA) protects individual updates during transmission; however, recent work demonstrates a critical vulnerability where adversarial aggregators manipulate client selection to bypass SA protections, constituting a Biased Selection Attack (BSA). Although verifiable random selection prevents BSA, it precludes informed client selection essential for FL performance. We propose Adversarial Robust Federated Learning (AdRo-FL), which simultaneously enables: informed client selection based on client utility, and robust defense against BSA maintaining privacy-preserving aggregation. AdRo-FL implements two client selection frameworks tailored for distinct settings. The first framework assumes clients are grouped into clusters based on mutual trust, such as different branches of an organization. The second framework handles distributed clients where no trust relationships exist between them. For the cluster-oriented setting, we propose a novel defense against BSA by (1) enforcing a minimum client selection quota from each cluster, supervised by a cluster-head in every round, and (2) introducing a client utility function to prioritize efficient clients. For the distributed setting, we design a two-phase selection protocol: first, the aggregator selects the top clients based on our utility-driven ranking; then, a verifiable random function (VRF) ensures a BSA-resistant final selection. AdRo-FL also applies quantization to reduce communication overhead and sets strict transmission deadlines to improve energy efficiency. AdRo-FL achieves up to 1.85\times faster time-to-accuracy and up to 1.06\times higher final accuracy compared to insecure baselines.
[LG-50] SING: SDE Inference via Natural Gradients
链接: https://arxiv.org/abs/2506.17796
作者: Amber Hu,Henry Smith,Scott Linderman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Latent stochastic differential equation (SDE) models are important tools for the unsupervised discovery of dynamical systems from data, with applications ranging from engineering to neuroscience. In these complex domains, exact posterior inference of the latent state path is typically intractable, motivating the use of approximate methods such as variational inference (VI). However, existing VI methods for inference in latent SDEs often suffer from slow convergence and numerical instability. Here, we propose SDE Inference via Natural Gradients (SING), a method that leverages natural gradient VI to efficiently exploit the underlying geometry of the model and variational posterior. SING enables fast and reliable inference in latent SDE models by approximating intractable integrals and parallelizing computations in time. We provide theoretical guarantees that SING will approximately optimize the intractable, continuous-time objective of interest. Moreover, we demonstrate that better state inference enables more accurate estimation of nonlinear drift functions using, for example, Gaussian process SDE models. SING outperforms prior methods in state inference and drift estimation on a variety of datasets, including a challenging application to modeling neural dynamics in freely behaving animals. Altogether, our results illustrate the potential of SING as a tool for accurate inference in complex dynamical systems, especially those characterized by limited prior knowledge and non-conjugate structure.
[LG-51] PhysiX: A Foundation Model for Physics Simulations
链接: https://arxiv.org/abs/2506.17774
作者: Tung Nguyen,Arsh Koneru,Shufan Li,Aditya Grover
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures
Abstract:Foundation models have achieved remarkable success across video, image, and language domains. By scaling up the number of parameters and training datasets, these models acquire generalizable world knowledge and often surpass task-specific approaches. However, such progress has yet to extend to the domain of physics simulation. A primary bottleneck is data scarcity: while millions of images, videos, and textual resources are readily available on the internet, the largest physics simulation datasets contain only tens of thousands of samples. This data limitation hinders the use of large models, as overfitting becomes a major concern. As a result, physics applications typically rely on small models, which struggle with long-range prediction due to limited context understanding. Additionally, unlike images, videos, or text, which typically exhibit fixed granularity, physics datasets often vary drastically in scale, amplifying the challenges of scaling up multitask training. We introduce PhysiX, the first large-scale foundation model for physics simulation. PhysiX is a 4.5B parameter autoregressive generative model. It uses a discrete tokenizer to encode physical processes at different scales into a sequence of discrete tokens, and employs an autoregressive next-token prediction objective to model such processes in the token space. To mitigate the rounding error in the discretization process, PhysiX incorporates a specialized refinement module. Through extensive experiments, we show that PhysiX effectively addresses the data bottleneck, outperforming task-specific baselines under comparable settings as well as the previous absolute state-of-the-art approaches on The Well benchmark. Our results indicate that knowledge learned from natural videos can be successfully transferred to physics simulation, and that joint training across diverse simulation tasks enables synergistic learning.
[LG-52] Log-Normal Multiplicative Dynamics for Stable Low-Precision Training of Large Networks
链接: https://arxiv.org/abs/2506.17768
作者: Keigo Nishida,Eren Mehmet Kıral,Kenichi Bannai,Mohammad Emtiyaz Khan,Thomas Möllenhoff
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code is available here: this https URL
Abstract:Studies in neuroscience have shown that biological synapses follow a log-normal distribution whose transitioning can be explained by noisy multiplicative dynamics. Biological networks can function stably even under dynamically fluctuating conditions arising due to unreliable synaptic transmissions. Here we ask: Is it possible to design similar multiplicative training in artificial neural networks? To answer this question, we derive a Bayesian learning rule that assumes log-normal posterior distributions over weights which gives rise to a new Log-Normal Multiplicative Dynamics (LMD) algorithm. The algorithm uses multiplicative updates with both noise and regularization applied multiplicatively. The method is as easy to implement as Adam and only requires one additional vector to store. Our results show that LMD achieves stable and accurate training-from-scratch under low-precision forward operations for Vision Transformer and GPT-2. These results suggest that multiplicative dynamics, a biological feature, may enable stable low-precision inference and learning on future energy-efficient hardware.
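The abstract does not state the exact update rule, so the fragment below is only a plausible illustration of what a noisy multiplicative weight update with a single extra state vector could look like in PyTorch. The momentum-style accumulator, the placement of the noise, and the decay term are all assumptions of this sketch, not the published LMD algorithm (the linked repository has the real one).

```python
import torch

def lmd_like_step(w, grad, state, lr=1e-3, beta=0.9, noise_std=0.01, decay=1e-4):
    """One illustrative noisy multiplicative update (NOT the paper's exact rule).
    `state` is the single additional vector, used as a momentum-like accumulator."""
    # Gradient with respect to |w|, so the update acts on the weight magnitude.
    state.mul_(beta).add_(grad * torch.sign(w), alpha=1 - beta)
    eps = torch.randn_like(w) * noise_std  # noise applied multiplicatively (log-normal)
    # w <- w * exp(-lr*momentum - lr*decay + eps): both noise and regularization
    # act multiplicatively, and the sign of each weight is preserved.
    w.mul_(torch.exp(-lr * state - lr * decay + eps))

w, state = torch.randn(10), torch.zeros(10)
grad = torch.randn(10)  # stand-in for a loss gradient
lmd_like_step(w, grad, state)
```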
[LG-53] A Locally Differential Private Coding-Assisted Succinct Histogram Protocol
链接: https://arxiv.org/abs/2506.17767
作者: Hsuan-Po Liu,Hessam Mahdavifar
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:A succinct histogram captures frequent items and their frequencies across clients and has become increasingly important for large-scale, privacy-sensitive machine learning applications. To develop a rigorous framework to guarantee privacy for the succinct histogram problem, local differential privacy (LDP) has been utilized and shown promising results. To preserve data utility under LDP, which essentially works by intentionally adding noise to data, error-correcting codes naturally emerge as a promising tool for reliable information collection. This work presents the first practical $(\epsilon,\delta)$-LDP protocol for constructing succinct histograms using error-correcting codes. To this end, polar codes and their successive-cancellation list (SCL) decoding algorithms are leveraged as the underlying coding scheme. More specifically, our protocol introduces Gaussian-based perturbations to enable efficient soft decoding. Experiments demonstrate that our approach outperforms prior methods, particularly for items with low true frequencies, while maintaining similar frequency estimation accuracy.
[LG-54] Towards a Unified Textual Graph Framework for Spectral Reasoning via Physical and Chemical Information Fusion
链接: https://arxiv.org/abs/2506.17761
作者: Jiheng Liang,Ziru Yu,Zujie Xie,Yuchen Guo,Yulan Guo,Xiangyang Yu
类目: Machine Learning (cs.LG)
*备注: 16 pages, 7 figures, 8 tables
Abstract:Motivated by the limitations of current spectral analysis methods, such as reliance on single-modality data, limited generalizability, and poor interpretability, we propose a novel multi-modal spectral analysis framework that integrates prior knowledge graphs with Large Language Models. Our method explicitly bridges physical spectral measurements and chemical structural semantics by representing them in a unified Textual Graph format, enabling flexible, interpretable, and generalizable spectral understanding. Raw spectra are first transformed into text-attributed graphs (TAGs), where nodes and edges are enriched with textual attributes describing both spectral properties and chemical context. These are then merged with relevant prior knowledge, including functional groups and molecular graphs, to form a Task Graph that incorporates “Prompt Nodes” supporting LLM-based contextual reasoning. A Graph Neural Network further processes this structure to complete downstream tasks. This unified design enables seamless multi-modal integration and automated feature decoding with minimal manual annotation. Our framework achieves consistently high performance across multiple spectral analysis tasks, including node-level, edge-level, and graph-level classification. It demonstrates robust generalization in both zero-shot and few-shot settings, highlighting its effectiveness in learning from limited data and supporting in-context reasoning. This work establishes a scalable and interpretable foundation for LLM-driven spectral analysis, unifying physical and chemical modalities for scientific applications.
[LG-55] Physics-informed mixture of experts network for interpretable battery degradation trajectory computation amid second-life complexities
链接: https://arxiv.org/abs/2506.17755
作者: Xinghao Huang,Shengyu Tao,Chen Liang,Jiawei Chen,Junzhe Shi,Yuqi Li,Bizhong Xia,Guangmin Zhou,Xuan Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Retired electric vehicle batteries offer immense potential to support low-carbon energy systems, but uncertainties in their degradation behavior and data inaccessibilities under second-life use pose major barriers to safe and scalable deployment. This work proposes a Physics-Informed Mixture of Experts (PIMOE) network that computes battery degradation trajectories using partial, field-accessible signals in a single cycle. PIMOE leverages an adaptive multi-degradation prediction module to classify degradation modes using expert weight synthesis underpinned by capacity-voltage and relaxation data, producing latent degradation trend embeddings. These are input to a use-dependent recurrent network for long-term trajectory prediction. Validated on 207 batteries across 77 use conditions and 67,902 cycles, PIMOE achieves an average mean absolute percentage error (MAPE) of 0.88% with a 0.43 ms inference time. Compared to the state-of-the-art Informer and PatchTST, it reduces computational time and MAPE by 50%. Compatible with random state of charge region sampling, PIMOE supports 150-cycle forecasts with 1.50% average and 6.26% maximum MAPE, and operates effectively even with pruned 5MB training data. Broadly, the PIMOE framework offers a deployable, history-free solution for battery degradation trajectory computation, redefining how second-life energy storage systems are assessed, optimized, and integrated into the sustainable energy landscape.
[LG-56] Numerical simulation of transient heat conduction with moving heat source using Physics Informed Neural Networks
链接: https://arxiv.org/abs/2506.17726
作者: Anirudh Kalyan,Sundararajan Natarajan
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, physics-informed neural networks (PINNs) are employed for the numerical simulation of heat transfer involving a moving source. To reduce the computational effort, a new training method is proposed that uses continuous time-stepping through transfer learning. Within this, the time interval is divided into smaller intervals and a single network is initialized. On this single network, each time interval is trained, with the initial condition for the (n+1)th interval taken as the solution obtained at the nth time increment. Thus, this framework enables the computation of large temporal intervals without increasing the complexity of the network itself. The proposed framework is used to estimate the temperature distribution in a homogeneous medium with a moving heat source. The results from the proposed framework are compared with the traditional finite element method, and good agreement is observed.
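The interval-by-interval transfer-learning loop can be sketched compactly. The code below assumes a 1D heat equation with a Gaussian moving source, a small MLP, and omits boundary-condition losses; all of these are simplifications for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
alpha = 0.1
q = lambda x, t: torch.exp(-100 * (x - 0.5 * t) ** 2)  # source moving to the right

def pde_residual(x, t):
    x, t = x.requires_grad_(True), t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t - alpha * u_xx - q(x, t)  # residual of u_t = alpha*u_xx + q

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x_ic, u_ic = torch.rand(256, 1), torch.zeros(256, 1)  # initial condition at t = 0
for n in range(4):                                    # four sub-intervals of [0, 1]
    t0, t1 = 0.25 * n, 0.25 * (n + 1)
    for _ in range(2000):                             # keep training the SAME network
        x = torch.rand(256, 1)
        t = t0 + (t1 - t0) * torch.rand(256, 1)
        ic_pred = net(torch.cat([x_ic, torch.full_like(x_ic, t0)], dim=1))
        loss = pde_residual(x, t).pow(2).mean() + (ic_pred - u_ic).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():  # the solution at t1 becomes the next initial condition
        u_ic = net(torch.cat([x_ic, torch.full_like(x_ic, t1)], dim=1))
```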
[LG-57] Learning Time-Aware Causal Representation for Model Generalization in Evolving Domains ICML2025
链接: https://arxiv.org/abs/2506.17718
作者: Zhuo He,Shuang Li,Wenze Song,Longhui Yuan,Jian Liang,Han Li,Kun Gai
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025
Abstract:Endowing deep models with the ability to generalize in dynamic scenarios is of vital significance for real-world deployment, given the continuous and complex changes in data distribution. Recently, evolving domain generalization (EDG) has emerged to address distribution shifts over time, aiming to capture evolving patterns for improved model generalization. However, existing EDG methods may suffer from spurious correlations by modeling only the dependence between data and targets across domains, creating a shortcut between task-irrelevant factors and the target, which hinders generalization. To this end, we design a time-aware structural causal model (SCM) that incorporates dynamic causal factors and the causal mechanism drifts, and propose Static-DYNamic Causal Representation Learning (SYNC), an approach that effectively learns time-aware causal representations. Specifically, it integrates specially designed information-theoretic objectives into a sequential VAE framework which captures evolving patterns, and produces the desired representations by preserving intra-class compactness of causal factors both across and within domains. Moreover, we theoretically show that our method can yield the optimal causal predictor for each time domain. Results on both synthetic and real-world datasets exhibit that SYNC can achieve superior temporal generalization performance.
[LG-58] CEGA: A Cost-Effective Approach for Graph-Based Model Extraction and Acquisition
链接: https://arxiv.org/abs/2506.17709
作者: Zebin Wang,Menghan Lin,Bolin Shen,Ken Anderson,Molei Liu,Tianxi Cai,Yushun Dong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable utility across diverse applications, and their growing complexity has made Machine Learning as a Service (MLaaS) a viable platform for scalable deployment. However, this accessibility also exposes GNNs to serious security threats, most notably model extraction attacks (MEAs), in which adversaries strategically query a deployed model to construct a high-fidelity replica. In this work, we evaluate the vulnerability of GNNs to MEAs and explore their potential for cost-effective model acquisition in non-adversarial research settings. Importantly, adaptive node querying strategies can also serve a critical role in research, particularly when labeling data is expensive or time-consuming. By selectively sampling informative nodes, researchers can train high-performing GNNs with minimal supervision, which is particularly valuable in domains such as biomedicine, where annotations often require expert input. To address this, we propose a node querying strategy tailored to a highly practical yet underexplored scenario, where bulk queries are prohibited, and only a limited set of initial nodes is available. Our approach iteratively refines the node selection mechanism over multiple learning cycles, leveraging historical feedback to improve extraction efficiency. Extensive experiments on benchmark graph datasets demonstrate our superiority over comparable baselines in accuracy, fidelity, and F1 score under strict query-size constraints. These results highlight both the susceptibility of deployed GNNs to extraction attacks and the promise of ethical, efficient GNN acquisition methods to support low-resource research environments.
[LG-59] Learning Personalized Utility Functions for Drivers in Ride-hailing Systems Using Ensemble Hypernetworks
链接: https://arxiv.org/abs/2506.17672
作者: Weiming Mai,Jie Gao,Oded Cats
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:
Abstract:In ride-hailing systems, drivers decide whether to accept or reject ride requests based on factors such as order characteristics, traffic conditions, and personal preferences. Accurately predicting these decisions is essential for improving the efficiency and reliability of these systems. Traditional models, such as the Random Utility Maximization (RUM) approach, typically predict drivers’ decisions by assuming linear correlations among attributes. However, these models often fall short because they fail to account for non-linear interactions between attributes and do not cater to the unique, personalized preferences of individual drivers. In this paper, we develop a method for learning personalized utility functions using hypernetworks and ensemble learning. Hypernetworks dynamically generate weights for a linear utility function based on trip request data and driver profiles, capturing the non-linear relationships. An ensemble of hypernetworks trained on different data segments further improves model adaptability and generalization by introducing controlled randomness, thereby reducing over-fitting. We validate the performance of our ensemble hypernetworks model in terms of prediction accuracy and uncertainty estimation on a real-world dataset. The results demonstrate that our approach not only accurately predicts each driver’s utility but also effectively balances the needs for explainability and uncertainty quantification. Additionally, our model serves as a powerful tool for revealing the personalized preferences of different drivers, clearly illustrating which attributes largely impact their rider acceptance decisions.
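The key mechanism, a hypernetwork that maps a driver profile to the weights of that driver's linear utility function, can be sketched as follows. The profile and attribute dimensions, layer sizes, and the simple output-averaging ensemble are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class UtilityHypernet(nn.Module):
    def __init__(self, profile_dim: int, n_attr: int, hidden: int = 64):
        super().__init__()
        # The hypernetwork emits per-attribute weights plus a bias for each driver.
        self.hyper = nn.Sequential(nn.Linear(profile_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_attr + 1))

    def forward(self, profile: torch.Tensor, trip_attrs: torch.Tensor) -> torch.Tensor:
        params = self.hyper(profile)               # (batch, n_attr + 1)
        w, b = params[:, :-1], params[:, -1]
        utility = (w * trip_attrs).sum(dim=1) + b  # personalized *linear* utility
        return torch.sigmoid(utility)              # acceptance probability

# Ensemble: several hypernets trained on different data segments, outputs averaged.
models = [UtilityHypernet(profile_dim=8, n_attr=12) for _ in range(5)]
profile, trip = torch.randn(4, 8), torch.randn(4, 12)
p_accept = torch.stack([m(profile, trip) for m in models]).mean(dim=0)
```

Because the generated function stays linear in the trip attributes, the per-driver weights remain directly readable, which is what makes the attribute-importance analysis described in the abstract possible.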
[LG-60] Online Multi-LLM Selection via Contextual Bandits under Unstructured Context Evolution
链接: https://arxiv.org/abs/2506.17670
作者: Manhin Poon,XiangXiang Dai,Xutong Liu,Fang Kong,John C.S. Lui,Jinhang Zuo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths, making it challenging to select the most suitable LLM for a given user query. We study the problem of adaptive multi-LLM selection in an online setting, where the learner interacts with users through multi-step query refinement and must choose LLMs sequentially without access to offline datasets or model internals. A key challenge arises from unstructured context evolution: the prompt dynamically changes in response to previous model outputs via a black-box process, which cannot be simulated, modeled, or learned. To address this, we propose the first contextual bandit framework for sequential LLM selection under unstructured prompt dynamics. We formalize a notion of myopic regret and develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction. We further introduce budget-aware and positionally-aware (favoring early-stage satisfaction) extensions to accommodate variable query costs and user preferences for early high-quality responses. Our algorithms are theoretically grounded and require no offline fine-tuning or dataset-specific training. Experiments on diverse benchmarks demonstrate that our methods outperform existing LLM routing strategies in both accuracy and cost-efficiency, validating the power of contextual bandits for real-time, adaptive LLM selection.
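As a concrete reference point, a disjoint-model LinUCB loop for picking among candidate LLMs might look like the sketch below. The context featurization, the reward (response quality minus cost), and the dimensions are illustrative assumptions; the paper's algorithm additionally targets myopic regret under black-box prompt evolution and adds budget- and position-aware variants.

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms: int, d: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge estimate of arm payoff
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))                  # highest upper confidence bound

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

bandit = LinUCB(n_arms=3, d=16)  # e.g., three candidate LLMs
x = np.random.randn(16)          # features of the current (black-box) prompt state
arm = bandit.select(x)
bandit.update(arm, x, reward=0.8)
```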
[LG-61] Trustworthy Chronic Disease Risk Prediction For Self-Directed Preventive Care via Medical Literature Validation
链接: https://arxiv.org/abs/2506.17620
作者: Minh Le,Khoi Ton
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Chronic diseases are long-term, manageable, yet typically incurable conditions, highlighting the need for effective preventive strategies. Machine learning has been widely used to assess individual risk for chronic diseases. However, many models rely on medical test data (e.g. blood results, glucose levels), which limits their utility for proactive self-assessment. Additionally, to gain public trust, machine learning models should be explainable and transparent. Although some research on self-assessment machine learning models includes explainability, their explanations are not validated against established medical literature, reducing confidence in their reliability. To address these issues, we develop deep learning models that predict the risk of developing 13 chronic diseases using only personal and lifestyle factors, enabling accessible, self-directed preventive care. Importantly, we use SHAP-based explainability to identify the most influential model features and validate them against established medical literature. Our results show a strong alignment between the models’ most influential features and established medical literature, reinforcing the models’ trustworthiness. Critically, we find that this observation holds across 13 distinct diseases, indicating that this machine learning approach can be broadly trusted for chronic disease prediction. This work lays the foundation for developing trustworthy machine learning tools for self-directed preventive care. Future research can explore other approaches for models’ trustworthiness and discuss how the models can be used ethically and responsibly.
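The validation loop sketched in the abstract (rank features by SHAP attribution, then check the top-ranked features against the medical literature) can be illustrated with the shap library. The feature names, background data, and the stand-in prediction function below are synthetic placeholders, not the paper's trained model.

```python
import numpy as np
import shap  # pip install shap

feature_names = ["age", "bmi", "smoking", "exercise_hours", "sleep_hours"]
X_background = np.random.rand(100, len(feature_names))  # stand-in lifestyle data

def predict_risk(X: np.ndarray) -> np.ndarray:
    # Placeholder for the trained deep model's predicted disease probability.
    return 1 / (1 + np.exp(-(X @ np.array([0.8, 0.6, 1.2, -0.9, -0.4]))))

explainer = shap.KernelExplainer(predict_risk, shap.sample(X_background, 50))
shap_values = explainer.shap_values(X_background[:5])
# Rank features by mean |SHAP| and compare the top ones with medical literature.
ranking = np.argsort(-np.abs(shap_values).mean(axis=0))
print([feature_names[i] for i in ranking])
```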
[LG-62] EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
链接: https://arxiv.org/abs/2506.17615
作者: Ibrahim Ahmed,Clemens Schaefer,Gil Tabak,Denis Vnukov,Zenong Zhang,Felix Chern,Anatoliy Yevtushenko,Andy Davis
类目: Machine Learning (cs.LG)
*备注:
Abstract:While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce across various network topologies. Furthermore, EQuARX accelerates the prefill stage of Gemma 3 27B by 1.25X and Gemma 3 12B by 1.1X, respectively, with small to negligible impact on quality.
[LG-63] Towards Fundamental Limits for Active Multi-distribution Learning COLT
链接: https://arxiv.org/abs/2506.17607
作者: Chicheng Zhang,Yihan Zhou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: to appear in Conference on Learning Theory (COLT) 2025
Abstract:Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier’s performance is measured by its error under the worst distribution. This problem has attracted a lot of recent interest due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of the sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remains unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds, in distribution-dependent and distribution-free settings. Specifically, we prove upper bounds of $\widetilde{O}\bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\bigr)$ and $\widetilde{O}\bigl(\theta_{\max}(d+k)\bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\bigr)+\frac{k\nu}{\varepsilon^2}\bigr)$ in the realizable and agnostic settings respectively, where $\theta_{\max}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $\nu$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish an instance-dependent sample complexity bound for passive multi-distribution learning that smoothly interpolates between the realizable and agnostic regimes (Blum et al., 2017; Zhang et al., 2024), which may be of independent interest.
[LG-64] LFR-PINO: A Layered Fourier Reduced Physics-Informed Neural Operator for Parametric PDEs
链接: https://arxiv.org/abs/2506.17582
作者: Jing Wang,Biao Chen,Hairun Xie,Rui Wang,Yifan Xia,Jifa Zhang,Hui Xu
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 28 pages, 17 figures
Abstract:Physics-informed neural operators have emerged as a powerful paradigm for solving parametric partial differential equations (PDEs), particularly in the aerospace field, enabling the learning of solution operators that generalize across parameter spaces. However, existing methods either suffer from limited expressiveness due to fixed basis/coefficient designs, or face computational challenges due to the high dimensionality of the parameter-to-weight mapping space. We present LFR-PINO, a novel physics-informed neural operator that introduces two key innovations: (1) a layered hypernetwork architecture that enables specialized parameter generation for each network layer, and (2) a frequency-domain reduction strategy that significantly reduces parameter count while preserving essential spectral features. This design enables efficient learning of a universal PDE solver through pre-training, capable of directly handling new equations while allowing optional fine-tuning for enhanced precision. The effectiveness of this approach is demonstrated through comprehensive experiments on four representative PDE problems, where LFR-PINO achieves 22.8%-68.7% error reduction compared to state-of-the-art baselines. Notably, the frequency-domain reduction strategy reduces memory usage by 28.6%-69.3% compared to Hyper-PINNs while maintaining solution accuracy, striking an optimal balance between computational efficiency and solution fidelity.
[LG-65] Towards Deeper GCNs: Alleviating Over-smoothing via Iterative Training and Fine-tuning
链接: https://arxiv.org/abs/2506.17576
作者: Furong Peng,Jinzhen Gao,Xuan Lu,Kang Liu,Yifan Huo,Sheng Wang
类目: Machine Learning (cs.LG)
*备注: 16 pages, 18 figures
Abstract:Graph Convolutional Networks (GCNs) suffer from severe performance degradation in deep architectures due to over-smoothing. While existing studies primarily attribute the over-smoothing to repeated applications of graph Laplacian operators, our empirical analysis reveals a critical yet overlooked factor: trainable linear transformations in GCNs significantly exacerbate feature collapse, even at moderate depths (e.g., 8 layers). In contrast, Simplified Graph Convolution (SGC), which removes these transformations, maintains stable feature diversity up to 32 layers, highlighting linear transformations’ dual role in facilitating expressive power and inducing over-smoothing. However, completely removing linear transformations weakens the model’s expressive capacity. To address this trade-off, we propose Layer-wise Gradual Training (LGT), a novel training strategy that progressively builds deep GCNs while preserving their expressiveness. LGT integrates three complementary components: (1) layer-wise training to stabilize optimization from shallow to deep layers, (2) low-rank adaptation to fine-tune shallow layers and accelerate training, and (3) identity initialization to ensure smooth integration of new layers and accelerate convergence. Extensive experiments on benchmark datasets demonstrate that LGT achieves state-of-the-art performance on vanilla GCN, significantly improving accuracy even in 32-layer settings. Moreover, as a training method, LGT can be seamlessly combined with existing methods such as PairNorm and ContraNorm, further enhancing their performance in deeper networks. LGT offers a general, architecture-agnostic training framework for scalable deep GCNs. The code is available at this https URL.
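A minimal sketch of the layer-wise growing schedule with identity initialization is given below; the synthetic graph, the stand-in normalized adjacency, the training lengths, and the omission of the low-rank adaptation of shallow layers are all simplifications of the full LGT procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gcn_layer(dim: int) -> nn.Linear:
    lin = nn.Linear(dim, dim, bias=False)
    nn.init.eye_(lin.weight)  # identity init: the new layer starts as a near no-op
    return lin

class GrowingGCN(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.layers = nn.ModuleList([gcn_layer(dim)])

    def add_layer(self, dim: int) -> None:
        self.layers.append(gcn_layer(dim))

    def forward(self, A_hat: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        H = X
        for layer in self.layers:
            H = torch.relu(A_hat @ layer(H))  # propagate, transform, activate
        return H

dim, n = 16, 32
A_hat = torch.eye(n) + 0.1 * torch.rand(n, n)  # stand-in normalized adjacency
X, y = torch.randn(n, dim), torch.randint(0, 2, (n,))
head, model = nn.Linear(dim, 2), GrowingGCN(dim)
for depth in range(1, 9):                      # grow gradually from 1 to 8 layers
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(200):                       # train at the current depth
        loss = F.cross_entropy(head(model(A_hat, X)), y)
        opt.zero_grad(); loss.backward(); opt.step()
    if depth < 8:
        model.add_layer(dim)
```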
[LG-66] Faster Low-Rank Approximation and Kernel Ridge Regression via the Block-Nyström Method
链接: https://arxiv.org/abs/2506.17556
作者: Sachin Garg,Michał Dereziński
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:The Nyström method is a popular low-rank approximation technique for large matrices that arise in kernel methods and convex optimization. Yet, when the data exhibits heavy-tailed spectral decay, the effective dimension of the problem often becomes so large that even the Nyström method may be outside of our computational budget. To address this, we propose Block-Nyström, an algorithm that injects a block-diagonal structure into the Nyström method, thereby significantly reducing its computational cost while recovering strong approximation guarantees. We show that Block-Nyström can be used to construct improved preconditioners for second-order optimization, as well as to efficiently solve kernel ridge regression for statistical learning over Hilbert spaces. Our key technical insight is that, within the same computational budget, combining several smaller Nyström approximations leads to stronger tail estimates of the input spectrum than using one larger approximation. Along the way, we provide a novel recursive preconditioning scheme for efficiently inverting the Block-Nyström matrix, and provide new statistical learning bounds for a broad class of approximate kernel ridge regression solvers.
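The abstract's key idea, combining several smaller Nyström approximations within one computational budget, can be illustrated as follows. This sketch assumes the block-diagonal structure sits in the inverted landmark matrix; the paper's precise estimator, its recursive preconditioner, and its guarantees go well beyond this toy version.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def block_nystrom(X, n_landmarks=60, n_blocks=3, gamma=0.5):
    idx = np.random.choice(len(X), n_landmarks, replace=False)
    C = rbf(X, X[idx], gamma)                     # n x m cross-kernel
    W_inv = np.zeros((n_landmarks, n_landmarks))
    for b in np.array_split(np.arange(n_landmarks), n_blocks):
        Wb = rbf(X[idx[b]], X[idx[b]], gamma)     # small landmark block
        W_inv[np.ix_(b, b)] = np.linalg.pinv(Wb)  # invert each block separately
    return C @ W_inv @ C.T                        # low-rank kernel approximation

X = np.random.randn(500, 5)
K_true = rbf(X, X)
K_approx = block_nystrom(X)
print(np.linalg.norm(K_true - K_approx) / np.linalg.norm(K_true))
```

Inverting several small blocks costs far less than inverting one large landmark matrix, which is the computational-budget argument the abstract makes.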
[LG-67] Predicting E-commerce Purchase Behavior using a DQN-Inspired Deep Learning Model for enhanced adaptability
链接: https://arxiv.org/abs/2506.17543
作者: Aditi Madhusudan Jain
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a novel approach to predicting buying intent and product demand in e-commerce settings, leveraging a Deep Q-Network (DQN) inspired architecture. In the rapidly evolving landscape of online retail, accurate prediction of user behavior is crucial for optimizing inventory management, personalizing user experiences, and maximizing sales. Our method adapts concepts from reinforcement learning to a supervised learning context, combining the sequential modeling capabilities of Long Short-Term Memory (LSTM) networks with the strategic decision-making aspects of DQNs. We evaluate our model on a large-scale e-commerce dataset comprising over 885,000 user sessions, each characterized by 1,114 features. Our approach demonstrates robust performance in handling the inherent class imbalance typical in e-commerce data, where purchase events are significantly less frequent than non-purchase events. Through comprehensive experimentation with various classification thresholds, we show that our model achieves a balance between precision and recall, with an overall accuracy of 88% and an AUC-ROC score of 0.88. Comparative analysis reveals that our DQN-inspired model offers advantages over traditional machine learning and standard deep learning approaches, particularly in its ability to capture complex temporal patterns in user behavior. The model’s performance and scalability make it well-suited for real-world e-commerce applications dealing with high-dimensional, sequential data. This research contributes to the field of e-commerce analytics by introducing a novel predictive modeling technique that combines the strengths of deep learning and reinforcement learning paradigms. Our findings have significant implications for improving demand forecasting, personalizing user experiences, and optimizing marketing strategies in online retail environments.
[LG-68] Episode-specific Fine-tuning for Metric-based Few-shot Learners with Optimization-based Training
链接: https://arxiv.org/abs/2506.17499
作者: Xuanyu Zhuang,Geoffroy Peeters,Gaël Richard
类目: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*备注:
Abstract:In few-shot classification tasks (so-called episodes), a small set of labeled support samples is provided during inference to aid the classification of unlabeled query samples. Metric-based models typically operate by computing similarities between query and support embeddings within a learned metric space, followed by nearest-neighbor classification. However, these labeled support samples are often underutilized–they are only used for similarity comparison, despite their potential to fine-tune and adapt the metric space itself to the classes in the current episode. To address this, we propose a series of simple yet effective episode-specific, during-inference fine-tuning methods for metric-based models, including Rotational Division Fine-Tuning (RDFT) and its two variants, Iterative Division Fine-Tuning (IDFT) and Augmented Division Fine-Tuning (ADFT). These methods construct pseudo support-query pairs from the given support set to enable fine-tuning even for non-parametric models. Nevertheless, the severely limited amount of data in each task poses a substantial risk of overfitting when applying such fine-tuning strategies. To mitigate this, we further propose to train the metric-based model within an optimization-based meta-learning framework. With the combined efforts of episode-specific fine-tuning and optimization-based meta-training, metric-based models are equipped with the ability to rapidly adapt to the limited support samples during inference while avoiding overfitting. We validate our approach on three audio datasets from diverse domains, namely ESC-50 (environmental sounds), Speech Commands V2 (spoken keywords), and Medley-solos-DB (musical instrument). Experimental results demonstrate that our approach consistently improves performance for all evaluated metric-based models (especially for attention-based models) and generalizes well across different audio domains.
[LG-69] Online Adaptation for Flying Quadrotors in Tight Formations
链接: https://arxiv.org/abs/2506.17488
作者: Pei-An Hsieh,Kong Yao Chee,M. Ani Hsieh
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 10 pages, 4 figures
Abstract:The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: this https URL
[LG-70] A geometric framework for momentum-based optimizers for low-rank training
链接: https://arxiv.org/abs/2506.17475
作者: Steffen Schotthöfer,Timon Klein,Jonas Kusch
类目: Machine Learning (cs.LG)
*备注:
Abstract:Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach leverages and combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence, and stronger validation metrics at given parameter budgets.
[LG-71] Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
链接: https://arxiv.org/abs/2506.17417
作者: Mingyuan Wu,Meitang Li,Jingcheng Yang,Jize Jiang,Kaizhuo Yan,Zhaoheng Li,Minjia Zhang,Klara Nahrstedt
类目: Machine Learning (cs.LG)
*备注: Work in progress
Abstract:Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification all improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the “aha moment”, does not lead to measurable gains. Through extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
[LG-72] Adaptive Control Attention Network for Underwater Acoustic Localization and Domain Adaptation
链接: https://arxiv.org/abs/2506.17409
作者: Quoc Thinh Vo,Joe Woods,Priontu Chowdhury,David K. Han
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: This paper has been accepted for the 33rd European Signal Processing Conference (EUSIPCO) 2025 in Palermo, Italy
Abstract:Localizing acoustic sound sources in the ocean is a challenging task due to the complex and dynamic nature of the environment. Factors such as high background noise, irregular underwater geometries, and varying acoustic properties make accurate localization difficult. To address these obstacles, we propose a multi-branch network architecture designed to accurately predict the distance between a moving acoustic source and a receiver, tested on real-world underwater signal arrays. The network leverages Convolutional Neural Networks (CNNs) for robust spatial feature extraction and integrates Conformers with a self-attention mechanism to effectively capture temporal dependencies. Log-mel spectrogram and generalized cross-correlation with phase transform (GCC-PHAT) features are employed as input representations. To further enhance the model performance, we introduce an Adaptive Gain Control (AGC) layer that adaptively adjusts the amplitude of input features, ensuring consistent energy levels across varying ranges, signal strengths, and noise conditions. We assess the model’s generalization capability by training it in one domain and testing it in a different domain, using only a limited amount of data from the test domain for fine-tuning. Our proposed method outperforms state-of-the-art (SOTA) approaches in similar settings, establishing new benchmarks for underwater sound localization.
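One of the two input representations, GCC-PHAT, has a standard closed form: whiten the cross-power spectrum so only phase remains, then inverse-transform to obtain a sharp correlation peak at the time difference of arrival. The sketch below uses synthetic signals and an arbitrary sampling rate.

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int = 16000):
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs  # time difference of arrival (s)
    return tau, cc

fs = 16000
t = np.arange(fs) / fs
src = np.sin(2 * np.pi * 440 * t)
delayed = np.roll(src, 80)                 # 80 samples = 5 ms relative delay
print(gcc_phat(delayed, src, fs)[0])       # ~= 0.005
```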
[LG-73] FFINO: Factorized Fourier Improved Neural Operator for Modeling Multiphase Flow in Underground Hydrogen Storage
链接: https://arxiv.org/abs/2506.17344
作者: Tao Wang,Hewei Tang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Underground hydrogen storage (UHS) is a promising energy storage option for the current energy transition to a low-carbon economy. Fast modeling of hydrogen plume migration and pressure field evolution is crucial for UHS field management. In this study, we propose a new neural operator architecture, FFINO, as a fast surrogate model for multiphase flow problems in UHS. We parameterize experimental relative permeability curves reported in the literature and include them as key uncertainty parameters in the FFINO model. We also compare the FFINO model with the state-of-the-art FMIONet model through a comprehensive combination of metrics. Our new FFINO model has 38.1% fewer trainable parameters, 17.6% less training time, and 12% less GPU memory cost compared to FMIONet. The FFINO model also achieves a 9.8% accuracy improvement in predicting the hydrogen plume in focused areas, and 18% higher RMSE in predicting pressure buildup. The inference time of the trained FFINO model is 7850 times faster than a numerical simulator, which makes it a competent substitute for numerical simulations of UHS problems with superior time efficiency.
[LG-74] AutomataGPT: Forecasting and Ruleset Inference for Two-Dimensional Cellular Automata
链接: https://arxiv.org/abs/2506.17333
作者: Jaime A. Berkovich,Noah S. David,Markus J. Buehler
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Cellular automata (CA) provide a minimal formalism for investigating how simple local interactions generate rich spatiotemporal behavior in domains as diverse as traffic flow, ecology, tissue morphogenesis and crystal growth. However, automatically discovering the local update rules for a given phenomenon and using them for quantitative prediction remains challenging. Here we present AutomataGPT, a decoder-only transformer pretrained on around 1 million simulated trajectories that span 100 distinct two-dimensional binary deterministic CA rules on toroidal grids. When evaluated on previously unseen rules drawn from the same CA family, AutomataGPT attains 98.5% perfect one-step forecasts and reconstructs the governing update rule with up to 96% functional (application) accuracy and 82% exact rule-matrix match. These results demonstrate that large-scale pretraining over wider regions of rule space yields substantial generalization in both the forward (state forecasting) and inverse (rule inference) problems, without hand-crafted priors. By showing that transformer models can faithfully infer and execute CA dynamics from data alone, our work lays the groundwork for abstracting real-world dynamical phenomena into data-efficient CA surrogates, opening avenues in biology, tissue engineering, physics and AI-driven scientific discovery.
[LG-75] CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction
链接: https://arxiv.org/abs/2506.17326
作者: Agnideep Aich,Md Monzur Murshed,Sameera Hewage,Amanda Mayeaux
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can significantly lower this risk. Despite significant advancements in machine learning for identifying diabetic cases, results can still be influenced by the imbalanced nature of the data. To address this challenge, our study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class, and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset and generated data using the A2 copula, then applied four machine learning algorithms: logistic regression, random forest, gradient boosting, and extreme gradient boosting. Our findings indicate that XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2% and AUC by 25.5% compared to the standard SMOTE method. Furthermore, we statistically validated our results using the McNemar test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.
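The dependence-preserving idea behind copula oversampling can be sketched as follows. Note the substitution: the paper fits an A2 copula, while this illustration uses a Gaussian copula (with empirical marginals) purely because it is simple to sample; the rank-transform / sample / invert pipeline is the same in spirit.

```python
import numpy as np
from scipy import stats

def copula_oversample(X_min: np.ndarray, n_new: int) -> np.ndarray:
    n, d = X_min.shape
    # 1) Map each feature to uniform marginals via empirical ranks.
    U = (np.argsort(np.argsort(X_min, axis=0), axis=0) + 0.5) / n
    # 2) Fit the dependence structure in Gaussian latent space (copula correlation).
    Z = stats.norm.ppf(U)
    corr = np.corrcoef(Z, rowvar=False)
    # 3) Sample new latent points with the same dependence; map back to uniforms.
    Z_new = np.random.multivariate_normal(np.zeros(d), corr, size=n_new)
    U_new = stats.norm.cdf(Z_new)
    # 4) Invert the empirical marginals feature by feature (quantile transform).
    return np.column_stack([np.quantile(X_min[:, j], U_new[:, j]) for j in range(d)])

X_minority = np.random.rand(50, 8)  # stand-in for minority-class samples
X_synth = copula_oversample(X_minority, n_new=200)
print(X_synth.shape)
```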
[LG-76] Using Machine Learning in Analyzing Air Quality Discrepancies of Environmental Impact
链接: https://arxiv.org/abs/2506.17319
作者: Shuangbao Paul Wang,Lucas Yang,Rahouane Chouchane,Jin Guo,Michael Bailey
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: IEEE 2024 International Conference on AI x Data Knowledge Engineering (AIxDKE)
Abstract:In this study, we apply machine learning and software engineering to analyze air pollution levels in the City of Baltimore. The data model was fed with three primary data sources: 1) a biased method of estimating insurance risk used by the Home Owners' Loan Corporation, 2) demographics of Baltimore residents, and 3) census data estimates of NO2 and PM2.5 concentrations. The dataset covers 650,643 Baltimore residents among 44.7 million residents of 202 major cities in the US. The results show that air pollution levels have a clear association with the biased insurance estimating method. Great disparities are present in NO2 levels between more desirable and low-income blocks. Similar disparities exist in air pollution levels across residents’ ethnicities. As Baltimore’s population consists of a greater proportion of people of color, the finding reveals how decades-old policies have continued to discriminate against and affect the quality of life of Baltimore citizens today.
[LG-77] A family of graph GOSPA metrics for graphs with different sizes
链接: https://arxiv.org/abs/2506.17316
作者: Jinhao Gu,Ángel F. García-Fernández,Robert E. Firth,Lennart Svensson
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:This paper proposes a family of graph metrics for measuring distances between graphs of different sizes. The proposed metric family defines a general form of the graph generalised optimal sub-pattern assignment (GOSPA) metric and is also proved to satisfy the metric properties. Similarly to the graph GOSPA metric, the proposed graph GOSPA metric family also penalises the node attribute costs for assigned nodes between the two graphs, and the number of unassigned nodes. However, the proposed family of metrics provides more general penalties for edge mismatches than the graph GOSPA metric. This paper also shows that the graph GOSPA metric family can be approximately computed using linear programming. Simulation experiments are performed to illustrate the characteristics of the proposed graph GOSPA metric family with different choices of hyperparameters. The benefits of the proposed graph GOSPA metric family for classification tasks are also shown on real-world datasets.
[LG-78] Efficient Malware Detection with Optimized Learning on High-Dimensional Features
链接: https://arxiv.org/abs/2506.17309
作者: Aditya Choudhary,Sarthak Pawar,Yashodhara Haribhakta
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at the International Conference on Innovations in Intelligent Systems: Advancements in Computing, Communication, and Cybersecurity (ISAC3)
Abstract:Malware detection using machine learning requires feature extraction from binary files, as models cannot process raw binaries directly. A common approach involves using LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. However, the high dimensionality of these features introduces significant computational challenges. This study addresses these challenges by applying two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). We evaluate three reduced feature dimensions (128, 256, and 384), which correspond to approximately 5.4%, 10.8%, and 16.1% of the original 2381 features, across four models (XGBoost, LightGBM, Extra Trees, and Random Forest), using a unified training, validation, and testing split formed from the EMBER-2018, ERMDS, and BODMAS datasets. This approach ensures generalization and avoids dataset bias. Experimental results show that LightGBM trained on the 384-dimensional feature set after XGBoost feature selection achieves the highest accuracy of 97.52% on the unified dataset, providing an optimal balance between computational efficiency and detection performance. The best model, trained in 61 minutes using 30 GB of RAM and 19.5 GB of disk space, generalizes effectively to completely unseen datasets, maintaining 95.31% accuracy on TRITIUM and 93.98% accuracy on INFERNO. These findings present a scalable, compute-efficient approach for malware detection without compromising accuracy.
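Both reduction routes are straightforward to reproduce with standard libraries. The sketch below uses synthetic data in place of the EMBER-style vectors; the estimator settings are illustrative, while the 384-feature XGBoost-selection plus LightGBM pairing mirrors the best configuration reported in the abstract.

```python
import numpy as np
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA

X = np.random.rand(2000, 2381)     # stand-in for EMBER feature vectors
y = np.random.randint(0, 2, 2000)  # stand-in malware/benign labels

# Route 1: XGBoost-based selection, keep the 384 most important features.
selector = XGBClassifier(n_estimators=100).fit(X, y)
top384 = np.argsort(-selector.feature_importances_)[:384]
X_sel = X[:, top384]

# Route 2: PCA down to 384 components.
X_pca = PCA(n_components=384).fit_transform(X)

# Final model: LightGBM on the XGBoost-selected features.
clf = LGBMClassifier(n_estimators=200).fit(X_sel, y)
```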
[LG-79] Recommendation systems in e-commerce applications with machine learning methods
链接: https://arxiv.org/abs/2506.17287
作者: Aneta Poniszewska-Maranda,Magdalena Pakula,Bozena Borowska
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 29th International Conference on Evaluation and Assessment in Software Engineering, 17-20 June, 2025, Istanbul, Turkey
Abstract:E-commerce platforms are increasingly reliant on recommendation systems to enhance user experience, retain customers, and, in most cases, drive sales. The integration of machine learning methods into these systems has significantly improved their efficiency, personalization, and scalability. This paper aims to highlight the current trends in e-commerce recommendation systems, identify challenges, and evaluate the effectiveness of various machine learning methods used, including collaborative filtering, content-based filtering, and hybrid models. A systematic literature review (SLR) was conducted, analyzing 38 publications from 2013 to 2025. The methods used were evaluated and compared to determine their performance and effectiveness in addressing e-commerce challenges.
[LG-80] A Framework for Generating Conversational Recommendation Datasets from Behavioral Interactions
链接: https://arxiv.org/abs/2506.17285
作者: Vinaik Chhetri,Yousaf Reza,Moghis Fereidouni,Srijata Maji,Umar Farooq,AB Siddique
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 12 pages, 6 tables, 4 figures
Abstract:Modern recommendation systems typically follow two complementary paradigms: collaborative filtering, which models long-term user preferences from historical interactions, and conversational recommendation systems (CRS), which interact with users in natural language to uncover immediate needs. Each captures a different dimension of user intent. While CRS models lack collaborative signals, leading to generic or poorly personalized suggestions, traditional recommenders lack mechanisms to interactively elicit immediate needs. Unifying these paradigms promises richer personalization but remains challenging due to the lack of large-scale conversational datasets grounded in real user behavior. We present ConvRecStudio, a framework that uses large language models (LLMs) to simulate realistic, multi-turn dialogs grounded in timestamped user-item interactions and reviews. ConvRecStudio follows a three-stage pipeline: (1) Temporal Profiling, which constructs user profiles and community-level item sentiment trajectories over fine-grained aspects; (2) Semantic Dialog Planning, which generates a structured plan using a DAG of flexible super-nodes; and (3) Multi-Turn Simulation, which instantiates the plan using paired LLM agents for the user and system, constrained by executional and behavioral fidelity checks. We apply ConvRecStudio to three domains – MobileRec, Yelp, and Amazon Electronics – producing over 12K multi-turn dialogs per dataset. Human and automatic evaluations confirm the naturalness, coherence, and behavioral grounding of the generated conversations. To demonstrate utility, we build a cross-attention transformer model that jointly encodes user history and dialog context, achieving gains in Hit@K and NDCG@K over baselines using either signal alone or naive fusion. Notably, our model achieves a 10.9% improvement in Hit@1 on Yelp over the strongest baseline.
[LG-81] Training a Scientific Reasoning Model for Chemistry
链接: https://arxiv.org/abs/2506.17238
作者: Siddharth M. Narayanan,James D. Braza,Ryan-Rhys Griffiths,Albert Bou,Geemi Wellawatte,Mayk Caldas Ramos,Ludovico Mitchener,Samuel G. Rodriques,Andrew D. White
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reasoning models are large language models that emit a long chain-of-thought before answering, providing both higher accuracy and explicit reasoning for their response. A major question has been whether language model reasoning generalizes beyond mathematics, programming, and logic, where most previous work has focused. We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining, and require substantially less data compared to contemporary domain-specific models. We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks ranging from synthesizability, to blood-brain barrier permeability, to human receptor activity, to scent. Our model exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks. It is also more data efficient relative to specialized models. We anticipate that this method can be applied to train data-efficient language models specialized for tasks across a wide variety of scientific domains.
[LG-82] Bridging Equilibrium and Kinetics Prediction with a Data-Weighted Neural Network Model of Methane Steam Reforming
链接: https://arxiv.org/abs/2506.17224
作者: Zofia Pizoń,Shinji Kimijima,Grzegorz Brus
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 12 pages, 8 figures
Abstract:Hydrogen’s role is growing as an energy carrier, increasing the need for efficient production, with methane steam reforming being the most widely used technique. This process is crucial for applications like fuel cells, where hydrogen is converted into electricity, pushing for reactor miniaturization and optimized process control through numerical simulations. Existing models typically address either kinetic or equilibrium regimes, limiting their applicability. Here we show a surrogate model capable of unifying both regimes: an artificial neural network trained on a comprehensive dataset that includes experimental data from kinetic and equilibrium experiments, interpolated data, and theoretical data derived from theoretical models for each regime. Training was further enhanced by data augmentation and by assigning appropriate weights to each data type. After evaluating Bayesian Optimization and Random Sampling, the optimal model demonstrated high predictive accuracy for the composition of the post-reaction mixture under varying operating parameters, indicated by a mean squared error of 0.000498 and a strong Pearson correlation coefficient of 0.927. The network’s ability to provide continuous derivatives of its predictions makes it particularly useful for process modeling and optimization. The results confirm the surrogate model’s robustness for simulating methane steam reforming in both kinetic and equilibrium regimes, making it a valuable tool for design and process optimization.
[LG-83] Optimal Graph Reconstruction by Counting Connected Components in Induced Subgraphs COLT2025
链接: https://arxiv.org/abs/2506.08405
作者: Hadley Black,Arya Mazumdar,Barna Saha,Yinzhan Xu
类目: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: To appear in COLT 2025
Abstract:The graph reconstruction problem has been extensively studied under various query models. In this paper, we propose a new query model regarding the number of connected components, which is one of the most basic and fundamental graph parameters. Formally, we consider the problem of reconstructing an $n$-node, $m$-edge graph with oracle queries of the following form: provided with a subset of vertices, the oracle returns the number of connected components in the induced subgraph. We show that $\Theta(\frac{m \log n}{\log m})$ queries in expectation are both sufficient and necessary to adaptively reconstruct the graph. In contrast, we show that $\Omega(n^2)$ non-adaptive queries are required, even when $m = O(n)$. We also provide an $O(m\log n + n\log^2 n)$ query algorithm using only two rounds of adaptivity.
[LG-84] Learning Partitions with Optimal Query and Round Complexities COLT2025
链接: https://arxiv.org/abs/2505.05009
作者: Hadley Black,Arya Mazumdar,Barna Saha
类目: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Appearing in COLT 2025
Abstract:We consider the basic problem of learning an unknown partition of $n$ elements into at most $k$ sets using simple queries that reveal information about a small subset of elements. Our starting point is the well-studied pairwise same-set queries which ask if a pair of elements belong to the same class. It is known that non-adaptive algorithms require $\Theta(n^2)$ queries, while adaptive algorithms require $\Theta(nk)$ queries, and the best known algorithm uses $k-1$ rounds. This problem has been studied extensively over the last two decades in multiple communities due to its fundamental nature and relevance to clustering, active learning, and crowdsourcing. In many applications, it is of high interest to reduce adaptivity while minimizing query complexity. We give a complete characterization of the deterministic query complexity of this problem as a function of the number of rounds, $r$, interpolating between the non-adaptive and adaptive settings: for any constant $r$, the query complexity is $\Theta(n^{1+\frac{1}{2^r-1}}k^{1-\frac{1}{2^r-1}})$. Our algorithm only needs $O(\log \log n)$ rounds to attain the optimal $O(nk)$ query complexity. Next, we consider two generalizations of pairwise queries to subsets $S$ of size at most $s$: (1) weak subset queries which return the number of classes intersected by $S$, and (2) strong subset queries which return the entire partition restricted on $S$. Once again in crowdsourcing applications, queries on large sets may be prohibitive. For non-adaptive algorithms, we show $\Omega(n^2/s^2)$ strong queries are needed. Perhaps surprisingly, we show that there is a non-adaptive algorithm using weak queries that matches this bound up to log-factors for all $s \leq \sqrt{n}$. More generally, we obtain nearly matching upper and lower bounds for algorithms using subset queries in terms of both the number of rounds, $r$, and the query size bound, $s$.
[LG-85] Learning to Control an Android Robot Head for Facial Animation
链接: https://arxiv.org/abs/2412.13641
作者: Marcel Heisler,Christian Becker-Asano
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:The ability to display rich facial expressions is crucial for human-like robotic heads. While manually defining such expressions is intricate, there already exist approaches to automatically learn them. In this work one such approach is applied to evaluate and control a robot head different from the one in the original study. To improve the mapping of facial expressions from human actors onto a robot head, it is proposed to use 3D landmarks and their pairwise distances as input to the learning algorithm instead of the previously used facial action units. Participants of an online survey preferred mappings from our proposed approach in most cases, though further improvements are still required.
[LG-86] Local Averaging Accurately Distills Manifold Structure From Noisy Data
链接: https://arxiv.org/abs/2506.18761
作者: Yihan Shen,Shiyu Wang,Arnaud Lamy,Mariam Avagyan,John Wright
类目: Machine Learning (stat.ML); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:
Abstract:High-dimensional data are ubiquitous, with examples ranging from natural images to scientific datasets, and often reside near low-dimensional manifolds. Leveraging this geometric structure is vital for downstream tasks, including signal denoising, reconstruction, and generation. However, in practice, the manifold is typically unknown and only noisy samples are available. A fundamental approach to uncovering the manifold structure is local averaging, which is a cornerstone of state-of-the-art provable methods for manifold fitting and denoising. However, to the best of our knowledge, there are no works that rigorously analyze the accuracy of local averaging in a manifold setting in high-noise regimes. In this work, we provide theoretical analyses of a two-round mini-batch local averaging method applied to noisy samples drawn from a d-dimensional manifold \mathcal{M} \subset \mathbb{R}^D, under a relatively high-noise regime where the noise size is comparable to the reach \tau. We show that with high probability, the averaged point \hat{\mathbf{q}} achieves the bound d(\hat{\mathbf{q}}, \mathcal{M}) \leq \sigma \sqrt{d}\left(1+\frac{\kappa\,\mathrm{diam}(\mathcal{M})}{\log(D)}\right), where \sigma, \mathrm{diam}(\mathcal{M}), \kappa denote the standard deviation of the Gaussian noise, the manifold's diameter, and a bound on its extrinsic curvature, respectively. This is the first analysis of local averaging accuracy over the manifold in the relatively high noise regime where \sigma \sqrt{D} \approx \tau. The proposed method can serve as a preprocessing step for a wide range of provable methods designed for lower-noise regimes. Additionally, our framework can provide a theoretical foundation for a broad spectrum of denoising and dimensionality reduction methods that rely on local averaging techniques.
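A toy numerical sketch of the idea (not the paper's exact mini-batch procedure): two rounds of k-nearest-neighbor averaging applied to noisy samples from a circle, a 1-dimensional manifold in R^2, shrink the average distance to the manifold.

```python
import numpy as np

def local_average(points, k=20, rounds=2):
    """Replace each point by the mean of its k nearest neighbors, twice."""
    X = points.copy()
    for _ in range(rounds):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        nbrs = np.argsort(d2, axis=1)[:, :k]
        X = X[nbrs].mean(axis=1)
    return X

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 500)
noisy = np.c_[np.cos(t), np.sin(t)] + 0.15 * rng.standard_normal((500, 2))
denoised = local_average(noisy)
# Mean distance to the unit circle, before vs. after averaging
print(np.abs(np.linalg.norm(noisy, axis=1) - 1).mean(),
      np.abs(np.linalg.norm(denoised, axis=1) - 1).mean())
```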
[LG-87] Fast State-Augmented Learning for Wireless Resource Allocation with Dual Variable Regression
链接: https://arxiv.org/abs/2506.18748
作者: Yigit Berkay Uslu,Navid NaderiAlizadeh,Mark Eisen,Alejandro Ribeiro
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE TSP for possible publication
Abstract:We consider resource allocation problems in multi-user wireless networks, where the goal is to optimize a network-wide utility function subject to constraints on the ergodic average performance of users. We demonstrate how a state-augmented graph neural network (GNN) parametrization for the resource allocation policy circumvents the drawbacks of the ubiquitous dual subgradient methods by representing the network configurations (or states) as graphs and treating dual variables as dynamic inputs to the model, viewed as graph signals supported on the graphs. Lagrangian-maximizing state-augmented policies are learned during the offline training phase, and the dual variables evolve through gradient updates while executing the learned state-augmented policies during the inference phase. Our main contributions are to illustrate how near-optimal initialization of dual multipliers for faster inference can be accomplished with dual variable regression, leveraging a secondary GNN parametrization, and how maximization of the Lagrangian over the multipliers sampled from the dual descent dynamics substantially improves the training of state-augmented models. We demonstrate the superior performance of the proposed algorithm with extensive numerical experiments in a case study of transmit power control. Finally, we prove a convergence result and an exponential probability bound on the excursions of the dual function (iterate) optimality gaps.
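The inference-time dual dynamics can be sketched in a few lines. Below, `policy` stands in for the trained state-augmented GNN and `constraint_fn` for the running estimate of constraint slack; both are toy placeholders chosen so the multiplier converges to its optimal value of 1.

```python
import numpy as np

def dual_inference(policy, constraint_fn, state, mu, eta=0.1, steps=200):
    """Execute a state-augmented policy while the dual multipliers evolve by
    projected (sub)gradient steps on the constraint slack."""
    for _ in range(steps):
        action = policy(state, mu)             # Lagrangian-maximizing action
        slack = constraint_fn(state, action)   # >= 0 when the constraint is met
        mu = np.maximum(mu - eta * slack, 0.0) # dual update, projected to mu >= 0
    return mu

policy = lambda s, mu: mu / (1.0 + mu)         # toy: higher price, more compliant action
constraint_fn = lambda s, a: a - 0.5           # require an ergodic rate of at least 0.5
print(dual_inference(policy, constraint_fn, None, np.array([0.0])))  # -> ~1.0
```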
[LG-88] A Random Matrix Analysis of In-context Memorization for Nonlinear Attention
链接: https://arxiv.org/abs/2506.18656
作者: Zhenyu Liao,Jiaqing Liu,TianQi Hou,Difan Zou,Zenan Ling
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 40 pages, 7 figures
Abstract:Attention mechanisms have revolutionized machine learning (ML) by enabling efficient modeling of global dependencies across inputs. Their inherently parallelizable structures allow for efficient scaling with the exponentially increasing size of both pretraining data and model parameters. Yet, despite their central role as the computational backbone of modern large language models (LLMs), the theoretical understanding of Attention, especially in the nonlinear setting, remains limited. In this paper, we provide a precise characterization of the in-context memorization error of nonlinear Attention, in the high-dimensional proportional regime where the number of input tokens n and their embedding dimension p are both large and comparable. Leveraging recent advances in the theory of large kernel random matrices, we show that nonlinear Attention typically incurs higher memorization error than linear ridge regression on random inputs. However, this gap vanishes, and can even be reversed, when the input exhibits statistical structure, particularly when the Attention weights align with the input signal direction. Our results reveal how nonlinearity and input structure interact with each other to govern the memorization performance of nonlinear Attention. The theoretical insights are supported by numerical experiments.
[LG-89] Tight Generalization Error Bounds for Stochastic Gradient Descent in Non-convex Learning
链接: https://arxiv.org/abs/2506.18645
作者: Wenjun Xiong,Juan Ding,Xinlei Zuo,Qizhai Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Stochastic Gradient Descent (SGD) is fundamental for training deep neural networks, especially in non-convex settings. Understanding SGD's generalization properties is crucial for ensuring robust model performance on unseen data. In this paper, we analyze the generalization error bounds of SGD for non-convex learning by introducing the Type II perturbed SGD (T2pm-SGD), which accommodates both sub-Gaussian and bounded loss functions. The generalization error bound is decomposed into two components: the trajectory term and the flatness term. Our analysis improves the trajectory term to O(n^{-1}), significantly enhancing the previous O((nb)^{-1/2}) bound for bounded losses, where n is the number of training samples and b is the batch size. By selecting an optimal variance for the perturbation noise, the overall bound is further refined to O(n^{-2/3}). For sub-Gaussian loss functions, a tighter trajectory term is also achieved. In both cases, the flatness term remains stable across iterations and is smaller than those reported in previous literature, which increase with iterations. This stability, ensured by T2pm-SGD, leads to tighter generalization error bounds for both loss function types. Our theoretical results are validated through extensive experiments on benchmark datasets, including MNIST and CIFAR-10, demonstrating the effectiveness of T2pm-SGD in establishing tighter generalization bounds.
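A minimal sketch of the analyzed update, under the assumption that the Type II perturbation is Gaussian noise injected into the iterate at which the gradient is evaluated (the precise placement and the variance schedule are what the paper's analysis optimizes):

```python
import numpy as np

def t2pm_sgd(grad, w, lr=0.05, sigma=0.01, iters=500, seed=0):
    """Perturbed SGD sketch: evaluate the (stochastic) gradient at a
    Gaussian-perturbed copy of the iterate; sigma^2 is the tunable
    perturbation variance appearing in the bounds above."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        w_pert = w + sigma * rng.standard_normal(w.shape)
        w = w - lr * grad(w_pert)
    return w

grad = lambda w: 4 * (w @ w - 1) * w           # non-convex f(w) = (|w|^2 - 1)^2
print(t2pm_sgd(grad, np.array([2.0, 0.5])))    # lands near the unit circle
```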
[LG-90] Trustworthy Prediction with Gaussian Process Knowledge Scores
链接: https://arxiv.org/abs/2506.18630
作者: Kurt Butler,Guanchao Feng,Tong Chen,Petar Djuric
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 5 figures, to be published in the Proceedings of the European Signal Processing Conference (EUSIPCO)
Abstract:Probabilistic models are often used to make predictions in regions of the data space where no observations are available, but it is not always clear whether such predictions are well-informed by previously seen data. In this paper, we propose a knowledge score for predictions from Gaussian process regression (GPR) models that quantifies the extent to which observing data has reduced our uncertainty about a prediction. The knowledge score is interpretable and naturally bounded between 0 and 1. We demonstrate in several experiments that the knowledge score can anticipate when predictions from a GPR model are accurate, and that this anticipation improves performance in tasks such as anomaly detection, extrapolation, and missing data imputation. Source code for this project is available online at this https URL.
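The paper's exact score definition is in the text, but a proxy consistent with the description, bounded in [0, 1] and equal to the relative reduction of predictive variance from prior to posterior, can be sketched with scikit-learn (the formula below is an assumption, not the paper's definition):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel()
gpr = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-6,
                               optimizer=None).fit(X, y)

X_test = np.array([[2.5], [9.0]])              # inside the data vs. extrapolation
_, post_std = gpr.predict(X_test, return_std=True)
prior_var = gpr.kernel_(X_test).diagonal()     # k(x, x) under the prior
knowledge = 1.0 - post_std ** 2 / prior_var    # ~1 near data, ~0 far away
print(knowledge)
```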
[LG-91] Theoretical guarantees for neural estimators in parametric statistics
链接: https://arxiv.org/abs/2506.18508
作者: Almut Rödder,Manuel Hentschel,Sebastian Engelke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Neural estimators are simulation-based estimators for the parameters of a family of statistical models, which build a direct mapping from the sample to the parameter vector. They benefit from the versatility of available network architectures and efficient training methods developed in the field of deep learning. Neural estimators are amortized in the sense that, once trained, they can be applied to any new data set with almost no computational cost. While many papers have shown very good performance of these methods in simulation studies and real-world applications, so far no statistical guarantees are available to support these observations theoretically. In this work, we study the risk of neural estimators by decomposing it into several terms that can be analyzed separately. We formulate easy-to-check assumptions ensuring that each term converges to zero, and we verify them for popular applications of neural estimators. Our results provide a general recipe to derive theoretical guarantees also for broader classes of architectures and estimation problems.
[LG-92] Leveraging neural network interatomic potentials for a foundation model of chemistry
链接: https://arxiv.org/abs/2506.18497
作者: So Yeon Kim,Yang Jeong Park,Ju Li
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 29pages, 10 figures
Abstract:Large-scale foundation models, including neural network interatomic potentials (NIPs) in computational materials science, have demonstrated significant potential. However, despite their success in accelerating atomistic simulations, NIPs face challenges in directly predicting electronic properties and often require coupling to higher-scale models or extensive simulations for macroscopic properties. Machine learning (ML) offers alternatives for structure-to-property mapping but faces trade-offs: feature-based methods often lack generalizability, while deep neural networks require significant data and computational power. To address these trade-offs, we introduce HackNIP, a two-stage pipeline that leverages pretrained NIPs. This method first extracts fixed-length feature vectors (embeddings) from NIP foundation models and then uses these embeddings to train shallow ML models for downstream structure-to-property predictions. This study investigates whether such a hybridization approach, by "hacking" the NIP, can outperform end-to-end deep neural networks, determines the dataset size at which this transfer learning approach surpasses direct fine-tuning of the NIP, and identifies which NIP embedding depths yield the most informative features. HackNIP is benchmarked on Matbench, evaluated for data efficiency, and tested on diverse tasks including ab initio, experimental, and molecular properties. We also analyze how embedding depth impacts performance. This work demonstrates a hybridization strategy to overcome ML trade-offs in materials science, aiming to democratize high-performance predictive modeling.
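The two-stage pipeline reduces to a few lines once embeddings are available. The `embed` interface below is hypothetical (a stand-in for whatever per-structure feature extraction a pretrained NIP exposes), and a dummy model makes the sketch self-contained:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

class DummyNIP:
    """Stand-in for a pretrained NIP; real models return learned embeddings."""
    def embed(self, structure, layer=-1):
        return np.asarray(structure, dtype=float)

def hacknip(structures, y, nip, depth=-1):
    """Stage 1: frozen embeddings from a chosen layer depth.
    Stage 2: a shallow model instead of end-to-end fine-tuning."""
    X = np.stack([nip.embed(s, layer=depth) for s in structures])
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))
    print("CV R^2:", cross_val_score(model, X, y, cv=5).mean())
    return model.fit(X, y)

rng = np.random.default_rng(0)
structures = rng.standard_normal((200, 16))    # toy "structures"
y = structures[:, :3].sum(axis=1)              # toy property
model = hacknip(structures, y, DummyNIP())
```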
[LG-93] BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity
链接: https://arxiv.org/abs/2506.18314
作者: Moein Khajehnejad,Forough Habibollahi,Adeel Razi
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 21 pages, 8 figures
Abstract:Existing foundation models for neuroimaging are often prohibitively large and data-intensive. We introduce BrainSymphony, a lightweight, parameter-efficient foundation model that achieves state-of-the-art performance while being pre-trained on significantly smaller public datasets. BrainSymphony’s strong multimodal architecture processes functional MRI data through parallel spatial and temporal transformer streams, which are then efficiently distilled into a unified representation by a Perceiver module. Concurrently, it models structural connectivity from diffusion MRI using a novel signed graph transformer to encode the brain’s anatomical structure. These powerful, modality-specific representations are then integrated via an adaptive fusion gate. Despite its compact design, our model consistently outperforms larger models on a diverse range of downstream benchmarks, including classification, prediction, and unsupervised network identification tasks. Furthermore, our model revealed novel insights into brain dynamics using attention maps on a unique external psilocybin neuroimaging dataset (pre- and post-administration). BrainSymphony establishes that architecturally-aware, multimodal models can surpass their larger counterparts, paving the way for more accessible and powerful research in computational neuroscience.
[LG-94] Quantifying Uncertainty in the Presence of Distribution Shifts
链接: https://arxiv.org/abs/2506.18283
作者: Yuli Slavutsky,David M. Blei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Neural networks make accurate predictions but often fail to provide reliable uncertainty estimates, especially under covariate distribution shifts between training and testing. To address this problem, we propose a Bayesian framework for uncertainty estimation that explicitly accounts for covariate shifts. While conventional approaches rely on fixed priors, the key idea of our method is an adaptive prior, conditioned on both training and new covariates. This prior naturally increases uncertainty for inputs that lie far from the training distribution in regions where predictive performance is likely to degrade. To efficiently approximate the resulting posterior predictive distribution, we employ amortized variational inference. Finally, we construct synthetic environments by drawing small bootstrap samples from the training data, simulating a range of plausible covariate shifts using only the original dataset. We evaluate our method on both synthetic and real-world data. It yields substantially improved uncertainty estimates under distribution shifts.
[LG-95] Phase retrieval with rank d measurements – descending algorithms' phase transitions
链接: https://arxiv.org/abs/2506.18282
作者: Mihailo Stojnic
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Companion paper [118] developed a powerful Random duality theory (RDT) based analytical program to statistically characterize the performance of descending phase retrieval algorithms (dPR) (these include all variants of gradient descent, among them the widely popular Wirtinger flow). We here generalize the program and show how it can be utilized to handle rank d positive definite phase retrieval (PR) measurements (with special cases d=1 and d=2 serving as emulations of the real and complex phase retrievals, respectively). In particular, we observe that the minimal sample complexity ratio (number of measurements scaled by the dimension of the unknown signal) which ensures dPR's success exhibits a phase transition (PT) phenomenon. For both plain and lifted RDT we determine the phase transition locations. To complement the theoretical results we implement a log barrier gradient descent variant and observe that, even in small dimensional scenarios (with problem sizes on the order of 100), the simulated phase transitions are in excellent agreement with the theoretical predictions.
[LG-96] Optimal spectral initializers' impact on phase retrieval phase transitions – an RDT view
链接: https://arxiv.org/abs/2506.18279
作者: Mihailo Stojnic
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:We analyze the relation between spectral initializers and theoretical limits of descending phase retrieval algorithms (dPR). In companion paper [104], for any sample complexity ratio \alpha, the parametric manifold \mathcal{PM}(\alpha) is recognized as a critically important structure that generically determines dPRs' ability to solve phase retrieval (PR). Moreover, the overlap between the algorithmic solution and the true signal is positioned as a key component of \mathcal{PM}. We here consider the so-called overlap optimal spectral initializers (OptSpins) as dPR's starting points and develop a generic Random duality theory (RDT) based program to statistically characterize them. In particular, we determine the functional structure of OptSpins and evaluate the starting overlaps that they provide for the dPRs. Since \mathcal{PM}'s so-called flat regions are highly susceptible to local jitteriness and as such are key obstacles on dPR's path towards PR's global optimum, a precise characterization of the starting overlap allows one to determine if such regions can be successfully circumvented. Through the presented theoretical analysis we observe two key points in that regard: (i) dPR's theoretical phase transition (critical \alpha above which they solve PR) might be difficult to practically achieve, as \mathcal{PM}'s flat regions are large, causing the associated OptSpins to fall exactly within them; and (ii) opting for so-called "safer compression" and slightly increasing \alpha (by say 15%) shrinks the flat regions and allows OptSpins to fall outside them and dPRs to ultimately solve PR. Numerical simulations are conducted as well and shown to be in excellent agreement with the theoretical predictions.
[LG-97] Phase transition of descending phase retrieval algorithms
链接: https://arxiv.org/abs/2506.18275
作者: Mihailo Stojnic
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:We study theoretical limits of descending phase retrieval algorithms. Utilizing Random duality theory (RDT) we develop a generic program that allows statistical characterization of various algorithmic performance metrics. Through these we identify the concepts of the parametric manifold and its funneling points as key mathematical objects that govern the underlying algorithms' behavior. An isomorphism between single funneling point manifolds and global convergence of descending algorithms is established. The structure and shape of the parametric manifold, as well as its dependence on the sample complexity, are studied through both plain and lifted RDT. Emergence of a phase transition is observed. Namely, as sample complexity increases, the parametric manifold transitions from a multi to a single funneling point structure. This in turn corresponds to a transition from the scenarios where descending algorithms generically fail to the scenarios where they succeed in solving phase retrieval. We also develop and implement a practical algorithmic variant that in a hybrid alternating fashion combines a barrier and a plain gradient descent. Even though the theoretical results are obtained for infinite dimensional scenarios (and consequently non-jittery parametric manifolds), we observe strong agreement between theoretical and simulated phase transition predictions for fairly small dimensions on the order of a few hundreds.
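For reference, a bare-bones descending algorithm of the kind analyzed above: gradient descent on the real-valued (rank d = 1) amplitude loss, run at a sample complexity ratio comfortably above the transition. The step size, iteration count, and random initialization are illustrative choices, not the paper's hybrid barrier variant.

```python
import numpy as np

def amplitude_flow(A, y, x, lr=0.1, iters=2000):
    """Gradient descent on sum_i (|a_i^T x| - y_i)^2 / m."""
    m = len(y)
    for _ in range(iters):
        r = A @ x
        x = x - lr * (A.T @ ((np.abs(r) - y) * np.sign(r))) / m
    return x

rng = np.random.default_rng(0)
n, m = 50, 300                                 # sample complexity ratio alpha = 6
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x_true)                         # phaseless measurements
x_hat = amplitude_flow(A, y, rng.standard_normal(n))
overlap = abs(x_hat @ x_true) / (np.linalg.norm(x_hat) * np.linalg.norm(x_true))
print(overlap)                                 # ~1 when descent succeeds (up to sign)
```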
[LG-98] CT Radiomics-Based Explainable Machine Learning Model for Accurate Differentiation of Malignant and Benign Endometrial Tumors: A Two-Center Study
链接: https://arxiv.org/abs/2506.18106
作者: Tingrui Zhang,Honglin Wu,Zekun Jiang,Yingying Wang,Rui Ye,Huiming Ni,Chang Liu,Jin Cao,Xuan Sun,Rong Shao,Xiaorong Wei,Yingchun Sun
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 30 pages, 5 figures, 3 tables
Abstract:This study aimed to develop and validate a CT radiomics-based explainable machine learning model for differentiating malignant from benign tumors in endometrial cancer (EC) patients. A total of 83 EC patients from two centers, including 46 with malignant and 37 with benign conditions, were included, with data split into a training set (n=59) and a testing set (n=24). The regions of interest (ROIs) were manually segmented from pre-surgical CT scans, and 1132 radiomic features were extracted using Pyradiomics. Six explainable machine learning modeling algorithms were implemented to determine the optimal radiomics pipeline. The diagnostic performance of the radiomic model was evaluated using sensitivity, specificity, accuracy, precision, F1 score, confusion matrices, and ROC curves. To enhance clinical understanding and usability, we separately implemented SHAP analysis and feature mapping visualization, and evaluated the calibration curve and decision curve. By comparing the six modeling strategies, the Random Forest model emerged as the optimal choice for diagnosing EC, with a training AUC of 1.00 and a testing AUC of 0.96. SHAP identified the most important radiomic features, revealing that all selected features were significantly associated with EC (P < 0.05). Radiomics feature maps also provide a feasible assessment tool for clinical applications. DCA indicated a higher net benefit for our model compared to the "All" and "None" strategies, suggesting its clinical utility in identifying high-risk cases and reducing unnecessary interventions. In conclusion, the CT radiomics-based explainable machine learning model achieved high diagnostic performance and could serve as an intelligent auxiliary tool for the diagnosis of endometrial cancer.
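The modeling stage of such a pipeline is straightforward to sketch with scikit-learn, assuming radiomic features have already been extracted from the segmented ROIs (e.g. with Pyradiomics). The toy random features below only illustrate the plumbing, so the printed AUC is near chance rather than the reported 0.96:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.standard_normal((83, 1132))            # 83 patients x 1132 radiomic features
y = rng.integers(0, 2, 83)                     # 1 = malignant, 0 = benign (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=59,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Explainability step, assuming the shap package is installed:
# import shap; shap_values = shap.TreeExplainer(clf).shap_values(X_te)
```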
[LG-99] GRASP: Grouped Regression with Adaptive Shrinkage Priors
链接: https://arxiv.org/abs/2506.18092
作者: Shu Yu Tew,Daniel F. Schmidt,Mario Boley
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce GRASP, a simple Bayesian framework for regression with grouped predictors, built on the normal beta prime (NBP) prior. The NBP prior is an adaptive generalization of the horseshoe prior with tunable hyperparameters that control tail behavior, enabling a flexible range of sparsity, from strong shrinkage to ridge-like regularization. Unlike prior work that introduced the group inverse-gamma gamma (GIGG) prior by decomposing the NBP prior into structured hierarchies, we show that directly controlling the tails is sufficient without requiring complex hierarchical constructions. Extending the non-tail adaptive grouped half-Cauchy hierarchy of Xu et al., GRASP assigns the NBP prior to both local and group shrinkage parameters, allowing adaptive sparsity within and across groups. A key contribution of this work is a novel framework to explicitly quantify correlations among shrinkage parameters within a group, providing deeper insights into grouped shrinkage behavior. We also introduce an efficient Metropolis-Hastings sampler for hyperparameter estimation. Empirical results on simulated and real-world data demonstrate the robustness and versatility of GRASP across grouped regression problems with varying sparsity and signal-to-noise ratios.
[LG-100] Identifiable Convex-Concave Regression via Sub-gradient Regularised Least Squares
链接: https://arxiv.org/abs/2506.18078
作者: William Chung
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
*备注: 21 pages, working paper
Abstract:We propose a novel nonparametric regression method that models complex input-output relationships as the sum of convex and concave components. The method, Identifiable Convex-Concave Nonparametric Least Squares (ICCNLS), decomposes the target function into additive shape-constrained components, each represented via sub-gradient-constrained affine functions. To address the affine ambiguity inherent in convex-concave decompositions, we introduce global statistical orthogonality constraints, ensuring that residuals are uncorrelated with both the intercept and the input variables. This enforces identifiability of the decomposition and improves interpretability. We further incorporate L1, L2 and elastic net regularisation on sub-gradients to enhance generalisation and promote structural sparsity. The proposed method is evaluated on synthetic and real-world datasets, including healthcare pricing data, and demonstrates improved predictive accuracy and model simplicity compared to conventional CNLS and difference-of-convex (DC) regression approaches. Our results show that statistical identifiability, when paired with convex-concave structure and sub-gradient regularisation, yields interpretable models suited for forecasting, benchmarking, and policy evaluation.
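The fitted object has a transparent functional form: a max of affine pieces (convex) plus a min of affine pieces (concave). The sketch below only evaluates such a decomposition on fixed pieces; the actual fitting via sub-gradient-constrained least squares with orthogonality constraints is the paper's contribution and is not reproduced here.

```python
import numpy as np

def iccnls_predict(x, A, b, C, d):
    """f(x) = max_i (A_i x + b_i) + min_j (C_j x + d_j)."""
    convex = (x @ A.T + b).max(axis=1)         # piecewise-linear convex part
    concave = (x @ C.T + d).min(axis=1)        # piecewise-linear concave part
    return convex + concave

x = np.linspace(-2, 2, 5)[:, None]
A, b = np.array([[-1.0], [1.0]]), np.zeros(2)  # convex part: |x|
C, d = np.array([[-0.5], [0.5]]), np.ones(2)   # concave part: 1 - |x|/2
print(iccnls_predict(x, A, b, C, d))           # = 1 + |x|/2
```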
[LG-101] AbRank: A Benchmark Dataset and Metric-Learning Framework for Antibody-Antigen Affinity Ranking
链接: https://arxiv.org/abs/2506.17857
作者: Chunan Liu,Aurelien Pelissier,Yanjun Shao,Lilian Denzler,Andrew C.R. Martin,Brooks Paige,Mariia Rodriguez Martinez
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of antibody-antigen (Ab-Ag) binding affinity is essential for therapeutic design and vaccine development, yet the performance of current models is limited by noisy experimental labels, heterogeneous assay conditions, and poor generalization across the vast antibody and antigen sequence space. We introduce AbRank, a large-scale benchmark and evaluation framework that reframes affinity prediction as a pairwise ranking problem. AbRank aggregates over 380,000 binding assays from nine heterogeneous sources, spanning diverse antibodies, antigens, and experimental conditions, and introduces standardized data splits that systematically increase distribution shift, from local perturbations such as point mutations to broad generalization across novel antigens and antibodies. To ensure robust supervision, AbRank defines an m-confident ranking framework by filtering out comparisons with marginal affinity differences, focusing training on pairs with at least an m-fold difference in measured binding strength. As a baseline for the benchmark, we introduce WALLE-Affinity, a graph-based approach that integrates protein language model embeddings with structural information to predict pairwise binding preferences. Our benchmarks reveal significant limitations in current methods under realistic generalization settings and demonstrate that ranking-based training improves robustness and transferability. In summary, AbRank offers a robust foundation for machine learning models to generalize across the antibody-antigen space, with direct relevance for scalable, structure-aware antibody therapeutic design.
[LG-102] Bayesian Inference for Left-Truncated Log-Logistic Distributions for Time-to-event Data Analysis
链接: https://arxiv.org/abs/2506.17852
作者: Fahad Mostafa,Md Rejuan Haque,Md Mostafijur Rahman,Farzana Nasrin
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 24 pages, 5 figures, 5 tables
Abstract:Parameter estimation is a foundational step in statistical modeling, enabling us to extract knowledge from data and apply it effectively. Bayesian estimation of parameters incorporates prior beliefs with observed data to infer distribution parameters probabilistically and robustly. Moreover, it provides full posterior distributions, allowing uncertainty quantification and regularization, especially useful in small or truncated samples. The left-truncated log-logistic (LTLL) distribution is particularly well-suited for modeling time-to-event data where observations are subject to a known lower bound, such as precipitation data and cancer survival times. In this paper, we propose a Bayesian approach for estimating the parameters of the LTLL distribution with a fixed truncation point x_L > 0. Given a random variable X \sim LL(\alpha, \beta; x_L), where \alpha > 0 is the scale parameter and \beta > 0 is the shape parameter, the likelihood function is derived based on a truncated sample X_1, X_2, \dots, X_N with X_i > x_L. We assume independent prior distributions for the parameters, and the posterior inference is conducted via Markov chain Monte Carlo sampling, specifically using the Metropolis-Hastings algorithm to obtain posterior estimates \hat{\alpha} and \hat{\beta}. Through simulation studies and real-world applications, we demonstrate that Bayesian estimation provides more stable and reliable parameter estimates, particularly when the likelihood surface is irregular due to left truncation. The results highlight the advantages of Bayesian inference for quantifying parameter uncertainty in truncated distributions for time-to-event data analysis.
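A self-contained sketch of the sampler follows. Flat priors on log alpha and log beta are assumed here for brevity (the paper specifies its own independent priors), and the truncated likelihood divides the log-logistic density by the survival function at x_L:

```python
import numpy as np

def loglik_ltll(x, alpha, beta, xL):
    """Log-likelihood of the left-truncated log-logistic: f(x) / (1 - F(xL))."""
    logf = (np.log(beta) - np.log(alpha)
            + (beta - 1) * (np.log(x) - np.log(alpha))
            - 2 * np.log1p((x / alpha) ** beta))
    logS_xL = -np.log1p((xL / alpha) ** beta)  # log survival: S(xL) = 1/(1+(xL/a)^b)
    return logf.sum() - len(x) * logS_xL

def metropolis_hastings(x, xL, n_iter=20000, step=0.05, seed=0):
    """Random-walk MH on (log alpha, log beta)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                        # start at alpha = beta = 1
    ll = loglik_ltll(x, *np.exp(theta), xL)
    samples = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(2)
        ll_prop = loglik_ltll(x, *np.exp(prop), xL)
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        samples.append(np.exp(theta))
    return np.array(samples[n_iter // 2:])     # discard burn-in

# Simulate a left-truncated sample from LL(alpha=2, beta=3) with xL = 1
rng = np.random.default_rng(1)
u = rng.uniform(size=5000)
x = 2.0 * (u / (1 - u)) ** (1 / 3.0)           # inverse-CDF draws from LL(2, 3)
x = x[x > 1.0][:500]
print(metropolis_hastings(x, 1.0).mean(axis=0))  # posterior means near (2, 3)
```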
[LG-103] Quantum-Hybrid Support Vector Machines for Anomaly Detection in Industrial Control Systems
链接: https://arxiv.org/abs/2506.17824
作者: Tyler Cultice,Md. Saif Hassan Onim,Annarita Giani,Himanshu Thapliyal
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages, 6 tables, 10 figures
Abstract:Sensitive data captured by Industrial Control Systems (ICS) play a large role in the safety and integrity of many critical infrastructures. Detection of anomalous or malicious data, or Anomaly Detection (AD), with machine learning is one of many vital components of cyberphysical security. Quantum kernel-based machine learning methods have shown promise in identifying complex anomalous behavior by leveraging the highly expressive and efficient feature spaces of quantum computing. This study focuses on the parameterization of Quantum Hybrid Support Vector Machines (QSVMs) using three popular datasets from Cyber-Physical Systems (CPS). The results demonstrate that QSVMs outperform traditional classical kernel methods, achieving 13.3% higher F1 scores. Additionally, this research investigates noise using simulations based on real IBMQ hardware, revealing a maximum error of only 0.98% in the QSVM kernels. This error results in an average reduction of 1.57% in classification metrics. Furthermore, the study found that QSVMs show a 91.023% improvement in kernel-target alignment compared to classical methods, indicating a potential “quantum advantage” in anomaly detection for critical infrastructures. This effort suggests that QSVMs can provide a substantial advantage in anomaly detection for ICS, ultimately enhancing the security and integrity of critical infrastructures.
[LG-104] Derandomizing Simultaneous Confidence Regions for Band-Limited Functions by Improved Norm Bounds and Majority-Voting Schemes
链接: https://arxiv.org/abs/2506.17764
作者: Balázs Csanád Csáji,Bálint Horváth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注:
Abstract:Band-limited functions are fundamental objects that are widely used in systems theory and signal processing. In this paper we refine a recent nonparametric, nonasymptotic method for constructing simultaneous confidence regions for band-limited functions from noisy input-output measurements, by working in a Paley-Wiener reproducing kernel Hilbert space. Kernel norm bounds are tightened using a uniformly-randomized Hoeffding’s inequality for small samples and an empirical Bernstein bound for larger ones. We derive an approximate threshold, based on the sample size and how informative the inputs are, that governs which bound to deploy. Finally, we apply majority voting to aggregate confidence sets from random subsamples, boosting both stability and region size. We prove that even per-input aggregated intervals retain their simultaneous coverage guarantee. These refinements are also validated through numerical experiments.
[LG-105] Rethinking the Role of Operating Conditions for Learning-based Multi-condition Fault Diagnosis
链接: https://arxiv.org/abs/2506.17740
作者: Pengyu Han,Zeyi Liu,Shijin Chen,Dongliang Zou,Xiao He
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, conference
Abstract:Multi-condition fault diagnosis is prevalent in industrial systems and presents substantial challenges for conventional diagnostic approaches. The discrepancy in data distributions across different operating conditions degrades model performance when a model trained under one condition is applied to others. With the recent advancements in deep learning, transfer learning has been introduced to the fault diagnosis field as a paradigm for addressing multi-condition fault diagnosis. Among these methods, domain generalization approaches can handle complex scenarios by extracting condition-invariant fault features. Although many studies have considered fault diagnosis in specific multi-condition scenarios, the extent to which operating conditions affect fault information has been scarcely studied, which is crucial. When operating conditions have a significant impact on fault features, directly applying domain generalization methods may lead the model to learn condition-specific information, thereby reducing its overall generalization ability. This paper investigates the performance of existing end-to-end domain generalization methods under varying conditions, specifically in variable-speed and variable-load scenarios, using multiple experiments on a real-world gearbox. Additionally, a two-stage diagnostic framework is proposed, aiming to improve fault diagnosis performance under scenarios with significant operating condition impacts. By incorporating a domain-generalized encoder with a retraining strategy, the framework is able to extract condition-invariant fault features while simultaneously alleviating potential overfitting to the source domain. Several experiments on a real-world gearbox dataset are conducted to validate the effectiveness of the proposed approach.
[LG-106] Advanced Modeling for Exoplanet Detection and Characterization
链接: https://arxiv.org/abs/2506.17665
作者: Krishna Chamarthy
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:
Abstract:Research into light curves from stars (temporal variation of brightness) has completely changed how exoplanets are discovered and characterised. This study uses star light curves from the Kepler dataset to discover exoplanets (planetary transits) and to estimate their physical characteristics from the light curve with machine learning methods. The dataset consists of measured flux recordings for many individual stars; we examine the light curve of each star and look for periodic dips in brightness caused by an astronomical body making a transit. We apply variables derived from an established method for extracting measurements from light curve data to estimate key parameters of the planet observed during the transit, such as distance to the host star, orbital period, and radius. The orbital period is typically measured from the time between subsequent transits, and the radius from the depth of the transit. The density of the star and planet can also be estimated from the transit event, as well as very limited information on the albedo (reflectivity) and atmosphere of the planet based on transmission spectroscopy and/or analysis of the phase curve of flux levels. In addition to these methods, we employ machine learning classification of the stars (i.e., likely to have an exoplanet or not) based on flux changes. This could both make the search for exoplanets more efficient and provide important parameters for the planet, offering a much quicker means of searching vast astronomical datasets for likely exoplanets.
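Once a candidate period is found (typically via box least squares, e.g. astropy's BoxLeastSquares), the characterization step is simple: fold the light curve, measure the transit depth, and convert it to a radius ratio via depth = (Rp/Rs)^2. A toy sketch with a synthetic 1%-deep transit:

```python
import numpy as np

def transit_parameters(time, flux, period, t0, duration):
    """Fold on the period and estimate depth and radius ratio Rp/Rs."""
    phase = ((time - t0 + 0.5 * period) % period) - 0.5 * period
    in_transit = np.abs(phase) < 0.5 * duration
    depth = np.median(flux[~in_transit]) - np.median(flux[in_transit])
    return depth, np.sqrt(depth)               # depth = (Rp/Rs)^2

# Toy light curve: 1%-deep box-shaped transits every 10 days
rng = np.random.default_rng(0)
t = np.arange(0, 90, 0.02)
flux = 1.0 + 5e-4 * rng.standard_normal(t.size)
flux[((t + 1.0) % 10.0) < 0.2] -= 0.01         # transits of duration 0.2 d
depth, ratio = transit_parameters(t, flux, period=10.0, t0=-0.9, duration=0.2)
print(depth, ratio)                             # ~0.01 and ~0.1
```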
[LG-107] Scalable Machine Learning Algorithms using Path Signatures
链接: https://arxiv.org/abs/2506.17634
作者: Csaba Tóth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: PhD thesis
Abstract:The interface between stochastic analysis and machine learning is a rapidly evolving field, with path signatures - iterated integrals that provide faithful, hierarchical representations of paths - offering a principled and universal feature map for sequential and structured data. Rooted in rough path theory, path signatures are invariant to reparameterization and well-suited for modelling evolving dynamics, long-range dependencies, and irregular sampling - common challenges in real-world time series and graph data. This thesis investigates how to harness the expressive power of path signatures within scalable machine learning pipelines. It introduces a suite of models that combine theoretical robustness with computational efficiency, bridging rough path theory with probabilistic modelling, deep learning, and kernel methods. Key contributions include: Gaussian processes with signature kernel-based covariance functions for uncertainty-aware time series modelling; the Seq2Tens framework, which employs low-rank tensor structure in the weight space for scalable deep modelling of long-range dependencies; and graph-based models where expected signatures over graphs induce hypo-elliptic diffusion processes, offering expressive yet tractable alternatives to standard graph neural networks. Further developments include Random Fourier Signature Features, a scalable kernel approximation with theoretical guarantees, and Recurrent Sparse Spectrum Signature Gaussian Processes, which combine Gaussian processes, signature kernels, and random features with a principled forgetting mechanism for multi-horizon time series forecasting with adaptive context length. We hope this thesis serves as both a methodological toolkit and a conceptual bridge, and provides a useful reference for the current state of the art in scalable, signature-based learning for sequential and structured data.
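As a flavor of the core object, the signature truncated at level 2 is a few lines of numpy for a piecewise-linear path: level 1 is the total increment, and the antisymmetric combination S[i, j] - S[j, i] of the level-2 iterated integrals equals twice the enclosed area for a closed planar loop.

```python
import numpy as np

def signature_level2(path):
    """Truncated signature of a piecewise-linear path: level-1 increments and
    level-2 iterated integrals S[i, j] = int x^i dx^j, accumulated segment by
    segment (Chen's identity)."""
    dx = np.diff(path, axis=0)                 # (T-1, d) segment increments
    d = path.shape[1]
    level2 = np.zeros((d, d))
    acc = np.zeros(d)                          # running level-1 signature
    for step in dx:
        level2 += np.outer(acc, step) + 0.5 * np.outer(step, step)
        acc += step
    return acc, level2

t = np.linspace(0, 1, 100)
path = np.c_[np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)]   # one closed loop
lvl1, lvl2 = signature_level2(path)
print(lvl1)                                    # ~0: a closed loop has no net increment
print(lvl2[0, 1] - lvl2[1, 0])                 # ~2*pi: twice the enclosed area
```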
[LG-108] UT-GraphCast Hindcast Dataset: A Global AI Forecast Archive from UT Austin for Weather and Climate Applications
链接: https://arxiv.org/abs/2506.17453
作者: Naveen Sudharsan,Manmeet Singh,Harsh Kamath,Hassan Dashtian,Clint Dawson,Zong-Liang Yang,Dev Niyogi
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:The UT GraphCast Hindcast Dataset from 1979 to 2024 is a comprehensive global weather forecast archive generated using the Google DeepMind GraphCast Operational model. Developed by researchers at The University of Texas at Austin under the WCRP umbrella, this dataset provides daily 15-day deterministic forecasts at 00 UTC on an approximately 25 km global grid for a 45-year period. GraphCast is a physics-informed graph neural network that was trained on ECMWF ERA5 reanalysis. It predicts more than a dozen key atmospheric and surface variables on 37 vertical levels, delivering a full medium-range forecast in under one minute on modern hardware.
[LG-109] Sequence-to-Sequence Models with Attention Mechanistically Map to the Architecture of Human Memory Search
链接: https://arxiv.org/abs/2506.17424
作者: Nikolaus Salvatore,Qiong Zhang
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:
Abstract:Past work has long recognized the important role of context in guiding how humans search their memory. While context-based memory models can explain many memory phenomena, it remains unclear why humans develop such architectures over possible alternatives in the first place. In this work, we demonstrate that foundational architectures in neural machine translation – specifically, recurrent neural network (RNN)-based sequence-to-sequence models with attention – exhibit mechanisms that directly correspond to those specified in the Context Maintenance and Retrieval (CMR) model of human memory. Since neural machine translation models have evolved to optimize task performance, their convergence with human memory models provides a deeper understanding of the functional role of context in human memory, as well as presenting new ways to model human memory. Leveraging this convergence, we implement a neural machine translation model as a cognitive model of human memory search that is both interpretable and capable of capturing complex dynamics of learning. We show that our model accounts for both averaged and optimal human behavioral patterns as effectively as context-based memory models. Further, we demonstrate additional strengths of the proposed model by evaluating how memory search performance emerges from the interaction of different model components.
[LG-110] Gaussian Processes and Reproducing Kernels: Connections and Equivalences
链接: https://arxiv.org/abs/2506.17366
作者: Motonobu Kanagawa,Philipp Hennig,Dino Sejdinovic,Bharath K. Sriperumbudur
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Statistics Theory (math.ST)
*备注: 172 pages
Abstract:This monograph studies the relations between two approaches using positive definite kernels: probabilistic methods using Gaussian processes, and non-probabilistic methods using reproducing kernel Hilbert spaces (RKHS). They are widely studied and used in machine learning, statistics, and numerical analysis. Connections and equivalences between them are reviewed for fundamental topics such as regression, interpolation, numerical integration, distributional discrepancies, and statistical dependence, as well as for sample path properties of Gaussian processes. A unifying perspective for these equivalences is established, based on the equivalence between the Gaussian Hilbert space and the RKHS. The monograph serves as a basis to bridge many other methods based on Gaussian processes and reproducing kernels, which are developed in parallel by the two research communities.
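One of the central equivalences is easy to verify numerically: the GP posterior mean with observation noise variance sigma^2 coincides with the kernel ridge regression estimate with regularization alpha = sigma^2 and the same kernel (RBF length scale 1 corresponds to gamma = 0.5 in scikit-learn's parameterization).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)
sigma2 = 0.01

gp = GaussianProcessRegressor(kernel=RBF(1.0), alpha=sigma2,
                              optimizer=None).fit(X, y)
krr = KernelRidge(kernel="rbf", gamma=0.5, alpha=sigma2).fit(X, y)

X_test = np.linspace(0, 5, 7)[:, None]
print(np.allclose(gp.predict(X_test), krr.predict(X_test)))  # True
```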
[LG-111] CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning
链接: https://arxiv.org/abs/2506.17345
作者: Changwen Xu,Shang Zhu,Venkatasubramanian Viswanathan
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 36 pages, 11 pages of Supporting Information
Abstract:The prediction of crystal properties is essential for understanding structure-property relationships and accelerating the discovery of functional materials. However, conventional approaches relying on experimental measurements or density functional theory (DFT) calculations are often resource-intensive, limiting their scalability. Machine learning (ML) models offer a promising alternative by learning complex structure-property relationships from data, enabling faster predictions. Yet, existing ML models often rely on labeled data, adopt representations that poorly capture essential structural characteristics, and lack integration with physical principles–factors that limit their generalizability and interpretability. Here, we introduce CLOUD (Crystal Language mOdel for Unified and Differentiable materials modeling), a transformer-based framework trained on a novel Symmetry-Consistent Ordered Parameter Encoding (SCOPE) that encodes crystal symmetry, Wyckoff positions, and composition in a compact, coordinate-free string representation. Pre-trained on over six million crystal structures, CLOUD is fine-tuned on multiple downstream tasks and achieves competitive performance in predicting a wide range of material properties, demonstrating strong scaling performance. Furthermore, as proof of concept of differentiable materials modeling, CLOUD is applied to predict the phonon internal energy and heat capacity, which integrates the Debye model to preserve thermodynamic consistency. The CLOUD-DEBYE framework enforces thermodynamic consistency and enables temperature-dependent property prediction without requiring additional data. These results demonstrate the potential of CLOUD as a scalable and physics-informed foundation model for crystalline materials, unifying symmetry-consistent representations with physically grounded learning for property prediction and materials discovery.
[LG-112] Differentiable neural network representation of multi-well locally-convex potentials
链接: https://arxiv.org/abs/2506.17242
作者: Reese E. Jones,Adrian Buganza Tepole,Jan N. Fuhg
类目: Machine Learning (stat.ML); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 16 pages, 13 figures
Abstract:Multi-well potentials are ubiquitous in science, modeling phenomena such as phase transitions, dynamic instabilities, and multimodal behavior across physics, chemistry, and biology. In contrast to non-smooth minimum-of-mixture representations, we propose a differentiable and convex formulation based on a log-sum-exponential (LSE) mixture of input convex neural network (ICNN) modes. This log-sum-exponential input convex neural network (LSE-ICNN) provides a smooth surrogate that retains convexity within basins and allows for gradient-based learning and inference. A key feature of the LSE-ICNN is its ability to automatically discover both the number of modes and the scale of transitions through sparse regression, enabling adaptive and parsimonious modeling. We demonstrate the versatility of the LSE-ICNN across diverse domains, including mechanochemical phase transformations, microstructural elastic instabilities, conservative biological gene circuits, and variational inference for multimodal probability distributions. These examples highlight the effectiveness of the LSE-ICNN in capturing complex multimodal landscapes while preserving differentiability, making it broadly applicable in data-driven modeling, optimization, and physical simulation.
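The construction is easy to illustrate with quadratic stand-ins for the ICNN modes: a log-sum-exp mixture acts as a smooth minimum, producing a differentiable multi-well landscape that stays convex within each basin. The quadratic modes and temperature below are illustrative; the paper uses ICNN modes and learns their number and scale.

```python
import numpy as np

def lse_multiwell(x, centers, temp=0.1):
    """Smooth minimum of convex wells: -temp * log sum_i exp(-f_i(x)/temp)."""
    modes = np.stack([0.5 * (x - c) ** 2 for c in centers])  # convex quadratics
    return -temp * np.log(np.exp(-modes / temp).sum(axis=0))

x = np.linspace(-2, 2, 9)
print(lse_multiwell(x, centers=[-1.0, 1.0]))   # double well with minima near +/-1
```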
[LG-113] Coupled Entropy: A Goldilocks Generalization?
链接: https://arxiv.org/abs/2506.17229
作者: Kenric P. Nelson
类目: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 8 pages; draft paper for Conference on Nonextensive Statistical Physics Dedicated to Constantino Tsallis’ 82nd Birthday
Abstract:Nonextensive Statistical Mechanics (NSM) has developed into a powerful toolset for modeling and analyzing complex systems. Despite its many successes, a puzzle arose early in its development. The constraints on the Tsallis entropy are in the form of an escort distribution with elements proportional to p_i^q, but this same factor within the Tsallis entropy function is not normalized. This led to consideration of the Normalized Tsallis Entropy (NTE); however, the normalization proved to make the function unstable. I will provide evidence that the coupled entropy, which divides NTE by 1 + d\kappa, where d is the dimension and \kappa is the coupling, may provide the robustness necessary for applications like machine learning. The definition of the coupled entropy and its maximizing distributions, the coupled exponential family, arises from clarifying how the number of independent random variables (q) is composed of the nonlinear properties of complex systems, q = 1+\frac{\alpha\kappa}{1+d\kappa}, where \alpha is the nonlinear parameter governing the shape of distributions near their location and \kappa is the parameter determining the asymptotic tail decay. Foundationally, for complex systems, the coupling is the measure of nonlinearity inducing non-exponential distributions and the degree of nonadditivity of the entropy. As such, the coupling is a strong candidate as a measure of statistical complexity.
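Taking the abstract's description at face value (the normalized Tsallis entropy divided by 1 + d*kappa, with q given by the stated composition rule), a sketch of the computation looks like this; the authoritative definition should be taken from the paper:

```python
import numpy as np

def coupled_entropy(p, alpha, kappa, d=1):
    """Coupled entropy as described above: NTE / (1 + d*kappa), with
    q = 1 + alpha*kappa / (1 + d*kappa)."""
    q = 1 + alpha * kappa / (1 + d * kappa)
    pq = np.sum(p ** q)
    tsallis = (1 - pq) / (q - 1)               # Tsallis entropy
    nte = tsallis / pq                         # normalized Tsallis entropy
    return nte / (1 + d * kappa)

p = np.array([0.5, 0.3, 0.2])
print(coupled_entropy(p, alpha=2.0, kappa=0.5))
```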
信息检索
[IR-0] An Audio-centric Multi-task Learning Framework for Streaming Ads Targeting on Spotify KDD2025
链接: https://arxiv.org/abs/2506.18735
作者: Shivam Verma,Vivian Chen,Darren Mei
类目: Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: Accepted at KDD 2025
Abstract:Spotify, a large-scale multimedia platform, attracts over 675 million monthly active users who collectively consume millions of hours of music, podcasts, audiobooks, and video content. This diverse content consumption pattern introduces unique challenges for computational advertising, which must effectively integrate a variety of ad modalities, including audio, video, and display, within a single user experience. Traditional ad recommendation models, primarily designed for foregrounded experiences, often struggle to reconcile the platform’s inherent audio-centrality with the demands of optimizing ad performance across multiple formats and modalities. To overcome these challenges, we introduce Cross-modal Adaptive Mixture-of-Experts (CAMoE), a novel framework for optimizing click-through rate (CTR) prediction in both audio-centric and multi-modal settings. CAMoE enhances traditional mixture-of-experts models by incorporating modality-aware task grouping, adaptive loss masking, and deep-cross networks (DCN) to capture complex feature interactions within a multi-modal ad ecosystem. Through extensive ablation studies, we demonstrate that this approach achieves near Pareto-optimal performance across audio, video, and display ad formats, significantly improving AUC-PR compared to conventional single-task and content-based multi-task learning baselines. When deployed at scale on Spotify’s ad serving platform, CAMoE delivered substantial gains, yielding a 14.5% increase in CTR for audio ads, a 1.3% increase for video ads, and a 4.8% reduction in expected cost-per-click (eCPC) for audio slots.
[IR-1] Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation
链接: https://arxiv.org/abs/2506.18670
作者: Jingming Liu,Yumeng Li,Wei Shi,Yao-Xiang Ding,Hui Su,Kun Zhou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent studies have proposed leveraging Large Language Models (LLMs) as information retrievers through query rewriting. However, for challenging corpora, we argue that enhancing queries alone is insufficient for robust semantic matching; the LLM should also have sufficient understanding of the corpus by directly handling and augmenting the documents themselves. To this end, we present an LLM-based retriever empowered to augment both user queries and corpus documents, with its policy fully explored via reinforcement learning (RL) and minimal human inductive bias. Notably, we find that simply allowing the LLM to modify documents yields little benefit unless paired with our carefully designed bidirectional RL framework, which enables the LLM to simultaneously learn and collaborate on both query and document augmentation policies. A key technical challenge in realizing such a framework lies in jointly updating both policies during training, where the rewards for the two directions depend on each other, making their entangled reward intractable. Our approach addresses this by introducing a reward sampling strategy and a specifically designed RL algorithm that enables effective training with these sampled rewards. Experimental results demonstrate that our approach significantly enhances LLM-based retrieval performance in both sparse and dense settings, particularly in difficult retrieval domains, and achieves strong cross-benchmark generalization. Our code is released at this https URL.
[IR-2] Rethinking Click Models in Light of Carousel Interfaces: Theory-Based Categorization and Design of Click Models ICTIR2025
链接: https://arxiv.org/abs/2506.18548
作者: Jingwei Kang,Maarten de Rijke,Santiago de Leon-Martinez,Harrie Oosterhuis
类目: Information Retrieval (cs.IR)
*备注: Accepted by ICTIR 2025
Abstract:Click models are a well-established approach for modeling user interactions with web interfaces. Previous work has mainly focused on traditional single-list web search settings; this includes existing surveys that introduced categorizations based on the first generation of probabilistic graphical model (PGM) click models that have become standard. However, these categorizations have become outdated, as their conceptualizations are unable to meaningfully compare PGM with neural network (NN) click models nor generalize to newer interfaces, such as carousel interfaces. We argue that this outdated view fails to adequately explain the fundamentals of click model designs, thus hindering the development of novel click models. This work reconsiders what should be the fundamental concepts in click model design, grounding them, unlike previous approaches, in their mathematical properties. We propose three fundamental key design choices that explain what statistical patterns a click model can capture, and thus, indirectly, what user behaviors it can capture. Based on these choices, we create a novel click model taxonomy that allows a meaningful comparison of all existing click models; this is the first taxonomy of single-list, grid and carousel click models that includes PGMs and NNs. Finally, we show how our conceptualization provides a foundation for future click model design by an example derivation of a novel design for carousel interfaces.
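For readers unfamiliar with the first-generation PGM click models the taxonomy starts from, the position-based model (PBM) is the canonical example: a click at rank r requires examination (rank-dependent) and attractiveness (document-dependent), P(click) = gamma_r * alpha_qd. Carousel interfaces break the single-ranked-list assumption baked into gamma_r, which is exactly the kind of design choice the proposed taxonomy makes explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = np.array([0.9, 0.6, 0.4, 0.2, 0.1])    # examination probability by rank
alpha = np.array([0.8, 0.3, 0.5, 0.2, 0.7])    # attractiveness of the ranked docs
clicks = rng.uniform(size=(100_000, 5)) < gamma * alpha   # simulated sessions
print(clicks.mean(axis=0))                     # empirical CTR ~= gamma * alpha
```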
[IR-3] Comparative Analysis of Lion and AdamW Optimizers for Cross-Encoder Reranking with MiniLM GTE and ModernBERT
链接: https://arxiv.org/abs/2506.18297
作者: Shahil Kumar,Manu Pande,Anay Yatin Damle
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Modern information retrieval systems often employ a two-stage pipeline: an efficient initial retrieval stage followed by a computationally intensive reranking stage. Cross-encoders have shown strong effectiveness for reranking due to their deep analysis of query-document pairs. This paper studies the impact of the Lion optimizer, a recent alternative to AdamW, during fine-tuning of cross-encoder rerankers. We fine-tune three transformer models (MiniLM, GTE, and ModernBERT) on the MS MARCO passage ranking dataset using both optimizers. GTE and ModernBERT support extended context lengths (up to 8192 tokens). We evaluate effectiveness using the TREC 2019 Deep Learning Track and the MS MARCO dev set (MRR@10). Experiments, run on the Modal cloud platform, reveal that ModernBERT with Lion achieves the best NDCG@10 (0.7225) and MAP (0.5121) on TREC DL 2019, while MiniLM with Lion ties ModernBERT for MRR@10 (0.5988) on MS MARCO dev. Lion also provides superior GPU efficiency, improving utilization by 2.67% to 10.33% across models. We analyze performance trends using standard IR metrics and discuss the optimizer's impact on training dynamics across architectures.
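For context, the Lion update that the study compares against AdamW is tiny: the step direction is the sign of an interpolated momentum, with decoupled weight decay. A numpy sketch on a toy quadratic (the beta values follow the Lion paper's defaults; the learning rate here is a toy choice, and the fine-tuning setup above would use the PyTorch implementation):

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-2, beta1=0.9, beta2=0.99, wd=0.01):
    """One Lion update (Chen et al., 2023)."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)   # sign step direction
    param = param - lr * (update + wd * param)         # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                 # momentum tracks gradients
    return param, m

w, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(1000):
    w, m = lion_step(w, 2 * w, m)                      # gradient of ||w||^2
print(w)                                               # near 0, at step-size scale
```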
[IR-4] A novel fast short-time Root-MUSIC method for vibration monitoring of high-speed spindles
链接: https://arxiv.org/abs/2506.17600
作者: Huiguang Zhang,Baoguo Liu,Wei Feng,Zongtang Li
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Ultra-high-speed spindle bearings challenge traditional vibration monitoring due to broadband noise, non-stationarity, and limited time-frequency resolution. We present a fast Short-Time Root-MUSIC (fSTrM) algorithm that exploits FFT-accelerated Lanczos bidiagonalization to reduce computational complexity from \mathcal{O}(N^3) to \mathcal{O}(SN\log_2 N + S^2(N+S) + M^2(N+M)) while preserving parametric super-resolution. The method constructs Hankel matrices from 16 ms signal frames and extracts fault frequencies through polynomial rooting on the unit circle. Experimental validation on the Politecnico di Torino bearing dataset demonstrates breakthrough micro-defect detection capabilities. The algorithm reliably identifies 150 \mu m defects (previously undetectable by conventional methods), providing 72+ hours of additional warning time. Compared to STFT and wavelet methods, fSTrM achieves 1.2 Hz frequency resolution (vs. 12.5 Hz), a 93% detection rate at -5 dB SNR, and quantifies defect severity through harmonic content analysis. Critically, the algorithm processes each frame in 2.4 ms on embedded ARM Cortex-M7 hardware, enabling real-time deployment. This advancement transforms bearing monitoring from failure prevention to continuous degradation assessment, establishing a new paradigm for predictive maintenance in aerospace and precision machining.
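For orientation, the classical (unaccelerated) Root-MUSIC step that fSTrM speeds up looks like this on one frame: build a Hankel matrix, split off the noise subspace, and root the noise-subspace polynomial, keeping the roots nearest the unit circle. The Lanczos/FFT acceleration, frame handling, and fault-frequency logic of the paper are not reproduced here.

```python
import numpy as np
from scipy.linalg import hankel, svd

def root_music(frame, n_sources, fs, L=64):
    """Classical Root-MUSIC frequency estimation on a single frame."""
    H = hankel(frame[:L], frame[L - 1:])       # L x (N - L + 1) Hankel matrix
    U, _, _ = svd(H, full_matrices=False)
    En = U[:, n_sources:]                      # noise subspace
    C = En @ En.conj().T
    # Polynomial coefficients are the sums along the diagonals of C
    coeffs = np.array([np.trace(C, offset=k) for k in range(L - 1, -L, -1)])
    roots = np.roots(coeffs)
    roots = roots[np.abs(roots) < 1]           # one of each reciprocal pair
    roots = roots[np.argsort(np.abs(np.abs(roots) - 1))][:n_sources]
    return np.sort(np.angle(roots)) * fs / (2 * np.pi)

fs = 1000.0
t = np.arange(512) / fs                        # one short frame
x = np.sin(2 * np.pi * 98.7 * t) + 0.5 * np.sin(2 * np.pi * 157.3 * t)
x += 0.05 * np.random.default_rng(0).standard_normal(t.size)
print(root_music(x, n_sources=4, fs=fs))       # ~ +/-98.7 and +/-157.3 Hz
```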
[IR-5] PreQRAG – Classify and Rewrite for Enhanced RAG SIGIR2025
链接: https://arxiv.org/abs/2506.17493
作者: Damian Martinez,Catalina Riano,Hui Fang
类目: Information Retrieval (cs.IR)
*备注: 7 pages, SIGIR 2025 LiveRAG
Abstract:This paper presents the submission of the UDInfo team to the SIGIR 2025 LiveRAG Challenge. We introduce PreQRAG, a Retrieval Augmented Generation (RAG) architecture designed to improve retrieval and generation quality through targeted question preprocessing. PreQRAG incorporates a pipeline that first classifies each input question as either single-document or multi-document type. For single-document questions, we employ question rewriting techniques to improve retrieval precision and generation relevance. For multi-document questions, we decompose complex queries into focused sub-questions that can be processed more effectively by downstream components. This classification and rewriting strategy improves the RAG performance. Experimental evaluation of the LiveRAG Challenge dataset demonstrates the effectiveness of our question-type-aware architecture, with PreQRAG achieving the preliminary second place in Session 2 of the LiveRAG challenge.
[IR-6] Automating Financial Statement Audits with Large Language Models
链接: https://arxiv.org/abs/2506.17282
作者: Rushi Wang,Jiateng Liu,Weijie Zhao,Shenglan Li,Denghui Zhang
类目: Information Retrieval (cs.IR)
*备注: 14 pages
Abstract:Financial statement auditing is essential for stakeholders to understand a company’s financial health, yet current manual processes are inefficient and error-prone. Even with extensive verification procedures, auditors frequently miss errors, leading to inaccurate financial statements that fail to meet stakeholder expectations for transparency and reliability. To this end, we harness large language models (LLMs) to automate financial statement auditing and rigorously assess their capabilities, providing insights on their performance boundaries in the scenario of automated auditing. Our work introduces a comprehensive benchmark using a curated dataset combining real-world financial tables with synthesized transaction data. In the benchmark, we developed a rigorous five-stage evaluation framework to assess LLMs’ auditing capabilities. The benchmark also challenges models to map specific financial statement errors to corresponding violations of accounting standards, simulating real-world auditing scenarios through test cases. Our testing reveals that current state-of-the-art LLMs successfully identify financial statement errors when given historical transaction data. However, these models demonstrate significant limitations in explaining detected errors and citing relevant accounting standards. Furthermore, LLMs struggle to execute complete audits and make necessary financial statement revisions. These findings highlight a critical gap in LLMs’ domain-specific accounting knowledge. Future research must focus on enhancing LLMs’ understanding of auditing principles and procedures. Our benchmark and evaluation framework establish a foundation for developing more effective automated auditing tools that will substantially improve the accuracy and efficiency of real-world financial statement auditing.