This blog post contains the latest paper list retrieved from Arxiv.org on 2025-06-04. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 every morning.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-06-04)

A total of 742 papers were updated today, including:

  • Natural Language Processing: 142 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 232 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 177 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 220 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Causal Estimation of Tokenisation Bias ACL2025

[Quick Read] This paper addresses tokenisation bias in language models, i.e., the effect that the choice of tokeniser has on the probability a model assigns to a character string. The key to the solution is to treat tokenisation bias as a causal effect and estimate it with a regression discontinuity design: because tokenisation algorithms rank subwords and admit only the top-ranked ones into the vocabulary, similar subwords on either side of the cutoff can be compared to quantify the effect.

Link: https://arxiv.org/abs/2506.03149
Authors: Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel
Affiliations: University of Cambridge; ETH Zürich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published as a conference paper at ACL 2025

Abstract:Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser – which maps character-strings to subwords – should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., ⟨hello⟩) in a tokeniser’s vocabulary on the probability a trained model assigns to the corresponding characters (i.e., “hello”). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first K to a tokeniser’s vocabulary, where K is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models’ outputs across scales, vocabularies, and tokenisers. Notably, a subword’s presence in a small model’s vocabulary may increase its characters’ probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
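
To make the estimation idea concrete, here is a toy Python sketch of a regression discontinuity comparison around a vocabulary cutoff. The data is synthetic, and the raw difference-in-means estimator is a simplification of the local regression the paper's setting would call for; `rank` and `log_prob` are hypothetical stand-ins for a tokeniser's subword ranking and a trained model's character-string log-probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 10_000, 5_000          # ranked subwords; the first K enter the vocabulary
rank = np.arange(n)
in_vocab = rank < K

# Simulated outcome: a smooth trend in rank plus a jump at the cutoff
# (the "tokenisation bias" we want to recover).
true_effect = 1.5
log_prob = -1e-4 * rank + true_effect * in_vocab + rng.normal(0, 0.5, n)

def rdd_estimate(rank, y, cutoff, bandwidth):
    """Difference in means within a small window around the cutoff."""
    left = (rank >= cutoff - bandwidth) & (rank < cutoff)   # just inside the vocab
    right = (rank >= cutoff) & (rank < cutoff + bandwidth)  # just outside
    return y[left].mean() - y[right].mean()

print(f"estimated effect at cutoff: {rdd_estimate(rank, log_prob, K, 200):.2f} "
      f"(true: {true_effect})")
```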

[NLP-1] UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

[Quick Read] This paper tackles the limitations of existing unified models on image perception and manipulation tasks, which are in urgent demand for real-world applications. The key to the solution is to build the unified generative framework UniWorld on semantic features provided by powerful vision-language models rather than on the conventional variational autoencoder (VAE). Using only 1% of BAGEL's training data, the framework consistently outperforms BAGEL on image editing benchmarks while remaining competitive in image understanding and generation.

Link: https://arxiv.org/abs/2506.03147
Authors: Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Affiliations: Peking University, Shenzhen Graduate School; Peng Cheng Laboratory; Rabbitpre AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, their models are limited in exploring image perception and manipulation tasks, which are urgently desired by users for wide applications. Recently, OpenAI released their powerful GPT-4o-Image model for comprehensive image perception and manipulation, achieving expressive capability and attracting community interests. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders instead of VAE, while VAEs are considered essential components in many image manipulation models. Motivated by such inspiring observations, we present a unified generative framework named UniWorld based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% amount of BAGEL’s data, which consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.

[NLP-2] Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM

[Quick Read] This paper addresses the problem of accurately retrieving existing information and discovering new insights from scattered neuroscience literature, where current state-of-the-art retrieval methods struggle with knowledge spread across multiple sources. The key to the solution is to construct a knowledge graph (KG) from an unlabeled, large-scale corpus of neuroscience research using a large language model (LLM), a neuroscience ontology, and text embeddings, and to extract knowledge from the KG with an entity-augmented information retrieval algorithm, which substantially improves knowledge discovery.

Link: https://arxiv.org/abs/2506.03145
Authors: Pralaypati Ta, Sriram Venkatesaperumal, Keerthi Ram, Mohanasankar Sivaprakasam
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neuroscience research publications encompass a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential for advancing the field. However, when knowledge is dispersed across multiple sources, current state-of-the-art retrieval methods often struggle to extract the necessary information. A knowledge graph (KG) can integrate and link knowledge from multiple sources, but existing methods for constructing KGs in neuroscience often rely on labeled data and require domain expertise. Acquiring large-scale, labeled data for a specialized area like neuroscience presents significant challenges. This work proposes novel methods for constructing KG from unlabeled large-scale neuroscience research corpus utilizing large language models (LLM), neuroscience ontology, and text embeddings. We analyze the semantic relevance of neuroscience text segments identified by LLM for building the knowledge graph. We also introduce an entity-augmented information retrieval algorithm to extract knowledge from the KG. Several experiments were conducted to evaluate the proposed approaches, and the results demonstrate that our methods significantly enhance knowledge discovery from the unlabeled neuroscience research corpus. It achieves an F1 score of 0.84 for entity extraction, and the knowledge obtained from the KG improves answers to over 54% of the questions.
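
A rough illustration of entity-augmented retrieval over a knowledge graph: the toy sketch below scores (head, relation, tail) triples by mixing bag-of-words text similarity with query-entity overlap. The triples, entity sets, and the mixing weight `alpha` are invented for illustration; this is not the paper's algorithm, only the general shape of the idea.

```python
from collections import Counter
import math

kg = [
    ("hippocampus", "involved_in", "memory consolidation"),
    ("dopamine", "modulates", "reward signaling"),
    ("amygdala", "involved_in", "fear processing"),
]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, query_entities: set, alpha: float = 0.5):
    q_vec = Counter(query.lower().split())
    scored = []
    for h, r, t in kg:
        sim = cosine(q_vec, Counter(f"{h} {r} {t}".lower().split()))
        # Entity overlap between the query's entities and the triple's nodes.
        overlap = len(query_entities & {h, t}) / max(len(query_entities), 1)
        scored.append((alpha * sim + (1 - alpha) * overlap, (h, r, t)))
    return max(scored)[1]  # best-scoring triple

print(retrieve("what brain region supports memory", {"hippocampus"}))
```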

[NLP-3] MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

[Quick Read] This paper addresses the underexploited expressive capacity of visual information in semantic retrieval under interleaved multi-condition scenarios: existing datasets are limited to a single language, a single image, or a single retrieval condition, so models struggle to capture the specific conditional elements of a query. The key to the solution is the proposed Coral framework, which adapts pre-trained MLLMs by combining embedding reconstruction, to preserve fine-grained conditional elements, with contrastive learning, to extract comprehensive global semantics, thereby improving performance on interleaved multi-condition semantic retrieval.

Link: https://arxiv.org/abs/2506.03144
Authors: Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li
Affiliations: ByteDance Inc.; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Preprint; Project Page, Code, and Dataset at: this https URL

Abstract:Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify existing models’ limitation: focusing solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.

[NLP-4] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

[Quick Read] This paper addresses visual grounding for VLM-powered GUI agents, i.e., identifying the screen region in which to execute an action based on the visual content and a textual plan. Existing methods cast this as text-based coordinate generation, which suffers from weak spatial-semantic alignment, an inability to handle ambiguous supervision targets, and a mismatch between dense screen coordinates and the coarse granularity of visual features. The key to the solution is GUI-Actor, whose core is an attention-based action head that aligns a dedicated ACTOR token with all relevant visual patch tokens, so that one or more action regions can be proposed in a single forward pass; a grounding verifier then evaluates and selects the most plausible region, yielding better performance and generalization.

Link: https://arxiv.org/abs/2506.03143
Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
Affiliations: Microsoft; Nanjing University; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated ACTOR token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
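
A minimal PyTorch sketch of an attention-based action head in this spirit: a dedicated ACTOR query attends over visual patch tokens, and the attention weights are read out as a distribution over candidate action regions. Dimensions and names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, actor_token: torch.Tensor, patch_tokens: torch.Tensor):
        # actor_token: (batch, d_model); patch_tokens: (batch, n_patches, d_model)
        q = self.q_proj(actor_token).unsqueeze(1)        # (B, 1, D)
        k = self.k_proj(patch_tokens)                    # (B, N, D)
        attn = (q @ k.transpose(1, 2)) * self.scale      # (B, 1, N)
        return attn.softmax(dim=-1).squeeze(1)           # (B, N): weight per patch

head = ActionHead()
weights = head(torch.randn(2, 256), torch.randn(2, 196, 256))
print(weights.shape, weights.sum(dim=-1))  # torch.Size([2, 196]), sums ~1.0
```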

[NLP-5] Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

[Quick Read] This paper addresses the joint optimization of code generation and unit test generation, in particular how to co-evolve the two through a reinforcement learning framework without any ground-truth code as supervision. The key to the solution is a dedicated reward design that optimizes based on the interaction outcomes between the coding and unit-test-generation capabilities, allowing the unit tester to learn directly from the coder's mistakes and improving overall performance.

Link: https://arxiv.org/abs/2506.03136
Authors: Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Project: this https URL

Abstract:We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder’s mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: this https URL
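
To make "reward from interaction outcomes" concrete, the sketch below derives rewards from a pass/fail matrix between sampled candidate programs and sampled unit tests: the coder is rewarded for passing tests, the tester for tests that split the candidates. These reward shapes are assumptions for illustration; the paper's actual reward design differs in detail.

```python
import numpy as np

# passes[i, j] = 1 if candidate program i passes generated unit test j
passes = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
])

coder_reward = passes.mean(axis=1)        # per-program pass rate
test_pass_rate = passes.mean(axis=0)      # per-test pass rate
# A test is most informative when it splits the candidates, i.e. its
# pass rate is far from both 0 and 1.
tester_reward = 1.0 - np.abs(2 * test_pass_rate - 1)

print("coder rewards :", coder_reward)    # [0.75 0.25 1.  ]
print("tester rewards:", tester_reward.round(2))  # [0.   0.67 0.67 0.67]
```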

[NLP-6] OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

[Quick Read] This paper addresses the shortcomings of current vision-language models (VLMs) in spatial reasoning, especially their limitations on complex spatial relations. Existing work focuses on basic spatial-relation tasks such as left/right, near/far, and object counting, which represent only the most elementary level of spatial reasoning. To fill this gap, the authors propose OmniSpatial, a comprehensive and challenging spatial reasoning benchmark grounded in cognitive psychology that covers four major categories, dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. The key is constructing large-scale, high-quality question-answer pairs via Internet data crawling and careful manual annotation to comprehensively evaluate and advance VLMs' spatial reasoning.

Link: https://arxiv.org/abs/2506.03135
Authors: Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Affiliations: Tsinghua University; Xi'an Jiaotong University; Shanghai Jiao Tong University; Galbot; Peking University; Shanghai Qi Zhi Institute; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs’ understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.

[NLP-7] AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation ICML’2025

[Quick Read] This paper addresses the difficulty of efficient analog circuit topology synthesis, given the vast design search space and strict constraint adherence. The key is AUTOCIRCUIT-RL, a reinforcement learning (RL)-based framework that operates in two phases: instruction tuning, in which a large language model (LLM) learns to generate constraint-compliant circuit topologies from structured prompts, and RL refinement, which uses reward models that evaluate validity, efficiency, and output voltage to further improve the quality and consistency of the generated circuits.

Link: https://arxiv.org/abs/2506.03122
Authors: Prashanth Vijayaraghavan, Luyao Shi, Ehsan Degan, Vandana Mukherjee, Xin Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 Pages (Content), 4 Pages (Appendix), 7 figures, ICML'2025

Abstract:Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework’s effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.

[NLP-8] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

[Quick Read] This paper addresses three key problems encountered when applying reinforcement learning (RL) with only numerical feedback to large language models (LLMs): performance plateaus, limited effectiveness of self-reflection, and persistent failures. The key is to introduce natural language feedback in the form of critiques to guide effective policy optimization. The proposed Critique-GRPO framework combines natural language and numerical feedback so that LLMs learn simultaneously from initial responses and critique-guided refinements, improving overall performance. Experiments show the method significantly outperforms supervised and RL-based fine-tuning across several challenging tasks.

Link: https://arxiv.org/abs/2506.03106
Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chao Yang, Helen Meng
Affiliations: The Chinese University of Hong Kong, HCCL; Cambridge University; The Chinese University of Hong Kong, MMLab; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 38 pages

Abstract:Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.
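
Critique-GRPO builds on the group-relative advantage used in GRPO-style training. The sketch below shows that normalization with a group that mixes initial samples and critique-guided refinements; the rewards are made-up numbers, and the grouping choice is an assumption for illustration rather than the paper's exact recipe.

```python
import numpy as np

def group_advantages(rewards):
    """Standard group z-score over the rewards of one prompt's responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Rewards for G initial samples to one prompt...
initial = [0.0, 1.0, 0.0, 0.0]
# ...plus critique-guided refinements of the failed samples.
refined = [1.0, 1.0]

# Both kinds of responses enter one group, so the policy is updated on
# its own attempts and on the critique-guided corrections simultaneously.
print(group_advantages(initial + refined).round(2))  # [-1.  1. -1. -1.  1.  1.]
```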

[NLP-9] Beyond Text Compression: Evaluating Tokenizers Across Scales ACL2025

[Quick Read] This paper addresses how to evaluate the impact of tokenizer choice on language model performance, especially the challenge of measuring tokenizer quality in multilingual settings. The key is to use small models to accurately predict, at a fraction of the compute cost, how larger models will respond to different tokenizers, and to combine several intrinsic metrics into a reliable tokenizer evaluation framework, making tokenizer selection in future language model development more efficient and accurate.

Link: https://arxiv.org/abs/2506.03101
Authors: Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
Affiliations: Apple
Categories: Computation and Language (cs.CL)
Comments: ACL 2025

Abstract:The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf’s law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our work offers a more efficient path to informed tokenizer selection in future language model development.
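
As a flavor of what a Zipf-inspired intrinsic metric can look like, the sketch below fits the slope of log-frequency versus log-rank over a tokenized corpus; an ideally Zipfian token distribution gives a slope near -1. The paper defines its own metrics; this estimator and the toy corpus are purely illustrative.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var  # ~ -1 for ideally Zipfian token distributions

corpus = "the cat sat on the mat and the dog sat on the log".split()
print(f"fitted Zipf slope: {zipf_slope(corpus):.2f}")
```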

[NLP-10] Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds

[Quick Read] This paper addresses the lack of theoretical analysis of retrieval-augmented generation (RAG), in particular finite-sample generalization bounds and the bias-variance tradeoff. The key is a framework that views retrieved texts as query-dependent noisy in-context examples, recovering classical in-context learning (ICL) and standard RAG as limiting cases. This reveals an intrinsic ceiling on RAG's generalization error and, by introducing uniform and non-uniform RAG noise, models retrieval from both the training data and external corpora.

Link: https://arxiv.org/abs/2506.03100
Authors: Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang
Affiliations: University of Wisconsin-Madison; Salesforce AI Research; The University of Hong Kong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)
Comments: Under Review

Abstract:Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers the classical in-context learning (ICL) and standard RAG as the limit cases. Our analysis suggests that an intrinsic ceiling on generalization error exists on RAG as opposed to the ICL. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common QA benchmarks, such as Natural Questions and TriviaQA.
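
The bias-variance story can be simulated directly in the paper's in-context linear regression setting. In the toy sketch below, clean in-context examples (ICL) drive the least-squares error toward zero as the example count k grows, while a fixed fraction of mis-retrieved examples (drawn from an unrelated task, a stand-in for retrieval noise) leaves a persistent error floor, the kind of intrinsic ceiling the theory describes. All distributions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w_true, w_wrong = np.ones(d), -np.ones(d)

def error(k, p_bad, trials=300):
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(k, d))
        bad = rng.random(k) < p_bad                 # mis-retrieved examples
        w_per = np.where(bad[:, None], w_wrong, w_true)
        y = (X * w_per).sum(axis=1) + rng.normal(scale=0.1, size=k)
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        errs.append(float(np.sum((w_hat - w_true) ** 2)))
    return np.mean(errs)

for k in (10, 100, 1000):
    print(f"k={k:5d}  ICL err={error(k, 0.0):.3f}  noisy-RAG err={error(k, 0.3):.3f}")
```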

[NLP-11] Literary Evidence Retrieval via Long-Context Language Models ACL2025

[Quick Read] This paper examines how well modern long-context language models (LMs) understand literary fiction, using a literary evidence retrieval task to assess their analytical ability. The key is a benchmark in which the full text of a primary source (e.g., The Great Gatsby) is given to the model together with a passage of literary criticism that is missing a quotation from that work; the model must generate the missing quotation, mirroring the human process of literary analysis, which requires both global narrative reasoning and close reading. Extensive filtering and human verification of the dataset ensure the task's rigor and difficulty.

Link: https://arxiv.org/abs/2506.03090
Authors: Katherine Thai, Mohit Iyyer
Affiliations: UMass Amherst; University of Maryland, College Park
Categories: Computation and Language (cs.CL)
Comments: ACL 2025

Abstract:How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5 can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.

[NLP-12] MAEBE: Multi-Agent Emergent Behavior Framework ICML2025

[Quick Read] This paper addresses the novel emergent risks that arise when multi-agent AI ensembles interact, risks that traditional AI safety evaluations of isolated large language models (LLMs) cannot effectively identify. The proposed solution is the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework, whose key is the systematic assessment of emergent behavior in multi-agent systems. Using the Greatest Good Benchmark and a novel double-inversion question technique, it reveals the brittleness of LLM moral preferences under different question framings, the unpredictability of ensemble moral reasoning, and the new safety and alignment challenges posed by group dynamics.

Link: https://arxiv.org/abs/2506.03053
Authors: Sinem Erisken (Independent Researcher), Timothy Gothard (Independent Researcher), Martin Leitgab (Independent Researcher), Ram Potham (Independent Researcher)
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Preprint. This work has been submitted to the Multi-Agent Systems Workshop at ICML 2025 for review

Abstract:Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.

[NLP-13] Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs

[Quick Read] This paper addresses the insufficient factual accuracy of large language models (LLMs) in education, particularly in non-English settings. The key is an evaluation of the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students, revealing that the models provide extraneous and less truthful information and exacerbate existing biases against rare languages.

Link: https://arxiv.org/abs/2506.03051
Authors: Yuval Kansal, Shmuel Berman, Lydia Liu
Affiliations: Princeton University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.

[NLP-14] Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

[Quick Read] This paper addresses the limitations faced by reinforcement learning (RL) in enhancing large language models (LLMs) for complex, long-chain-of-thought (long-CoT) reasoning. The key is an analysis of the VAPO framework's inherent challenges in credit assignment, value function representational capacity, and translating global value signals into local policy improvements, particularly under sparse rewards, which limit its ability to model and exploit deep, long-term value.

Link: https://arxiv.org/abs/2506.03038
Authors: Jintian Shao, Yiming Cheng
Affiliations: Southern University of Science and Technology; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) enhances large language models (LLMs) in complex, long-chain-of-thought (long-CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms like Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long-term value for fine-grained, step-by-step policy guidance in extended reasoning chains. We argue these limitations stem from inherent difficulties in credit assignment, value function representational capacity with temporally abstracted goals, and translating global value signals into local policy improvements, especially with sparse rewards. Our theoretical analysis examines these aspects to illuminate VAPO’s boundaries in long-term value modeling, aiming to deepen understanding of current RL for advanced reasoning and suggest future research for more robust LLM agents.

[NLP-15] Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning INTERSPEECH2025

[Quick Read] This paper addresses how to improve spoken language understanding (SLU) systems when resources are limited. The key is to use information retrieval (IR) methods for example selection when building enhanced prompts, which significantly improves SLU performance without increasing prompt length.

Link: https://arxiv.org/abs/2506.03035
Authors: Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Conference paper accepted to INTERSPEECH 2025

Abstract:Understanding user queries is fundamental in many applications, such as home assistants, booking systems, or recommendations. Accordingly, it is crucial to develop accurate Spoken Language Understanding (SLU) approaches to ensure the reliability of the considered system. Current State-of-the-Art SLU techniques rely on large amounts of training data; however, only limited annotated examples are available for specific tasks or languages. In the meantime, instruction-tuned large language models (LLMs) have shown exceptional performance on unseen tasks in a few-shot setting when provided with adequate prompts. In this work, we propose to explore example selection by leveraging Information retrieval (IR) approaches to build an enhanced prompt that is applied to an SLU task. We evaluate the effectiveness of the proposed method on several SLU benchmarks. Experimental results show that lexical IR methods significantly enhance performance without increasing prompt length.
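
A minimal sketch of the core idea: choose few-shot examples for the prompt by lexical similarity between the user query and a pool of annotated SLU examples. Plain TF-IDF cosine stands in here for the IR methods compared in the paper; the example pool and intent labels are made up.

```python
import math
from collections import Counter

pool = [
    ("turn on the kitchen lights", "smart_home.switch_on"),
    ("book a table for two tonight", "booking.restaurant"),
    ("play some jazz in the living room", "media.play_music"),
    ("switch off all the lights", "smart_home.switch_off"),
]

# Document frequency of each token over the example pool.
df = Counter(t for utt, _ in pool for t in set(utt.split()))

def tfidf(text, df, n_docs):
    tf = Counter(text.split())
    return {t: c * math.log(n_docs / (1 + df[t])) for t, c in tf.items()}

def select_examples(query, k=2):
    q = tfidf(query, df, len(pool))
    def cos(v):
        dot = sum(q.get(t, 0.0) * x for t, x in v.items())
        nq = math.sqrt(sum(x * x for x in q.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nq * nv) if nq and nv else 0.0
    ranked = sorted(pool, key=lambda ex: cos(tfidf(ex[0], df, len(pool))), reverse=True)
    return ranked[:k]

shots = select_examples("please turn the bedroom lights on")
print("\n".join(f"Utterance: {u}\nIntent: {i}" for u, i in shots))
```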

[NLP-16] Coding Agents with Multimodal Browsing are Generalist Problem Solvers

[Quick Read] This paper addresses the lack of cross-task generalization in current AI agents, which excel in specific domains but rarely beyond them. The core issue is that existing agents rely on highly specialized tools or architectural decisions, confining their performance to preset scenarios. The key to the solution is to build a generalist agent with a modest set of general tools: code editing and execution, web search, multimodal web browsing, and file access. The resulting agent, OpenHands-Versa, achieves performance superior or comparable to state-of-the-art specialized agents across multiple diverse and challenging benchmarks, demonstrating the effectiveness of a general tool set for high cross-task performance.

Link: https://arxiv.org/abs/2506.03011
Authors: Aditya Bharat Soni, Boxuan Li, Xingyao Wang, Valerie Chen, Graham Neubig
Affiliations: Carnegie Mellon University; Independent; All Hands AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.

[NLP-17] Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech

[Quick Read] This paper addresses how to condition large language models (LLMs) on legal systems at different levels of abstraction to improve their ability to detect potentially punishable hate speech. The key is to explore and compare how legal knowledge at different levels of abstraction, both abstract and concrete, affects model performance, with the aim of improving accuracy and consistency on legal assessment tasks.

Link: https://arxiv.org/abs/2506.03009
Authors: Florian Ludwig, Torsten Zesch, Frederike Zufall
Affiliations: ZITiS (Central Office for Information Technology in the Security Sector); CATALPA; FernUniversität in Hagen; Karlsruhe Institute of Technology; Waseda Institute for Advanced Study
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The assessment of legal problems requires the consideration of a specific legal system and its levels of abstraction, from constitutional law to statutory law to case law. The extent to which Large Language Models (LLMs) internalize such legal systems is unknown. In this paper, we propose and investigate different approaches to conditioning LLMs at multiple levels of abstraction in legal systems to detect potentially punishable hate speech. We focus on the task of classifying whether a specific social media post falls under the criminal offense of incitement to hatred as prescribed by the German Criminal Code. The results show that there is still a significant performance gap between models and legal experts in the legal assessment of hate speech, regardless of the level of abstraction with which the models were conditioned. Our analysis revealed that models conditioned on abstract legal knowledge lacked deep task understanding, often contradicting themselves and hallucinating answers, while models using concrete legal knowledge performed reasonably well in identifying relevant target groups, but struggled with classifying target conducts.

[NLP-18] A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems ACL2025

[Quick Read] This paper addresses the inconsistent performance of privacy policy question answering (Privacy Policy QA) systems across English dialects, which disadvantages speakers of non-standard varieties. The key is a multi-agent framework inspired by human-centered design principles: a Dialect Agent translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent refines predictions with domain expertise. The method requires no retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains.

Link: https://arxiv.org/abs/2506.02998
Authors: Đorđe Klisura, Astrid R Bernaga Torres, Anna Karen Gárate-Escamilla, Rajesh Roshan Biswal, Ke Yang, Hilal Pataci, Anthony Rios
Affiliations: University of Texas at San Antonio; Tecnológico de Monterrey
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main Conference

Abstract:Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini’s zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.
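
The two-agent pipeline is straightforward to wire up. The schematic below shows the data flow; `call_llm` is an unimplemented stand-in for any chat-model API, and the prompts are illustrative, not the paper's.

```python
def call_llm(system: str, user: str) -> str:
    """Stand-in for any chat-model API call; not implemented here."""
    raise NotImplementedError("plug in your chat model here")

def dialect_agent(query: str) -> str:
    # Rewrite into SAE while preserving the dialectal intent.
    return call_llm(
        "Rewrite the user's question into Standard American English. "
        "Preserve the original intent; do not add or remove content.",
        query,
    )

def privacy_policy_agent(sae_query: str, policy: str) -> str:
    # Refine the answer with domain expertise, grounded in the policy text.
    return call_llm(
        "You are an expert on privacy policies. Answer strictly from the policy text.",
        f"Policy:\n{policy}\n\nQuestion: {sae_query}",
    )

def answer(query: str, policy: str) -> str:
    return privacy_policy_agent(dialect_agent(query), policy)
```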

[NLP-19] It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems ACL2025

[Quick Read] This paper addresses the challenge of idiom translation across languages, particularly in speech-to-text translation (SLT) systems. Although modern machine translation (MT) systems have made remarkable progress on conventional news text, idiom translation remains difficult, and research on it in SLT is notably sparse. The key is a systematic evaluation of how different systems (end-to-end SLT systems, MT systems, and large language models) handle idiom translation, revealing a pronounced performance drop for SLT systems, which often fall back on literal translations, whereas MT systems and large language models handle idioms better. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

Link: https://arxiv.org/abs/2506.02995
Authors: Iuliia Zaitova, Badr M. Abdullah, Wei Xue, Dietrich Klakow, Bernd Möbius, Tania Avgustinova
Affiliations: Saarland University; Xi'an Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 3 figures, ACL 2025

Abstract:Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

[NLP-20] Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation

[Quick Read] This paper addresses hallucination, ungrounded persuasion, and the failure to use provided factual bases effectively or to abstain when arguments are untenable in LLM-based legal argument generation. The key is a reflective multi-agent approach that employs specialized agents, a Factor Analyst and an Argument Polisher, in an iterative refinement process to generate 3-ply legal arguments (plaintiff, defendant, rebuttal), improving abstention, hallucination accuracy, and reliance on the provided case facts.

Link: https://arxiv.org/abs/2506.02992
Authors: Li Zhang, Kevin D. Ashley
Affiliations: University of Pittsburgh
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures, Workshop on Legally Compliant Intelligent Chatbots @ ICAIL 2025

Abstract:Large Language Models (LLMs) are increasingly explored for legal argument generation, yet they pose significant risks of manipulation through hallucination and ungrounded persuasion, and often fail to utilize provided factual bases effectively or abstain when arguments are untenable. This paper introduces a novel reflective multi-agent method designed to address these challenges in the context of legally compliant persuasion. Our approach employs specialized agents–a Factor Analyst and an Argument Polisher–in an iterative refinement process to generate 3-ply legal arguments (plaintiff, defendant, rebuttal). We evaluate Reflective Multi-Agent against single-agent, enhanced-prompt single-agent, and non-reflective multi-agent baselines using four diverse LLMs (GPT-4o, GPT-4o-mini, Llama-4-Maverick-17b-128e, Llama-4-Scout-17b-16e) across three legal scenarios: “arguable”, “mismatched”, and “non-arguable”. Results demonstrate Reflective Multi-Agent’s significant superiority in successful abstention (preventing generation when arguments cannot be grounded), marked improvements in hallucination accuracy (reducing fabricated and misattributed factors), particularly in “non-arguable” scenarios, and enhanced factor utilization recall (improving the use of provided case facts). These findings suggest that structured reflection within a multi-agent framework offers a robust computable method for fostering ethical persuasion and mitigating manipulation in LLM-based legal argumentation systems, a critical step towards trustworthy AI in law. Project page: this https URL

[NLP-21] Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

[Quick Read] This paper assesses whether large language models (LLMs) are capable of supporting education in general practice, specifically whether leading LLMs can answer Royal College of General Practitioners (RCGP)-style examination questions. The key is that the four leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) were asked to answer 100 multiple-choice questions randomly selected from RCGP GP SelfTest, including textual information, laboratory results, and clinical images, to evaluate their performance in realistic clinical contexts.

Link: https://arxiv.org/abs/2506.02987
Authors: Richard Armitage
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 1 Table

Abstract:Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Member of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.

[NLP-22] Towards a Japanese Full-duplex Spoken Dialogue System INTERSPEECH2025

[Quick Read] This paper addresses the lack of research and development on full-duplex spoken dialogue systems for the Japanese language. The key is to build the first publicly available Japanese model on top of Moshi, an English full-duplex dialogue model, trained in two stages: pre-training on large-scale Japanese spoken dialogue data followed by fine-tuning on high-quality stereo spoken dialogue data, with synthetic dialogue data generated by a multi-stream text-to-speech system further improving performance.

Link: https://arxiv.org/abs/2506.02979
Authors: Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Abstract:Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversations such as speech overlaps and backchannels, have attracted significant attention recently. However, the study of full-duplex spoken dialogue systems for the Japanese language has been limited, and the research on their development in Japanese remains scarce. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese, which is built upon Moshi, a full-duplex dialogue model in English. Our model is trained through a two-stage process: pre-training on a large-scale spoken dialogue data in Japanese, followed by fine-tuning on high-quality stereo spoken dialogue data. We further enhance the model’s performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system. Evaluation experiments demonstrate that the trained model outperforms Japanese baseline models in both naturalness and meaningfulness.

[NLP-23] Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

[Quick Read] This paper addresses the factually inconsistent outputs, commonly called "hallucinations", that large language models (LLMs) produce during generation. Existing methods mainly intervene at the input or output level, overlooking the model's intrinsic information refinement process and the role of premature (early) layers. The key of the proposed PLI (Premature Layers Interpolation) is to insert premature layers formed by mathematical interpolation between adjacent layers, extending the depth of information processing and transmission and improving factual coherence. Experiments show that PLI effectively reduces hallucinations and outperforms existing baselines in most cases.

Link: https://arxiv.org/abs/2506.02973
Authors: Dingwei Chen, Ziqiang Liu, Feiteng Fang, Chak Tou Leong, Shiwen Ni, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang, Chengming Li
Affiliations: Sun Yat-Sen University; Shenzhen MSU-BIT University; The Hong Kong Polytechnic University; UNSW, Sydney, NSW 2052, Australia; Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as ‘‘hallucinations’’, remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs’ internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
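
The core operation is easy to sketch: build an extra "premature" layer by linearly interpolating the parameters of two adjacent blocks and insert it between them, with no training. Which pair of layers to interpolate and the coefficient alpha are choices the method must make; the toy stack of linear blocks below merely stands in for transformer layers.

```python
import copy
import torch

def interpolate_layer(layer_a, layer_b, alpha: float = 0.5):
    """New layer whose parameters are a convex mix of two adjacent layers."""
    new_layer = copy.deepcopy(layer_a)
    sd_a, sd_b = layer_a.state_dict(), layer_b.state_dict()
    new_layer.load_state_dict(
        {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    )
    return new_layer

# Toy "model": a stack of linear blocks standing in for transformer layers.
layers = torch.nn.ModuleList([torch.nn.Linear(8, 8) for _ in range(4)])
mid = interpolate_layer(layers[1], layers[2])
layers.insert(2, mid)  # ... layer1, interpolated layer, layer2 ...
print(len(layers), "layers after insertion")
```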

[NLP-24] FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

[Quick Read] This paper addresses the data scarcity and restricted access to domain-specific sensitive information caused by the reliance of large language model (LLM) development on massive public data. The key is a federated learning (FL) framework that enables decentralized fine-tuning of pre-trained LLMs without sharing raw data, improving domain adaptation while preserving data privacy.

Link: https://arxiv.org/abs/2506.02961
Authors: Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane
Affiliations: Flower Labs; University of Cambridge; Entrust Corp; Zhejiang University; Penn State University; Alibaba Group; University of Nevada, Reno; Gachon University; The University of Melbourne; The University of Adelaide; Sony AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely under explored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.

[NLP-25] HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring

[Quick Read] This paper explores fine-grained machine-generated text (MGT) detection under human-AI coauthoring; traditional methods focus on binary, document-level detection and neglect texts composed jointly by human and AI contributions. The key is the HACo-Det dataset, produced by an automatic pipeline that generates human-AI coauthored texts with word-level attribution labels, together with the retrofitting of seven prevailing document-level detectors to word-level detection, opening a path toward coauthored text detection with a numeric AI ratio.

Link: https://arxiv.org/abs/2506.02959
Authors: Zhixiong Su, Yichen Wang, Herun Wan, Zhaohan Zhang, Minnan Luo
Affiliations: Xi'an Jiaotong University; University of Chicago; Queen Mary University of London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The misuse of large language models (LLMs) poses potential risks, motivating the development of machine-generated text (MGT) detection. Existing literature primarily concentrates on binary, document-level detection, thereby neglecting texts that are composed jointly by human and LLM contributions. Hence, this paper explores the possibility of fine-grained MGT detection under human-AI coauthoring. We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio. Specifically, we propose a dataset, HACo-Det, which produces human-AI coauthored texts via an automatic pipeline with word-level attribution labels. We retrofit seven prevailing document-level detectors to generalize them to word-level detection. Then we evaluate these detectors on HACo-Det on both word- and sentence-level detection tasks. Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score, while finetuned models show superior performance and better generalization across domains. However, we argue that fine-grained co-authored text detection is far from solved. We further analyze factors influencing performance, e.g., context window, and highlight the limitations of current methods, pointing to potential avenues for improvement.

[NLP-26] Adaptive Graph Pruning for Multi-Agent Communication

[Quick Read] This paper addresses the poor adaptability of existing large language model (LLM)-based multi-agent systems to varying task complexity; current methods rely on a fixed number of agents and static communication structures, limiting flexibility and efficiency. The key is Adaptive Graph Pruning (AGP), a framework that jointly optimizes the number of agents (hard-pruning) and the communication topology (soft-pruning) to achieve task-adaptive multi-agent collaboration.

Link: https://arxiv.org/abs/2506.02951
Authors: Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang
Affiliations: Zhejiang University - University of Illinois Urbana-Champaign Institute; Zhejiang University, College of Computer Science and Technology
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: firstly, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and consistently generalizing across multiple mainstream LLM architectures, with an increase in performance of 2.58%–9.84%; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with an extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, having fewer training steps and token consumption at the same time, with a decrease in token consumption of 90%+; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods. The performance will surpass the existing baselines after about ten steps of training under six benchmarks.
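
A toy sketch of the two pruning operations: starting from a complete communication graph over candidate agents, soft-pruning drops weak edges and hard-pruning drops agents left isolated. The edge scores here are made-up constants; in AGP they come from trained pruning networks.

```python
import numpy as np

n = 4  # candidate agents, fully connected to begin with
edge_scores = np.array([
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.8, 0.1],
    [0.1, 0.8, 0.0, 0.1],
    [0.2, 0.1, 0.1, 0.0],
])

adj = (edge_scores >= 0.5).astype(int)   # soft-pruning: drop weak edges
active = adj.sum(axis=1) > 0             # hard-pruning: drop isolated agents

print("communication topology:\n", adj[np.ix_(active, active)])
print("agents kept:", np.flatnonzero(active).tolist())  # [0, 1, 2]
```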

[NLP-27] Quantitative LLM Judges

[Quick Read] This paper addresses how to improve the accuracy and reliability of large language models (LLMs) when they evaluate the outputs of other LLMs. The key is quantitative LLM judges: regression models that align the scores of existing LLM judges with human scores in a given domain. Trained on the judge's textual evaluation and score, they improve on the original judge, and the framework's four judges for different types of absolute and relative feedback demonstrate its generality and versatility.

Link: https://arxiv.org/abs/2506.02945
Authors: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
Affiliations: University of Massachusetts Amherst; Adobe Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge’s textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
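
The post-hoc alignment step is essentially regression. The sketch below fits a least-squares model from features of an existing judge's output (its numeric score plus simple features of its textual critique) to human scores, then uses the fit as the "quantitative judge". The features and numbers are toy stand-ins, not the paper's feature set.

```python
import numpy as np

# (judge_score, critique_length, mentions_error) -> human score
X = np.array([
    [4.0, 120, 0],
    [2.0, 300, 1],
    [5.0,  80, 0],
    [3.0, 220, 1],
    [1.0, 350, 1],
])
y = np.array([4.5, 1.5, 5.0, 3.0, 1.0])

X1 = np.hstack([X, np.ones((len(X), 1))])      # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # fit the alignment regression

def quantitative_judge(judge_score, critique_len, mentions_error):
    return float(np.dot(coef, [judge_score, critique_len, mentions_error, 1.0]))

print(f"aligned score: {quantitative_judge(3.5, 150, 0):.2f}")
```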

[NLP-28] INESC-ID @ eRisk 2025: Exploring Fine-Tuned Similarity-Based and Prompt-Based Approaches to Depression Symptom Identification

[Quick Read] This paper addresses the task of retrieving, from a given set of sentences, those relevant to each depression symptom in Beck's Depression Inventory - II (BDI). The key is to frame the task as a binary classification problem for each BDI symptom and to explore foundation model fine-tuning, sentence similarity, large language model (LLM) prompting, and ensemble techniques. Results show that fine-tuned foundation models perform best, particularly when enhanced with synthetic data to mitigate class imbalance, and that the optimal approach varies by symptom. Based on these insights, the authors devised five independent test runs, two using ensemble methods, which achieved the highest scores in the official Information Retrieval (IR) evaluation.

Link: https://arxiv.org/abs/2506.02924
Authors: Diogo A.P. Nunes, Eugénio Ribeiro
Affiliations: INESC-ID Lisboa; Instituto Superior Técnico, Universidade de Lisboa; Instituto Universitário de Lisboa (ISCTE-IUL)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 12 pages, 1 figure, 6 tables

Abstract:In this work, we describe our team’s approach to eRisk’s 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck’s Depression Inventory - II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled as to whether a given sentence was relevant or not w.r.t. one of BDI’s symptoms. Due to this labeling limitation, we framed our development as a binary classification task for each BDI symptom, and evaluated accordingly. To that end, we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from 16 other teams.
zh
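
A minimal sketch of the development setup described above: one binary classifier per BDI symptom, with unlabeled sentences ranked by predicted relevance and the top-k submitted. The data, symptom names, and TF-IDF/logistic-regression pipeline are toy stand-ins for the fine-tuned foundation models actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_sents = ["i feel sad all the time", "went for a run today",
               "i can't sleep at night", "lunch with friends was fun"]
labels = {"sadness": [1, 0, 0, 0], "insomnia": [0, 0, 1, 0]}  # per-symptom labels

pool = ["everything feels hopeless", "my new bike arrived",
        "another sleepless night again"]

vec = TfidfVectorizer().fit(train_sents + pool)
X_train, X_pool = vec.transform(train_sents), vec.transform(pool)

for symptom, y in labels.items():
    clf = LogisticRegression().fit(X_train, y)   # one classifier per symptom
    scores = clf.predict_proba(X_pool)[:, 1]     # relevance per sentence
    ranked = sorted(zip(scores, pool), reverse=True)
    print(symptom, "->", [s for _, s in ranked[:1000]])  # top-k submission
```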

[NLP-29] A Controllable Examination for Long-Context Language Models

【Quick Read】: This paper addresses the limitations of existing evaluation frameworks for long-context language models (LCLMs), which fall into real-world and synthetic tasks. Real-world tasks are hard to interpret because of their complexity and data contamination, while the needle-in-a-haystack (NIAH) format common in synthetic tasks fails to model realistic applications because the "needle" and the "haystack" lack coherence. The key to the proposed solution is a new benchmark with seamless context, a controllable setting, and sound evaluation: LongBioBench, which uses artificially generated biographies as a controlled environment to assess LCLMs along the dimensions of understanding, reasoning, and trustworthiness, striking a better balance between mirroring real tasks and maintaining controllability.

Link: https://arxiv.org/abs/2506.02921
Authors: Yijun Yang,Zeyu Huang,Wenhao Zhu,Zihan Qiu,Fei Yuan,Jeff Z. Pan,Ivan Titov
Affiliations: University of Edinburgh (爱丁堡大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Nanjing University (南京大学); Qwen Team, Alibaba Group (通义实验室,阿里巴巴集团); University of Amsterdam (阿姆斯特丹大学)
Subjects: Computation and Language (cs.CL)
Comments: Preprint

View abstract

Abstract:Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the “needle” and the “haystack” compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: \textit{seamless context}, \textit{controllable setting}, and \textit{sound evaluation}. This study introduces \textbf{LongBioBench}, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of \textit{understanding}, \textit{reasoning}, and \textit{trustworthiness}. Our experimental evaluation, which includes \textbf{18} LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, make them unreliable tests of models' long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embedding to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.
zh

[NLP-30] Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning

【Quick Read】: This paper tackles cell type annotation in single-cell RNA sequencing data, particularly the challenges of accounting for batch-level cellular context and ensuring label uniqueness. Existing foundation models typically annotate cells independently, without batch-level context or explanatory reasoning. The key to the solution is the proposed CellPuzzles task together with the Cell-o1 model, which is trained via supervised fine-tuning followed by reinforcement learning with batch-level rewards, improving batch-level annotation performance and enabling expert-like reasoning.

Link: https://arxiv.org/abs/2506.02911
Authors: Yin Fang,Qiao Jin,Guangzhi Xiong,Bowen Jin,Xianrui Zhong,Siru Ouyang,Aidong Zhang,Jiawei Han,Zhiyong Lu
Affiliations: National Institutes of Health (国家卫生研究院); University of Virginia (弗吉尼亚大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 28 pages; 16 tables; 7 figures; Code: this https URL

View abstract

Abstract:Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI’s o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at this https URL.
zh
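
The batch-level reward central to Cell-o1's RL stage can be pictured as an all-or-nothing signal over a whole batch, with a uniqueness constraint on the assigned labels. The function below is a plausible minimal rendering of that idea, not the authors' implementation.

```python
# Sketch of a batch-level reward in the spirit of CellPuzzles: the model
# must assign a *unique* cell type to every cell in the batch, and the
# reward is only granted when the whole batch is solved consistently.
def batch_reward(predicted: list[str], gold: list[str]) -> float:
    # Label uniqueness: no cell type may be reused within the batch.
    if len(set(predicted)) != len(predicted):
        return 0.0
    # All-or-nothing batch accuracy, mirroring batch-level evaluation.
    return 1.0 if predicted == gold else 0.0

print(batch_reward(["T cell", "B cell"], ["T cell", "B cell"]))  # 1.0
print(batch_reward(["T cell", "T cell"], ["T cell", "B cell"]))  # 0.0
```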

[NLP-31] IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator ACL2025

【Quick Read】: This paper addresses the lack of reference-free evaluation for automatic grammatical error correction (GEC), i.e., how to assess GEC systems accurately when no gold references are available. The key to the solution is improving IMPARA, an existing automatic GEC evaluation method, by building its quality estimator from a pre-trained language model with enhanced grammatical error detection (GED) capabilities, yielding evaluations that correlate more closely with human sentence-level judgments.

Link: https://arxiv.org/abs/2506.02899
Authors: Yusuke Sakai,Takumi Goto,Taro Watanabe
Affiliations: Nara Institute of Science and Technology (奈良先端科学技術大学院大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Findings

View abstract

Abstract:We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.
zh

[NLP-32] A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation INTERSPEECH2025

【Quick Read】: This paper addresses the underrepresentation of German dialects in current automatic speech recognition (ASR) research, focusing on recognition robustness between Standard German and the Franconian, Bavarian, and Alemannic dialects of Southeast Germany. The key to the solution is Betthupferl, an evaluation dataset containing four hours of dialectal read speech and half an hour of Standard German speech, with parallel dialectal and Standard German transcriptions, providing a foundation for studying how well models handle dialectal variation.

Link: https://arxiv.org/abs/2506.02894
Authors: Verena Blaschke,Miriam Winkler,Constantin Förster,Gabriele Wenger-Glemser,Barbara Plank
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

View abstract

Abstract:Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.
zh

[NLP-33] Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

【Quick Read】: This paper investigates how to scale Large Language Models (LLMs) efficiently, in particular by improving the Mixture of Experts (MoE) architecture to enhance convergence and performance. The key to the solution is a fine-grained MoE approach, using more numerous but smaller experts to optimize training behavior and downstream performance at large parameter scales.

Link: https://arxiv.org/abs/2506.02890
Authors: Jakub Krajewski,Marcin Chochowski,Daniel Korzekwa
Affiliations: NVIDIA(英伟达); IDEAS NCBR, University of Warsaw(华沙大学IDEAS NCBR)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches - utilizing more numerous, smaller experts - have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its scaling properties against standard MoE configurations for models with up to 56B total (17B active) parameters. We investigate convergence speed, model performance on downstream benchmarks, and practical training considerations across various setups. Overall, at the largest scale we show that fine-grained MoE achieves better validation loss and higher accuracy across a set of downstream benchmarks. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in the development of future large-scale models.
zh
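
To make the "more numerous, smaller experts" idea concrete, here is a toy fine-grained MoE layer with top-k token routing in PyTorch. The dimensions, expert count, and naive dispatch loop are illustrative only; production MoE layers use batched dispatch and load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=32, d_expert=32, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Many small two-layer experts instead of a few large FFNs.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # naive (non-batched) dispatch
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(8, 64)
print(FineGrainedMoE()(x).shape)   # torch.Size([8, 64])
```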

[NLP-34] CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective

【Quick Read】: This paper addresses the debate over whether the reasoning exhibited by Large Language Models under Chain-of-Thought (CoT) prompting constitutes genuine abstract reasoning. It offers a theoretical counter-argument: CoT does not elicit true abstract reasoning but imposes a structural constraint that guides the model to imitate the form of reasoning. The key insight is that, by forcing the generation of intermediate steps, CoT leverages the model's strong sequence prediction and pattern matching, constraining outputs to sequences that resemble coherent thought and thereby improving multi-step inference performance.

Link: https://arxiv.org/abs/2506.02878
Authors: Jintian Shao,Yiming Cheng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model's immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes.
zh

[NLP-35] Token and Span Classification for Entity Recognition in French Historical Encyclopedias

【Quick Read】: This paper aims to address the challenges of named entity recognition (NER) in historical texts, including non-standardized language, archaic orthography, and nested or overlapping entities. The key to the solution is framing NER as both token-level and span-level classification to accommodate the complex nested entity structures common in historical documents. The study also evaluates the emerging potential of few-shot prompting with generative language models in low-resource scenarios, showing that generative models are a promising alternative when labeled data is scarce.

Link: https://arxiv.org/abs/2506.02872
Authors: Ludovic Moncla,Hédi Zeghidi
Affiliations: INSA Lyon (INSA里昂); CNRS (国家科学研究中心); Université Claude Bernard Lyon 1 (里昂第一大学); LIRIS (信息与智能系统实验室)
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

View abstract

Abstract:Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art performance, especially on nested entities, generative models offer promising alternatives when labeled data are scarce. The study highlights ongoing challenges in historical NER and suggests avenues for hybrid approaches combining symbolic and neural methods to better capture the intricacies of early modern French text.
zh

[NLP-36] Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

【Quick Read】: This paper seeks to explain the poorly understood internal reasoning mechanisms of large reasoning models (LRMs), examining their reasoning trajectories from an information-theoretic perspective. The key is the discovery and use of an "MI peaks" phenomenon: during reasoning, the mutual information (MI) between intermediate representations and the correct answer rises sharply at specific generation steps, and these peaks typically correspond to "thinking tokens" expressing reflection or transition (such as "Hmm", "Wait", and "Therefore"). The study shows that these thinking tokens are crucial to the model's reasoning performance while other tokens have minimal impact, so carefully exploiting them can effectively improve LRM reasoning.

Link: https://arxiv.org/abs/2506.02867
Authors: Chen Qian,Dongrui Liu,Haochen Wen,Zhen Bai,Yong Liu,Jing Shao
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University College London, University of London (伦敦大学学院,伦敦大学); Dalian University of Technology (大连理工大学)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review

View abstract

Abstract:Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the reasoning trajectories of LRMs from an information-theoretic perspective. By tracking how mutual information (MI) between intermediate representations and the correct answer evolves during LRM reasoning, we observe an interesting MI peaks phenomenon: the MI at specific generative steps exhibits a sudden and significant increase during LRM’s reasoning process. We theoretically analyze this phenomenon and show that as MI increases, the probability of the model's prediction error decreases. Furthermore, these MI peaks often correspond to tokens expressing reflection or transition, such as "Hmm", "Wait", and "Therefore", which we term the thinking tokens. We then demonstrate that these thinking tokens are crucial for LRM’s reasoning performance, while other tokens have minimal impact. Building on these analyses, we propose two simple yet effective methods to improve LRM’s reasoning performance by delicately leveraging these thinking tokens. Overall, our work provides novel insights into the reasoning mechanisms of LRMs and offers practical ways to improve their reasoning capabilities. The code is available at this https URL.
zh
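
The MI-tracking analysis can be approximated in miniature: across many traces, discretize a scalar probe of the hidden state at each generation step and measure its mutual information with answer correctness; steps where MI spikes would play the role of the paper's thinking tokens. Everything below is simulated stand-in data, not real model activations.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n_samples, n_steps = 500, 20
correct = rng.integers(0, 2, size=n_samples)      # 0/1 answer correctness

# Simulated probe values; step 12 is made informative on purpose so a
# clear "MI peak" appears there.
probe = rng.normal(size=(n_samples, n_steps))
probe[:, 12] += 2.0 * correct

mi_per_step = []
for t in range(n_steps):
    # Discretize the probe into quartile bins before estimating MI.
    bins = np.digitize(probe[:, t], np.quantile(probe[:, t], [0.25, 0.5, 0.75]))
    mi_per_step.append(mutual_info_score(correct, bins))

print(int(np.argmax(mi_per_step)))  # -> 12, the injected MI peak
```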

[NLP-37] TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference

【Quick Read】: This paper addresses the difficulty that existing self-taught-reasoning approaches have in identifying optimal dialogue trajectories and avoiding task-irrelevant questions during multi-turn dialogue. The key to the solution is TO-GATE, a framework that enhances question generation through trajectory optimization and consists of two core components: a clarification resolver that generates optimal questioning trajectories, and a summarizer that ensures task-aligned final responses. Trajectory optimization enables the model to produce effective elicitation questions and summary responses tailored to specific tasks.

Link: https://arxiv.org/abs/2506.02827
Authors: Yulin Dou,Jiangming Liu
Affiliations: Yunnan University (云南大学); Yunnan Key Laboratory of Intelligent Systems and Computing (云南智能系统与计算重点实验室)
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Large language models (LLMs) can effectively elicit human preferences through multi-turn dialogue. Complex tasks can be accomplished through iterative clarifying questions and final responses generated by an LLM acting as a questioner (STaR-GATE; Andukuri et al., 2024). However, existing approaches based on self-taught reasoning struggle to identify optimal dialogue trajectories and avoid irrelevant questions to the tasks. To address this limitation, we propose TO-GATE, a novel framework that enhances question generation through trajectory optimization, which consists of two key components: a clarification resolver that generates optimal questioning trajectories, and a summarizer that ensures task-aligned final responses. The trajectory optimization enables the model to produce effective elicitation questions and summary responses tailored to specific tasks. Experimental results demonstrate that TO-GATE significantly outperforms baseline methods, achieving a 9.32% improvement on standard preference elicitation tasks.
zh

[NLP-38] ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations ACL

【Quick Read】: This paper addresses the high parameter counts of Large Language Models (LLMs), aiming to reduce the number of parameters via structured matrix representations and thereby lower compute and memory requirements. However, representing a pretrained model's weight matrices exactly with structured matrices without fine-tuning is unrealistic. The key to the solution is exploiting the fact that LLM outputs are invariant under certain orthogonal transformations of the weight matrices; this property can be used to find transformations that substantially improve the compressibility of the weights within structured classes. The method applies to any type of structured matrix that supports efficient projection operations.

Link: https://arxiv.org/abs/2506.02818
Authors: Ekaterina Grishina,Mikhail Gorbunov,Maxim Rakhuba
Affiliations: HSE University (高等经济大学)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by ACL Findings

View abstract

Abstract:Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at this https URL
zh
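
The invariance the method relies on is easy to verify numerically. With an elementwise nonlinearity, any permutation matrix (a special case of an orthogonal transform) can be absorbed into a pair of adjacent weight matrices without changing the network's function, since relu(Pz) = P relu(z). This toy check only illustrates the principle; the paper searches over richer orthogonal transformations to improve structured compressibility.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 7, 3
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0)
P = np.eye(d_hidden)[rng.permutation(d_hidden)]   # orthogonal: P @ P.T = I

y_original = W2 @ relu(W1 @ x)
# Rotate the hidden dimension: absorb P into W1 and P.T into W2.
y_transformed = (W2 @ P.T) @ relu((P @ W1) @ x)

print(np.allclose(y_original, y_transformed))     # True
```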

[NLP-39] SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

【Quick Read】: This paper addresses the poor performance of vision-language models (VLMs) at detecting hidden content in optical illusions or AI-generated images, a challenging task that requires perceptual adjustments such as zooming. The key to the solution is SemVink (Semantic Visual Thinking): scaling images down to low resolutions (32-128 pixels) removes redundant visual noise and raises accuracy to 99%. This exposes an architectural flaw in VLMs: an over-reliance on high-level semantics at the expense of the low-level visual operations needed for real-world robustness, motivating hybrid models with multi-scale processing.

Link: https://arxiv.org/abs/2506.02803
Authors: Sifan Li,Yujun Cai,Yiwei Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking) by simply scaling images to low resolutions (32-128 pixels), which unlocks 99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
zh
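
Since the reported trick is simply aggressive downscaling before querying the VLM, a preprocessing sketch is almost trivial; the VLM call itself is left abstract (the vlm.ask helper below is hypothetical).

```python
from PIL import Image

def semvink_preprocess(path: str, target: int = 64) -> Image.Image:
    """Downscale into the 32-128 px range so low-frequency structure
    (e.g. hidden text in an optical illusion) dominates the input."""
    img = Image.open(path).convert("RGB")
    return img.resize((target, target), resample=Image.LANCZOS)

# Hypothetical usage with some VLM client:
# small = semvink_preprocess("illusion.png", target=64)
# answer = vlm.ask(small, "What text is hidden in this image?")
```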

[NLP-40] Rethinking Machine Unlearning in Image Generation Models CCS2025

【Quick Read】: This paper targets problems in image generation model unlearning (IGMU): unclear task discrimination, missing unlearning guidelines, the lack of an effective evaluation framework, and unreliable evaluation metrics, all of which hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. The key to the solution is three core contributions: CatIGMU, a hierarchical task categorization framework that provides detailed implementation guidance for IGMU; EvalIGMU, a comprehensive evaluation framework covering five critical aspects; and DataIGM, a high-quality unlearning dataset for extensive IGMU evaluation, training content detectors, and benchmarking state-of-the-art unlearning algorithms. Together, these enable comprehensive understanding, standardized categorization, and reliable evaluation of IGMU.

Link: https://arxiv.org/abs/2506.02761
Authors: Renyang Liu,Wenjie Feng,Tianwei Zhang,Wei Zhou,Xueqi Cheng,See-Kiong Ng
Affiliations: National University of Singapore(新加坡国立大学); University of Science and Technology of China(中国科学技术大学); Nanyang Technological University(南洋理工大学); Yunnan University(云南大学); Chinese Academy of Sciences(中国科学院)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM CCS 2025

View abstract

Abstract:With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions, to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at this https URL.
zh

[NLP-41] Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs

【Quick Read】: This paper addresses the lack of fine-grained analysis in assessing second-language (L2) vocabulary use: traditional approaches focus on context-independent or part-of-speech (PoS) related word usage rather than the precise use of words within a sentence. The key to the solution is combining large language models (LLMs) with the English Vocabulary Profile (EVP): by exploiting the semantic information available to LLMs, proficiency levels can be assigned to individual words as they appear in learner writing, yielding a finer-grained and more accurate vocabulary assessment.

Link: https://arxiv.org/abs/2506.02758
Authors: Stefano Bannò,Kate Knill,Mark Gales
Affiliations: ALTA Institute, Department of Engineering, University of Cambridge (UK)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the 20th Workshop on Innovative Use of NLP for Building Educational Applications

View abstract

Abstract:Vocabulary use is a fundamental aspect of second language (L2) proficiency. To date, its assessment by automated systems has typically examined the context-independent, or part-of-speech (PoS) related use of words. This paper introduces a novel approach to enable fine-grained vocabulary evaluation exploiting the precise use of words within a sentence. The scheme combines large language models (LLMs) with the English Vocabulary Profile (EVP). The EVP is a standard lexical resource that enables in-context vocabulary use to be linked with proficiency level. We evaluate the ability of LLMs to assign proficiency levels to individual words as they appear in L2 learner writing, addressing key challenges such as polysemy, contextual variation, and multi-word expressions. We compare LLMs to a PoS-based baseline. LLMs appear to exploit additional semantic information that yields improved performance. We also explore correlations between word-level proficiency and essay-level proficiency. Finally, the approach is applied to examine the consistency of the EVP proficiency levels. Results show that LLMs are well-suited for the task of vocabulary assessment.
zh

[NLP-42] Multi-task Learning with Active Learning for Arabic Offensive Speech Detection

【Quick Read】: This paper targets the difficulty of detecting offensive speech in Arabic social media text, which is complicated by limited labeled data, dialectal variation, and the language's inherent complexity. The key to the solution is combining multi-task learning (MTL) with active learning: jointly training on two auxiliary tasks, violent and vulgar speech, to improve offensive-language detection accuracy, and dynamically adjusting task weights during training to balance each task's contribution. In addition, uncertainty sampling selects the most informative samples for model training, and weighted emoji handling better captures semantic cues, enabling efficient and accurate detection in resource-constrained settings.

Link: https://arxiv.org/abs/2506.02753
Authors: Aisha Alansari,Hamzah Luqman
Affiliations: King Fahd University of Petroleum and Minerals (法赫德国王石油矿产大学); SDAIA-KFUPM Joint Research Center for Artificial Intelligence (SDAIA-法赫国王石油矿产大学人工智能联合研究中心)
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:The rapid growth of social media has amplified the spread of offensive, violent, and vulgar speech, which poses serious societal and cybersecurity concerns. Detecting such content in Arabic text is particularly complex due to limited labeled data, dialectal variations, and the language’s inherent complexity. This paper proposes a novel framework that integrates multi-task learning (MTL) with active learning to enhance offensive speech detection in Arabic social media text. By jointly training on two auxiliary tasks, violent and vulgar speech, the model leverages shared representations to improve the detection accuracy of the offensive speech. Our approach dynamically adjusts task weights during training to balance the contribution of each task and optimize performance. To address the scarcity of labeled data, we employ an active learning strategy through several uncertainty sampling techniques to iteratively select the most informative samples for model training. We also introduce weighted emoji handling to better capture semantic cues. Experimental results on the OSACT2022 dataset show that the proposed framework achieves a state-of-the-art macro F1-score of 85.42%, outperforming existing methods while using significantly fewer fine-tuning samples. The findings of this study highlight the potential of integrating MTL with active learning for efficient and accurate offensive language detection in resource-constrained settings.
zh
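
A minimal uncertainty-sampling loop of the kind the paper combines with multi-task learning might look as follows (least-confidence variant). The toy English data and the keyword "oracle" merely simulate human annotation; the real system uses Arabic text and transformer encoders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["insulting rant", "have a nice day", "violent threat", "lovely weather"]
y = [1, 0, 1, 0]
unlabeled = ["you are awful", "great match yesterday", "i will hurt you", "nice shoes"]

vec = TfidfVectorizer().fit(labeled + unlabeled)

for round_ in range(2):                       # two active-learning rounds
    clf = LogisticRegression().fit(vec.transform(labeled), y)
    proba = clf.predict_proba(vec.transform(unlabeled))
    uncertainty = 1.0 - proba.max(axis=1)     # least-confidence score
    pick = int(np.argmax(uncertainty))        # most informative sample
    print(f"round {round_}: annotate -> {unlabeled[pick]!r}")
    # Simulated human annotation (a keyword oracle, for the demo only).
    y.append(1 if "hurt" in unlabeled[pick] or "awful" in unlabeled[pick] else 0)
    labeled.append(unlabeled.pop(pick))
```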

[NLP-43] Stereotypical gender actions can be extracted from Web text

【Quick Read】: This paper investigates how to extract actions associated with gender stereotypes from natural text and use them to augment commonsense knowledge repositories. The key to the solution is using Open Mind Common Sense (OMCS) together with gender information from Twitter and Web corpora, combined with pronoun/name gender heuristics, to compute the gender bias of actions and thereby identify gender-skewed commonsense actions. The results correlate strongly with a human gold standard, validating the approach.

Link: https://arxiv.org/abs/2506.02740
Authors: Amaç Herdağdelen,Marco Baroni
Affiliations: New England Complex Systems Institute, Cambridge, USA(新英格兰复杂系统研究所,美国剑桥); CIMeC, University of Trento(心智与大脑科学中心,特伦托大学)
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:We extracted gender-specific actions from text corpora and Twitter, and compared them to stereotypical expectations of people. We used Open Mind Common Sense (OMCS), a commonsense knowledge repository, to focus on actions that are pertinent to common sense and daily life of humans. We use the gender information of Twitter users and Web-corpus-based pronoun/name gender heuristics to compute the gender bias of the actions. With high recall, we obtained a Spearman correlation of 0.47 between corpus-based predictions and a human gold standard, and an area under the ROC curve of 0.76 when predicting the polarity of the gold standard. We conclude that it is feasible to use natural text (and a Twitter-derived corpus in particular) in order to augment commonsense repositories with the stereotypical gender expectations of actions. We also present a dataset of 441 commonsense actions with human judges’ ratings on whether the action is typically/slightly masculine/feminine (or neutral), and another larger dataset of 21,442 actions automatically rated by the methods we investigate in this study.
zh

[NLP-44] RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

【Quick Read】: This paper addresses the problems of insufficient accuracy, weak domain-specific reasoning, and poor interpretability of Large Language Models (LLMs) in vertical domains. Traditional preference alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. The proposed solution is RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), whose key is constructing a binary preference dataset that incorporates external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligning the model with the DPO algorithm. The core innovation lies in the preference-data construction strategy: AI-driven retrieval provides factual grounding to improve knowledgeability and accuracy, while domain-specific CoT optimization treats the reasoning process itself as a key preference dimension.

Link: https://arxiv.org/abs/2506.02726
Authors: Qihang Yan,Xinyu Zhang,Luming Guo,Qi Zhang,Feifan Liu
Affiliations: ShanghaiTech University (上海科技大学); Henan University (河南大学); Liaoning University of Traditional Chinese Medicine (辽宁中医药大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs’ knowledge application, reasoning reliability, and process transparency in complex vertical domains.
zh
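
One way to picture the preference-data construction: the "chosen" response grounds an explicit chain of thought in retrieved evidence, while the "rejected" one answers directly. The field names and retrieve() helper below are illustrative assumptions in the standard DPO pair format, not the authors' code.

```python
def retrieve(query: str) -> list[str]:        # hypothetical retriever
    return ["Ginseng is traditionally used to replenish qi."]

def build_preference_pair(question: str, direct_answer: str,
                          grounded_cot: str, final_answer: str) -> dict:
    evidence = "\n".join(retrieve(question))
    return {
        "prompt": question,
        # Preferred: explicit, evidence-grounded reasoning chain.
        "chosen": (f"Evidence:\n{evidence}\n\nReasoning:\n{grounded_cot}"
                   f"\n\nAnswer: {final_answer}"),
        # Dispreferred: same task, no retrieval and no reasoning chain.
        "rejected": direct_answer,
    }

pair = build_preference_pair(
    question="What is ginseng used for in TCM?",
    direct_answer="It is a common herb.",
    grounded_cot="The retrieved source states ginseng replenishes qi; "
                 "in TCM this maps to treating fatigue and qi deficiency.",
    final_answer="Primarily to replenish qi, e.g. for fatigue.")
print(pair["chosen"])
```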

[NLP-45] Benchmarking and Advancing Large Language Models for Local Life Services KDD2025

【Quick Read】: This paper explores the potential of Large Language Models (LLMs) in local life services and establishes a comprehensive benchmark to systematically evaluate different LLMs on relevant tasks. The key to the solution is two main optimization approaches: model fine-tuning and agent-based workflows. Experiments show that even a relatively compact 7B model can reach performance comparable to a 72B model, effectively balancing inference cost against capability; this optimization greatly improves the feasibility and efficiency of deploying LLMs in real-world online services.

Link: https://arxiv.org/abs/2506.02720
Authors: Xiaochong Lan,Jie Feng,Jiahuan Lei,Xinlei Shi,Yong Li
Affiliations: Tsinghua University (清华大学); BNRist (北京信息科学与技术国家研究中心); Meituan (美团)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: KDD 2025

View abstract

Abstract:Large language models (LLMs) have exhibited remarkable capabilities and achieved significant breakthroughs across various domains, leading to their widespread adoption in recent years. Building on this progress, we investigate their potential in the realm of local life services. In this study, we establish a comprehensive benchmark and systematically evaluate the performance of diverse LLMs across a wide range of tasks relevant to local life services. To further enhance their effectiveness, we explore two key approaches: model fine-tuning and agent-based workflows. Our findings reveal that even a relatively compact 7B model can attain performance levels comparable to a much larger 72B model, effectively balancing inference cost and model capability. This optimization greatly enhances the feasibility and efficiency of deploying LLMs in real-world online services, making them more practical and accessible for local life applications.
zh

[NLP-46] Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation ICIP2025

【Quick Read】: This paper addresses the lack of interpretability of Vision Language Models (VLMs) in image scoring: the model must produce not only a score but also a natural-language justification. The key to the solution is a self-training method that requires no external data or models, using only an existing image-scoring dataset and an instruction-tuned VLM; by iteratively training on the VLM's own generated text and applying Direct Preference Optimization, the method improves both scoring accuracy and the coherence of the generated explanations.

Link: https://arxiv.org/abs/2506.02708
Authors: Naoto Tanji,Toshihiko Yamasaki
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ICIP2025

View abstract

Abstract:Image scoring is a crucial task in numerous real-world applications. To trust a model’s judgment, understanding its rationale is essential. This paper proposes a novel training method for Vision Language Models (VLMs) to generate not only image scores but also corresponding justifications in natural language. Leveraging only an image scoring dataset and an instruction-tuned VLM, our method enables self-training, utilizing the VLM’s generated text without relying on external data or models. In addition, we introduce a simple method for creating a dataset designed to improve alignment between predicted scores and their textual justifications. By iteratively training the model with Direct Preference Optimization on two distinct datasets and merging them, we can improve both scoring accuracy and the coherence of generated explanations.
zh

[NLP-47] On Entity Identification in Language Models ACL2025

【Quick Read】: This paper examines how the internal representations of language models (LMs) identify and distinguish mentions of named entities, in particular the many-to-many correspondence between entities and their mentions. The key to the solution is a framework analogous to clustering-quality metrics: by analyzing the clustering behavior of LM internal representations, it quantifies whether mentions of the same entity cluster together and mentions of different entities remain separated, thereby assessing entity identification and discrimination. Experiments show that Transformer-based autoregressive models identify and distinguish entities with high precision- and recall-like scores, and that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers.

Link: https://arxiv.org/abs/2506.02701
Authors: Masaki Sakata,Sho Yokoi,Benjamin Heinzerling,Takumi Ito,Kentaro Inui
Affiliations: Tohoku University (东北大学); RIKEN (理化学研究所); NINJAL (日本国立语言学研究机构); Langsmith Inc. (朗思科技公司); MBZUAI (穆巴达拉科学技术大学)
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Findings; 26 pages, 13 figures, 9 tables

View abstract

Abstract:We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions – ambiguity and variability – and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.
zh
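
The clustering-quality view can be sketched with pairwise precision ("same cluster implies same entity", the ambiguity side) and pairwise recall ("same entity implies same cluster", the variability side) over mention vectors. Random separable vectors stand in for LM hidden states here.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
gold = np.array([0, 0, 0, 1, 1, 2, 2, 2])            # entity id per mention
vecs = rng.normal(size=(8, 16)) + gold[:, None] * 3  # separable toy embeddings

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vecs)

same_cluster = same_entity = both = 0
for i, j in combinations(range(len(gold)), 2):
    sc, se = pred[i] == pred[j], gold[i] == gold[j]
    same_cluster += sc
    same_entity += se
    both += sc and se

print("pairwise precision:", both / same_cluster)   # ambiguity side
print("pairwise recall:   ", both / same_entity)    # variability side
```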

[NLP-48] MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

【Quick Read】: This paper addresses the difficulty of acquiring high-quality fine-tuning data for large models during instruction tuning, a challenge stemming from the difficulty of data collection and high production costs. The proposed solution is MASTER, whose key is augmenting original data through interactions among multiple agents with different cognitive levels, simulating three pedagogically grounded teaching scenarios to generate high-quality teacher-student interaction data.

Link: https://arxiv.org/abs/2506.02689
Authors: Liang Yue,Yihong Tang,Kehai Chen,Jie Liu,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区)
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models’ instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models’ reasoning abilities in complex tasks, providing valuable insights for future research.
zh

[NLP-49] Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints

【Quick Read】: This paper targets the challenges that LLM-based agents face in planning tasks; existing planning methods suffer from two key limitations, heavy constraints and cascading errors. The key to the solution is a new parallel planning paradigm, Decompose, Plan in Parallel, and Merge (DPPM), which decomposes a complex task into subtasks according to its constraints, generates subplans in parallel, and merges them into a final global plan, with a verification and refinement module for error correction and conflict resolution.

Link: https://arxiv.org/abs/2506.02683
Authors: Zhengdong Lu,Weikai Lu,Yiling Tao,Yun Dai,ZiXuan Chen,Huiping Zhuang,Cen Chen,Hao Peng,Ziqian Zeng
Affiliations: South China University of Technology (华南理工大学); Beihang University (北京航空航天大学)
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Despite significant advances in Large Language Models (LLMs), planning tasks still present challenges for LLM-based agents. Existing planning methods face two key limitations: heavy constraints and cascading errors. To address these limitations, we propose a novel parallel planning paradigm, which Decomposes, Plans for subtasks in Parallel, and Merges subplans into a final plan (DPPM). Specifically, DPPM decomposes the complex task based on constraints into subtasks, generates the subplan for each subtask in parallel, and merges them into a global plan. In addition, our approach incorporates a verification and refinement module, enabling error correction and conflict resolution. Experimental results demonstrate that DPPM significantly outperforms existing methods in travel planning tasks.
zh
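
The DPPM control flow reduces to a small skeleton: decompose by constraint, plan subtasks concurrently, then merge with a verification pass. The llm() stub below is a stand-in for an actual model call.

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:                      # hypothetical LLM call
    return f"<plan for: {prompt}>"

def decompose(task: str, constraints: list[str]) -> list[str]:
    # One subtask per constraint, in the spirit of DPPM's decomposition.
    return [f"{task} [respecting: {c}]" for c in constraints]

def merge(subplans: list[str]) -> str:
    # Merge subplans into a global plan; conflict resolution is delegated
    # to the model here, standing in for the verification/refinement module.
    draft = "\n".join(subplans)
    return llm(f"Merge these subplans, resolving conflicts:\n{draft}")

task = "Plan a 3-day trip to Kyoto"
constraints = ["budget under $500", "vegetarian meals", "no flights after 8pm"]

subtasks = decompose(task, constraints)
with ThreadPoolExecutor() as pool:                # plan subtasks in parallel
    subplans = list(pool.map(llm, subtasks))

print(merge(subplans))
```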

[NLP-50] TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

【Quick Read】: This paper addresses the inefficiency of Large Language Models (LLMs) during inference, especially the redundant reasoning produced with extremely long outputs. The key to the solution is a dynamic ratio-based training pipeline that continuously balances the weights of the model's System-1 and System-2 data to eliminate redundant reasoning while preserving reasoning capability. The method requires no sophisticated data annotation or interpolation between multiple models, and experiments show it reduces output tokens by nearly 40% without hurting reasoning accuracy.

Link: https://arxiv.org/abs/2506.02678
Authors: Zhong-Zhi Li,Xiao Liang,Zihao Tang,Lei Ji,Peijie Wang,Haotian Xu,Xing W,Haizhen Huang,Weiwei Deng,Ying Nian Wu,Yeyun Gong,Zhijiang Guo,Xiao Liu,Fei Yin,Cheng-Lin Liu
Affiliations: Chinese Academy of Sciences (中国科学院); University of California, Los Angeles (加利福尼亚大学洛杉矶分校); Tsinghua University (清华大学); Microsoft (微软); Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
Subjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
Comments:

View abstract

Abstract:Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning–especially during inference with extremely long outputs–has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model’s System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model’s reasoning capability. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.
zh
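
A toy rendering of the dynamic-ratio idea: each round, re-weight how often short System-1 answers versus long System-2 reasoning traces are sampled, decaying toward concision while keeping a floor so reasoning data is never dropped entirely. The schedule below is invented for illustration; the paper balances the weights based on training dynamics.

```python
import random

random.seed(0)
system1 = ["short answer A", "short answer B"]       # concise supervision
system2 = ["long CoT trace A", "long CoT trace B"]   # verbose supervision

w2 = 0.8                                             # initial System-2 weight
for round_ in range(4):
    # Sample a training batch according to the current mixture weights.
    batch = random.choices(
        population=system1 + system2,
        weights=[1 - w2] * len(system1) + [w2] * len(system2),
        k=6)
    long_frac = sum(b.startswith("long") for b in batch) / len(batch)
    print(f"round {round_}: w2={w2:.2f}, long-trace fraction={long_frac:.2f}")
    w2 = max(0.2, w2 * 0.7)   # decay toward shorter outputs, keep a floor
```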

[NLP-51] EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

【Quick Read】: This paper addresses the insufficient evaluation of the learning capability and efficiency of large language models (LLMs), particularly the lack of systematic assessment of model potential on challenging tasks. The key to the solution is the EvaLearn benchmark, which contains 648 challenging problems grouped into 182 sequences, each dedicated to one task type; models must solve problems sequentially, leveraging experience from earlier solutions to improve later performance. EvaLearn provides five comprehensive automated metrics to quantify learning capability and efficiency, reveals varied learning behavior across models, and explores how instance-level rubrics and teacher-model feedback facilitate model learning.

Link: https://arxiv.org/abs/2506.02672
Authors: Shihan Dou,Ming Zhang,Chenhao Huang,Jiayi Chen,Feng Chen,Shichun Liu,Yan Liu,Chenxiao Liu,Cheng Zhong,Zongzhang Zhang,Tao Gui,Chao Xin,Wei Chengzhi,Lin Yan,Qi Zhang,Xuanjing Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 47 pages, 24 figures

View abstract

Abstract:We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.
zh

[NLP-52] Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs

【Quick Read】: This paper investigates whether persona-assigned Large Language Models (LLMs) maintain consistency across tasks and runs, i.e., whether a model given the same persona produces coherent and consistent responses. The key to the solution is a new standardized framework that evaluates personas across four categories (happiness, occupation, personality, and political stance) and multiple task dimensions (survey writing, essay generation, social media post generation, and single- and multi-turn conversations), enabling systematic analysis of the factors that influence consistency, such as the assigned persona, stereotypes, and model design choices.

Link: https://arxiv.org/abs/2506.02659
Authors: Manon Reusens,Bart Baesens,David Jurgens
Affiliations: KU Leuven (天主教鲁汶大学); University of Southampton (南安普顿大学); University of Michigan (密歇根大学)
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Personalized Large Language Models (LLMs) are increasingly used in diverse applications, where they are assigned a specific persona - such as a happy high school teacher - to guide their responses. While prior research has examined how well LLMs adhere to predefined personas in writing style, a comprehensive analysis of consistency across different personas and task types is lacking. In this paper, we introduce a new standardized framework to analyze consistency in persona-assigned LLMs. We define consistency as the extent to which a model maintains coherent responses when assigned the same persona across different tasks and runs. Our framework evaluates personas across four different categories (happiness, occupation, personality, and political stance) spanning multiple task dimensions (survey writing, essay generation, social media post generation, single turn, and multi-turn conversations). Our findings reveal that consistency is influenced by multiple factors, including the assigned persona, stereotypes, and model design choices. Consistency also varies across tasks, increasing with more structured tasks and additional context. All code is available on GitHub.
zh

[NLP-53] Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning INTERSPEECH2025

【Quick Read】: This paper addresses the poor performance of Arabic automatic speech recognition (ASR) systems on dialectal speech, as opposed to Modern Standard Arabic (MSA). The key is fine-tuning OpenAI's Whisper with MSA data from Mozilla Common Voice and dialectal data from the MASC dataset, examining the effect of small amounts of MSA fine-tuning data, the benefit of MSA pre-training, and dialect-specific versus dialect-pooled models. The results show that properly balanced dialect-pooled data can mitigate data scarcity in low-resource ASR without significant performance loss.

Link: https://arxiv.org/abs/2506.02627
Authors: Ömer Tarik Özyilmaz,Matt Coler,Matias Valdenegro-Toro
Affiliations: Department of Internal Medicine, Division of Nephrology; Bernoulli Institute for Mathematics, Computer Science, and Artificial Intelligence; Speech Technology Lab, Campus Fryslân
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2025

View abstract

Abstract:Although commercial Arabic automatic speech recognition (ASR) systems support Modern Standard Arabic (MSA), they struggle with dialectal speech. We investigate the effect of fine-tuning OpenAI’s Whisper on five major Arabic dialects (Gulf, Levantine, Iraqi, Egyptian, Maghrebi) using Mozilla Common Voice for MSA and the MASC dataset for dialectal speech. We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models, matching larger non-fine-tuned models. While MSA pre-training shows minimal benefit, suggesting limited shared features between MSA and dialects, our dialect-pooled models perform comparably to dialect-specific ones. This indicates that pooling dialectal data, when properly balanced, can help address data scarcity in low-resource ASR without significant performance loss.
zh
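
A single fine-tuning step for Whisper with Hugging Face transformers, in the spirit of the paper's dialect adaptation, could look like the sketch below. Real training would iterate over Common Voice/MASC batches with a data collator and evaluation; the zero-valued audio and placeholder transcript are stand-ins.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One fake 16 kHz utterance and its transcript (stand-ins for real data).
audio = torch.zeros(16000).numpy()
features = processor(audio, sampling_rate=16000,
                     return_tensors="pt").input_features
labels = processor.tokenizer("نص تجريبي", return_tensors="pt").input_ids

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_features=features, labels=labels).loss  # seq2seq CE loss
loss.backward()
optimizer.step()
print(float(loss))
```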

[NLP-54] EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

【Quick Read】: This paper addresses the underexplored capabilities of Large Language Models (LLMs) in Chinese essay writing and evaluation; existing benchmarks rely on coarse-grained text-quality metrics and overlook the structural and rhetorical complexity of Chinese essays, especially across genres. The key to the solution is EssayBench, a multi-genre benchmark spanning four major genres (Argumentative, Narrative, Descriptive, and Expository), built from 728 curated and refined real-world prompts that are further divided into Open-Ended and Constrained writing scenarios, together with a fine-grained, genre-specific scoring framework for reliable evaluation of generated essays.

Link: https://arxiv.org/abs/2506.02596
Authors: Fan Gao,Dongyuan Li,Ding Xia,Fei Mi,Yasheng Wang,Lifeng Shang,Baojun Wang
Affiliations: The University of Tokyo (东京大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the \textit{Open-Ended} and \textit{Constrained} sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With EssayBench, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
zh

[NLP-55] Beyond the Surface: Measuring Self-Preference in LLM Judgments

【Quick Read】: This paper addresses the self-preference bias exhibited by large language models (LLMs) when acting as judges, i.e., their tendency to assign higher scores to their own responses. Existing methods measure this bias by comparing the scores a judge model assigns to its own responses versus other models' responses, but this conflates self-preference bias with response quality, since higher-quality responses can produce positive score differences even without bias. The key to the solution is introducing gold judgments as proxies for the actual quality of responses and proposing the DBG score, which defines self-preference bias as the difference between the judge model's scores for its own responses and the corresponding gold judgments; because gold judgments reflect true response quality, this mitigates the confounding effect of response quality on the bias measurement.

Link: https://arxiv.org/abs/2506.02592
Authors: Zhi-Yuan Chen,Hao Wang,Xinyu Zhang,Enrui Hu,Yankai Lin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at this https URL.
zh
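
The DBG score itself is a simple difference once gold judgments are available; the aggregation by mean below is an assumption about the details.

```python
import numpy as np

judge_scores_on_self = np.array([8.5, 9.0, 7.5, 8.0])  # judge rating its own responses
gold_judgments = np.array([7.0, 8.5, 7.5, 6.5])        # proxy for true quality

# Per-response bias, aggregated by mean (aggregation rule assumed here).
dbg = float(np.mean(judge_scores_on_self - gold_judgments))
print(f"DBG score: {dbg:.2f}")   # > 0 suggests self-preference bias
```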

[NLP-56] On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures ACL2025

【Quick Read】: This paper examines whether large language models (LLMs) can provide accurate information across the measurement systems used in different cultures. The core research questions are how LLMs perform under different measurement systems, which systems they default to, and whether reasoning methods can mitigate the challenges posed by data-sparse systems. The key to the solution is compiling new datasets for the experiments and applying reasoning methods such as chain-of-thought (CoT) to improve accuracy across measurement systems, at the cost of increased test-time compute and longer responses.

Link: https://arxiv.org/abs/2506.02591
Authors: Minh Duc Bui,Kyung Eun Park,Goran Glavaš,Fabian David Schmidt,Katharina von der Wense
Affiliations: Johannes Gutenberg University Mainz, Germany(美因茨约翰内斯古腾堡大学, 德国); University of Mannheim, Germany(曼海姆大学, 德国); Center For Artificial Intelligence and Data Science, University of Würzburg, Germany(人工智能与数据科学中心, 维尔茨堡大学, 德国); University of Colorado Boulder, USA(科罗拉多大学博尔德分校, 美国)
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main (Camera-Ready Version)

View abstract

Abstract:Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs’ answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.
zh

[NLP-57] Synthetic Speech Source Tracing using Metric Learning INTERSPEECH2025

【Quick Read】: This paper addresses source tracing of manipulated audio in synthetic speech, i.e., identifying the generative system behind it via speaker recognition-inspired pipelines, an area that currently lacks robust solutions. The key is evaluating two approaches, classification-based and metric-learning-based, on the MLAADv5 benchmark with ResNet and self-supervised learning (SSL) backbones. The results show ResNet performs competitively with the metric-learning approach, matching and even exceeding SSL-based systems, demonstrating ResNet's viability for source tracing and underscoring the need to optimize SSL representations for this task.

Link: https://arxiv.org/abs/2506.02590
Authors: Dimitrios Koutsianos,Stavros Zacharopoulos,Yannis Panagakis,Themos Stafylakis
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Submitted to Interspeech 2025

View abstract

Abstract:This paper addresses source tracing in synthetic speech: identifying the generative systems behind manipulated audio via speaker recognition-inspired pipelines. While prior work focuses on spoofing detection, source tracing lacks robust solutions. We evaluate two approaches: classification-based and metric-learning-based. We tested our methods on the MLAADv5 benchmark using ResNet and self-supervised learning (SSL) backbones. The results show that ResNet achieves competitive performance with the metric learning approach, matching and even exceeding SSL-based systems. Our work demonstrates ResNet’s viability for source tracing while underscoring the need to optimize SSL representations for this task. Our work bridges speaker recognition methodologies with audio forensic challenges, offering new directions for combating synthetic media manipulation.
zh

[NLP-58] Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM

【Quick Read】: This paper addresses the challenge of named entity recognition (NER) for person names in the specialized domain of Russian news texts about cultural events, using the unique SPbLitGuide dataset of Saint Petersburg event announcements spanning 1999 to 2019. The key to the solution is a comparative evaluation of diverse NER models, covering established transformer-based architectures (DeepPavlov, RoBERTa, and SpaCy) alongside recent Large Language Models (LLMs) such as GPT-3.5, GPT-4, and GPT-4o, where prompting for structured JSON output markedly improves performance: GPT-4o reaches an F1 score of 0.93 with such prompting, while GPT-4 achieves the highest precision at 0.99.

Link: https://arxiv.org/abs/2506.02589
Authors: Maria Levchenko
Affiliations: University of Bologna(博洛尼亚大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

View abstract

Abstract:This paper addresses the challenge of Named Entity Recognition (NER) for person names within the specialized domain of Russian news texts concerning cultural events. The study utilizes the unique SPbLitGuide dataset, a collection of event announcements from Saint Petersburg spanning 1999 to 2019. A comparative evaluation of diverse NER models is presented, encompassing established transformer-based architectures such as DeepPavlov, RoBERTa, and SpaCy, alongside recent Large Language Models (LLMs) including GPT-3.5, GPT-4, and GPT-4o. Key findings highlight the superior performance of GPT-4o when provided with specific prompting for JSON output, achieving an F1 score of 0.93. Furthermore, GPT-4 demonstrated the highest precision at 0.99. The research contributes to a deeper understanding of current NER model capabilities and limitations when applied to morphologically rich languages like Russian within the cultural heritage domain, offering insights for researchers and practitioners. Follow-up evaluation with GPT-4.1 (April 2025) achieves F1=0.94 for both simple and structured prompts, demonstrating rapid progress across model families and simplified deployment requirements.
zh
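
A sketch of the JSON-prompted setup that performed best (GPT-4o with an explicit JSON output instruction), using the openai Python client. The prompt wording is an illustrative reconstruction, not the author's exact prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

announcement = "Вечер поэзии: Анна Ахматова и Иосиф Бродский в зале на Фонтанке."

resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},   # force structured JSON output
    messages=[{
        "role": "user",
        "content": ("Extract all person names from the Russian text below. "
                    'Answer as JSON: {"persons": [...]}.\n\n' + announcement),
    }],
)
print(json.loads(resp.choices[0].message.content)["persons"])
```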

[NLP-59] Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning INTERSPEECH2025

【Quick Read】: This paper examines the contribution of prosodic structure in speech to comprehension, in particular how prosodic features (such as intonation, tempo, and loudness) support linguistic structure independently of lexical content. The key to the solution is using self-supervised learning (SSL) to capture the temporal structure of the acoustic correlates of prosody: the proposed Masked Prosody Model produces informative representations that predict perceptual labels involving both local information (e.g., word boundaries) and longer-term structure (e.g., emotion recognition).

Link: https://arxiv.org/abs/2506.02584
Authors: Sarenne Wallbridge,Christoph Minixhofer,Catherine Lai,Peter Bell
Affiliations: Centre for Speech Technology Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at INTERSPEECH 2025

View abstract

Abstract:People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loudness, contributes to such structure independently of the lexical content is unclear. This study leverages self-supervised learning (SSL) to examine the temporal granularity of structures in the acoustic correlates of prosody. Representations from our proposed Masked Prosody Model can predict perceptual labels dependent on local information, such as word boundaries, but provide the most value for labels involving longer-term structures, like emotion recognition. Probing experiments across various perceptual labels show strong relative gains over untransformed pitch, energy, and voice activity features. Our results reveal the importance of SSL training objective timescale and highlight the value of complex SSL-encoded structures compared to more constrained classical structures.
zh
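
为说明"掩码韵律建模"这一自监督目标的基本形态,下面给出一个极简 PyTorch 草图:对(音高、能量、语音活动)帧序列随机掩码并训练重建。模型结构与掩码比例均为假设,并非论文的 Masked Prosody Model 实现:

```python
import torch
import torch.nn as nn

class MaskedProsodyModel(nn.Module):
    """Toy masked-reconstruction model over (pitch, energy, voicing) frames."""
    def __init__(self, d_in: int = 3, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_in, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, d_in)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = x.masked_fill(mask.unsqueeze(-1), 0.0)  # hide the masked frames
        return self.head(self.encoder(self.proj(x)))

B, T = 8, 200
feats = torch.randn(B, T, 3)        # stand-in pitch / energy / voice activity
mask = torch.rand(B, T) < 0.15      # assumed masking ratio of ~15%
model = MaskedProsodyModel()
recon = model(feats, mask)
loss = (recon - feats)[mask].pow(2).mean()  # reconstruct masked frames only
loss.backward()
```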

[NLP-60] IndoSafety: Culturally Grounded Safety for LLM s in Indonesian Languages

【速读】: 该论文试图解决区域特定大型语言模型(LLMs)在文化多样性环境中安全性不足的问题,特别是在像印度尼西亚这样高度重视本地规范的地区。解决方案的关键在于构建IndoSafety,这是首个针对印度尼西亚语境的高质量、人工验证的安全评估数据集,涵盖了五种语言变体,并通过扩展现有的安全框架来建立符合印尼社会文化背景的分类体系。实验表明,基于IndoSafety进行微调能够显著提升模型的安全性,同时保持任务性能。

链接: https://arxiv.org/abs/2506.02573
作者: Muhammad Falensi Azmi,Muhammad Dehan Al Kautsar,Alfan Farizki Wicaksono,Fajri Koto
机构: Faculty of Computer Science, Universitas Indonesia (计算机科学学院,印度尼西亚大学); Department of Natural Language Processing, MBZUAI (自然语言处理系,MBZUAI)
类目: Computation and Language (cs.CL)
备注: 25 pages

点击查看摘要

Abstract:Although region-specific large language models (LLMs) are increasingly developed, their safety remains underexplored, particularly in culturally diverse settings like Indonesia, where sensitivity to local norms is essential and highly valued by the community. In this work, we present IndoSafety, the first high-quality, human-verified safety evaluation dataset tailored for the Indonesian context, covering five language varieties: formal and colloquial Indonesian, along with three major local languages: Javanese, Sundanese, and Minangkabau. IndoSafety is constructed by extending prior safety frameworks to develop a taxonomy that captures Indonesia’s sociocultural context. We find that existing Indonesian-centric LLMs often generate unsafe outputs, particularly in colloquial and local language settings, while fine-tuning on IndoSafety significantly improves safety while preserving task performance. Our work highlights the critical need for culturally grounded safety evaluation and provides a concrete step toward responsible LLM deployment in multilingual settings. Warning: This paper contains example data that may be offensive, harmful, or biased.
zh

[NLP-61] Pruning General Large Language Models into Customized Expert Models

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)因模型规模庞大而需要大量计算资源的问题,特别是在需要紧凑的专家模型以适应特定下游场景时,传统剪枝方法往往难以在保持模型性能的同时有效减少参数量。其解决方案的关键在于设计一种定制化剪枝方法(Custom Pruning, Cus-Prun),通过在“语言”、“领域”和“任务”三个维度上识别并剪除无关神经元,从而生成轻量级专家模型,且无需任何后训练过程。实验表明,Cus-Prun 在保持模型通用能力的同时显著提升了专家模型的性能。

链接: https://arxiv.org/abs/2506.02561
作者: Yirao Zhao,Guizhen Chen,Kenji Kawaguchi,Lidong Bing,Wenxuan Zhang
机构: National University of Singapore; Nanyang Technological University, Singapore; DAMO Academy, Alibaba Group, Singapore; MiroMind; Singapore University of Technology and Design
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing, yet their substantial model sizes often require substantial computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need compact expert models tailored to specific downstream scenarios. However, most existing pruning methods focus on preserving the model's general capabilities, often requiring extensive post-training or suffering from degraded performance due to coarse-grained pruning. In this work, we design a Custom Pruning method (Cus-Prun) to prune a large general model into a smaller lightweight expert model, which is positioned along the "language", "domain" and "task" dimensions. By identifying and pruning irrelevant neurons of each dimension, Cus-Prun creates expert models without any post-training. Our experiments demonstrate that Cus-Prun consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.
zh
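
下面是一个示意性的草图,展示"按维度识别并剪除无关神经元"的一种可能读法:在语言/领域/任务三类校准数据上统计神经元激活幅度,仅保留在任一维度上重要的神经元。保留比例与重要性度量均为假设,并非 Cus-Prun 的官方实现:

```python
import torch

def neuron_importance(acts: torch.Tensor) -> torch.Tensor:
    """Mean absolute activation per neuron over a calibration set."""
    return acts.abs().mean(dim=0)

def cus_prun_like_mask(acts_by_dim: dict, keep_ratio: float = 0.7) -> torch.Tensor:
    """acts_by_dim maps "language"/"domain"/"task" to (num_samples, num_neurons)
    activations collected for the target expert. A neuron survives if it is
    important on ANY dimension; the rest are pruned (assumed criterion)."""
    keep = None
    for acts in acts_by_dim.values():
        score = neuron_importance(acts)
        k = int(keep_ratio * score.numel())
        top = torch.zeros_like(score, dtype=torch.bool)
        top[score.topk(k).indices] = True
        keep = top if keep is None else keep | top
    return keep  # boolean mask over neurons; False = prune

acts = {d: torch.randn(512, 1024) for d in ("language", "domain", "task")}
mask = cus_prun_like_mask(acts)
print(f"pruned {100 * (~mask).float().mean():.1f}% of neurons")
```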

[NLP-62] Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLM s: A Mathematical Perspective

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)强化学习中的零奖励假设(Zero-Reward Assumption)问题,即在非终止动作(如中间标记生成)中无法获得任务相关的即时奖励,仅有最终标记会收到整个响应的奖励。这一假设在实践中普遍存在,因为精确的标记级奖励通常难以或无法获取。解决方案的关键在于提出轨迹策略梯度定理(Trajectory Policy Gradient Theorem),该定理表明,即使在零奖励假设成立的情况下,也可以仅使用响应级奖励模型无偏估计基于真实未知标记级奖励的策略梯度,适用于REINFORCE和Actor-Critic类算法。这一结果揭示了PPO、GRPO、ReMax和RLOO等常用方法本质上具备建模标记级奖励信号的能力,为响应级奖励方法提供了理论依据。

链接: https://arxiv.org/abs/2506.02553
作者: Shenghua He,Tian Xia,Xuan Zhou,Hui Wei
机构: Amazon(亚马逊); PAII Inc.(PAII公司); UC Merced(加州大学默塞德分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.
zh
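
轨迹策略梯度定理的核心直觉可以用几行代码说明:即便只有响应级标量奖励,将其(减去基线后)乘到每个生成 token 的对数概率上,仍可得到无偏的策略梯度估计。下面是一个极简 PyTorch 草图,奖励与基线数值仅为示例:

```python
import torch

def response_level_reinforce_loss(logps: torch.Tensor,
                                  reward: float,
                                  baseline: float = 0.0) -> torch.Tensor:
    """logps: per-token log-probabilities of the sampled response, shape
    (seq_len,). Scaling every token by the same (reward - baseline) is the
    response-level estimator the theorem shows to be unbiased."""
    return -((reward - baseline) * logps).sum()

# Toy example: a 5-token response over a 100-word vocabulary.
logits = torch.randn(5, 100, requires_grad=True)
tokens = torch.randint(0, 100, (5,))
logps = torch.log_softmax(logits, dim=-1)[torch.arange(5), tokens]
loss = response_level_reinforce_loss(logps, reward=1.0, baseline=0.3)
loss.backward()  # gradients flow to every token position
```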

[NLP-63] CoRe-MMRAG : Cross-Source Knowledge Reconciliation for Multimodal RAG ACL2025

【速读】: 该论文旨在解决多模态检索增强生成(Multimodal Retrieval-Augmented Generation, MMRAG)中的两个关键问题:参数化知识与检索知识不一致(Parametric-Retrieved Knowledge Inconsistency, PRKI)以及视觉-文本知识不一致(Visual-Textual Knowledge Inconsistency, VTKI)。为应对这些问题,作者提出了一种端到端的框架——跨源知识协调(Cross-source knowledge Reconciliation for MultiModal RAG, CoRe-MMRAG),其核心在于通过四阶段流程实现不同知识源之间的有效协调与整合,包括生成内部响应、联合相似性评估选择多模态证据、生成外部响应以及最终融合生成可靠答案。此外,专门设计的训练范式进一步提升了知识源区分、多模态融合和统一答案生成的能力。

链接: https://arxiv.org/abs/2506.02544
作者: Yang Tian,Fan Liu,Jingyuan Zhang,Victoria W.,Yupeng Hu,Liqiang Nie
机构: Shandong University (山东大学); National University of Singapore (新加坡国立大学); Independent Researcher (独立研究员); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Main

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge Reconciliation for MultiModal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6% and 9.3% performance gains on InfoSeek and Encyclopedic-VQA, respectively. We release code and data at this https URL.
zh
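
下面用一个可运行的 Python 草图勾勒 CoRe-MMRAG 的四阶段流水线;其中 `DummyMLLM` 与 `DummyRetriever` 是为演示而虚构的占位接口,相似度与提示词写法均为假设,并非作者发布的代码:

```python
class DummyMLLM:
    """Placeholder for a multimodal LLM client (hypothetical interface)."""
    def generate(self, question, image=None, context=None):
        return f"answer to {question!r} (grounded={context is not None})"
    def similarity(self, question, image, candidate):
        return len(set(question.lower().split()) & set(candidate.lower().split()))

class DummyRetriever:
    def search(self, question, image, top_k=5):
        return ["passage about cats", "passage about the Eiffel Tower height"]

def core_mmrag_answer(question, image, retriever, mllm):
    # Stage 1: internal response from parametric knowledge only.
    internal = mllm.generate(question, image)
    # Stage 2: joint similarity assessment selects the best evidence.
    evidence = max(retriever.search(question, image),
                   key=lambda c: mllm.similarity(question, image, c))
    # Stage 3: external response grounded in the selected evidence.
    external = mllm.generate(question, image, context=evidence)
    # Stage 4: reconcile both responses into one reliable answer.
    prompt = (f"Answer A (parametric): {internal}\n"
              f"Answer B (retrieved): {external}\n"
              "Reconcile A and B and give the most reliable answer.")
    return mllm.generate(prompt, image)

print(core_mmrag_answer("How tall is the Eiffel Tower?", None,
                        DummyRetriever(), DummyMLLM()))
```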

[NLP-64] Answer Convergence as a Signal for Early Stopping in Reasoning

【速读】: 该论文试图解决生成式 AI (Generative AI) 在链式思维(Chain-of-thought, CoT)提示中产生的冗长和重复输出问题,从而降低推理成本。其解决方案的关键在于识别并去除推理过程中不必要的步骤,通过三种推理时策略提升效率:基于答案一致性的提前终止、增强生成推理结束信号的概率,以及基于内部激活的监督停止方法。实验表明,这些方法在多个基准测试中显著减少了token使用量,同时保持了较高的准确性。

链接: https://arxiv.org/abs/2506.02536
作者: Xin Liu,Lu Wang
机构: University of Michigan(密歇根大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, thus increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we start with a systematic study to examine what is the minimum reasoning required for a model to reach a stable decision. We find that on math reasoning tasks, models typically converge to their final answers after 60% of the reasoning steps, suggesting substantial redundancy in the remaining content. Based on these insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weight LLMs show that our methods significantly reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40% while further improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.
zh
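
其中"基于答案一致性的提前终止"可以用很少的代码说明:持续从已生成的推理片段中抽取中间答案,一旦连续若干片段答案不再变化即停止解码。下面的 Python 草图中,答案抽取的正则与 `patience` 阈值均为示意性假设:

```python
import re

ANSWER_RE = re.compile(r"answer is\s*([-\w\.]+)")

def extract_answer(partial_cot: str):
    """Pull the most recent intermediate answer from the reasoning so far."""
    hits = ANSWER_RE.findall(partial_cot)
    return hits[-1] if hits else None

def should_stop(chunks: list, patience: int = 2) -> bool:
    """Stop once the extracted answer is unchanged for `patience`
    consecutive chunks (the regex and threshold are illustrative)."""
    answers = [extract_answer("".join(chunks[: i + 1])) for i in range(len(chunks))]
    tail = answers[-patience:]
    return len(tail) == patience and None not in tail and len(set(tail)) == 1

chunks = ["Step 1: compute 6 * 7, so the answer is 42.",
          " Double-checking the product, the answer is 42.",
          " Therefore the answer is 42."]
print(should_stop(chunks))  # True -> truncate the remaining CoT tokens
```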

[NLP-65] Natural Language Processing to Enhance Deliberation in Political Online Discussions: A Survey

【速读】: 该论文试图解决政治在线讨论中由于平台设计和流程缺陷导致的讨论质量下降问题,这些问题影响了 deliberation(协商)的实现。解决方案的关键在于利用机器学习方法来识别并应对在线政治讨论中的问题,从而促进更高质量的协商过程。

链接: https://arxiv.org/abs/2506.02533
作者: Maike Behrendt,Stefan Sylvius Wagner,Carina Weinmann,Marike Bormann,Mira Warne,Stefan Harmeling
机构: Heinrich Heine University Düsseldorf(海因里希·海涅大学杜塞尔多夫分校); Technical University Dortmund(多特蒙德工业大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Political online participation in the form of discussing political issues and exchanging opinions among citizens is gaining importance with more and more formats being held digitally. To come to a decision, a careful discussion and consideration of opinions and a civil exchange of arguments, which is defined as the act of deliberation, is desirable. The quality of discussions and participation processes in terms of their deliberativeness highly depends on the design of platforms and processes. To facilitate online communication for both participants and initiators, machine learning methods offer a lot of potential. In this work we want to showcase which issues occur in political online discussions and how machine learning can be used to counteract these issues and enhance deliberation.
zh

[NLP-66] Reasoning Flow: Semantic Structure of Complex Reasoning Traces ACL2025

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)生成的复杂推理轨迹在语义结构分析上的挑战,其核心问题是如何有效理解和表征这些轨迹中的推理模式。解决方案的关键在于提出了一种统一的框架——ReasoningFlow,该框架将推理轨迹解析为有向无环图,从而将不同的推理模式表征为子图结构,提供了一种人类可解释的表示方法,有助于深入理解、评估和提升LRMs的推理过程。

链接: https://arxiv.org/abs/2506.02532
作者: Jinu Lee,Sagnik Mukherjee,Dilek Hakkani-Tur,Julia Hockenmaier
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: 10 pages, 6 figures. ArgMining 2025 Workshop (Non-archival) @ ACL 2025

点击查看摘要

Abstract:Large reasoning models (LRMs) generate complex reasoning traces with planning, reflection, verification, and backtracking. In this work, we introduce ReasoningFlow, a unified schema for analyzing the semantic structures of these complex traces. ReasoningFlow parses traces into directed acyclic graphs, enabling the characterization of distinct reasoning patterns as subgraph structures. This human-interpretable representation offers promising applications in understanding, evaluating, and enhancing the reasoning processes of LRMs.
zh

[NLP-67] Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs

【速读】: 该论文旨在解决Web应用测试中由于界面复杂性和动态性导致的可靠性保障难题,特别是针对站点导航和表单填写这两个关键测试方面。其解决方案的关键在于将图结构与大语言模型(LLMs)相结合,用于建模导航流程并生成测试场景,同时采用状态图方法处理条件表单并自动化Selenium脚本生成,从而提升测试用例的覆盖率和鲁棒性。

链接: https://arxiv.org/abs/2506.02529
作者: Nguyen-Khang Le,Quan Minh Bui,Minh Ngoc Nguyen,Hiep Nguyen,Trung Vo,Son T. Luu,Shoshin Nomura,Minh Le Nguyen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Published in the Proceedings of JSAI 2025

点击查看摘要

Abstract:Web applications are critical to modern software ecosystems, yet ensuring their reliability remains challenging due to the complexity and dynamic nature of web interfaces. Recent advances in large language models (LLMs) have shown promise in automating complex tasks, but limitations persist in handling dynamic navigation flows and complex form interactions. This paper presents an automated system for generating test cases for two key aspects of web application testing: site navigation and form filling. For site navigation, the system employs screen transition graphs and LLMs to model navigation flows and generate test scenarios. For form filling, it uses state graphs to handle conditional forms and automates Selenium script generation. Key contributions include: (1) a novel integration of graph structures and LLMs for site navigation testing, (2) a state graph-based approach for automating form-filling test cases, and (3) a comprehensive dataset for evaluating form-interaction testing. Experimental results demonstrate the system’s effectiveness in improving test coverage and robustness, advancing the state of web application testing.
zh

[NLP-68] Multilingual Information Retrieval with a Monolingual Knowledge Base SIGIR25

【速读】: 该论文旨在解决跨语言知识共享中的问题,特别是在高资源语言向低资源语言转移知识时,由于高质量知识库资源稀缺且语言有限,导致信息检索效果受限。其解决方案的关键在于提出一种基于加权采样的对比学习微调策略,以优化多语言嵌入模型,从而将不同语言的句子映射到与知识库语言相同的特征向量空间中,实现使用单语知识库进行多语言信息检索。

链接: https://arxiv.org/abs/2506.02527
作者: Yingying Zhuang,Aman Gupta,Anurag Beniwal
机构: Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 6 pages, accepted at GENNEXT@SIGIR25

点击查看摘要

Abstract:Multilingual information retrieval has emerged as a powerful tool for expanding knowledge sharing across languages. On the other hand, high-quality knowledge-base resources are often scarce and limited to a few languages; therefore an effective embedding model that transforms sentences from different languages into the feature vector space of the knowledge-base language becomes the key ingredient for cross-language knowledge sharing, especially for transferring knowledge available in high-resource languages to low-resource ones. In this paper we propose a novel strategy to fine-tune multilingual embedding models with weighted sampling for contrastive learning, enabling multilingual information retrieval with a monolingual knowledge base. We demonstrate that the weighted sampling strategy produces performance gains compared to standard ones by up to 31.03% in MRR and up to 33.98% in Recall@3. Additionally, our proposed methodology is language agnostic and applicable to both multilingual and code-switching use cases.
zh
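
下面给出一个示意性的草图,说明"加权采样 + 对比学习"的组合方式:按权重对(查询、正例)训练对进行采样,再用批内负例的 InfoNCE 损失训练。权重设置与温度系数均为假设,论文的具体加权方案以原文为准:

```python
import random
import torch
import torch.nn.functional as F

def weighted_batch(pairs, weights, batch_size=4):
    """Sample (query, passage) pairs with probability proportional to weight."""
    return random.choices(pairs, weights=weights, k=batch_size)

def info_nce(q, p, temperature=0.05):
    """In-batch-negatives InfoNCE over L2-normalised embeddings."""
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    logits = q @ p.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(q.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Multilingual queries paired with monolingual knowledge-base passages;
# low-resource languages get larger (assumed) sampling weights.
pairs = [("query_en", "kb_doc"), ("query_sw", "kb_doc"), ("query_id", "kb_doc")]
weights = [1.0, 3.0, 2.0]
batch = weighted_batch(pairs, weights)
q_emb, p_emb = torch.randn(4, 128), torch.randn(4, 128)  # stand-in encoder outputs
print(float(info_nce(q_emb, p_emb)))
```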

[NLP-69] Learning Together to Perform Better: Teaching Small-Scale LLM s to Collaborate via Preferential Rationale Tuning ACL

【速读】: 该论文试图解决在无法依赖大型语言模型(Large Language Models, LLMs)的情况下,如何提升小型语言模型(Small Language Models, SLMs)的内在推理能力的问题。现有方法通常依赖于从大型模型中蒸馏知识,但受版权和法律限制,这种方法在商业场景中受到阻碍。论文提出的解决方案关键在于COLLATE框架,该框架通过训练小型模型生成多样化的推理过程,并利用偏好优化选择最能产生真实答案的推理路径,从而提升模型在下游任务上的性能。

链接: https://arxiv.org/abs/2506.02519
作者: Sohan Patnaik,Milan Aggarwal,Sumit Bhatia,Balaji Krishnamurthy
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at ACL Main 2025

点击查看摘要

Abstract:LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improve the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of the ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (this https URL).
zh

[NLP-70] FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

【速读】: 该论文旨在解决金融任务中多步骤符号推理能力缺乏系统性评估基准的问题。现有数据集如FinQA和ConvFinQA仅监督最终数值答案,未评估中间推理步骤。为此,作者提出了FinChain,这是首个针对可验证的Chain-of-Thought (CoT) 金融推理设计的符号基准。其关键在于提供每主题下五种参数化模板,涵盖不同推理复杂度和领域专业知识需求,并包含可执行的Python追踪,支持大规模训练数据的自动生成及跨领域的适应性。此外,还引入了ChainEval,一种用于自动评估最终答案和中间推理的新指标。

链接: https://arxiv.org/abs/2506.02515
作者: Zhuohan Xie,Dhruv Sahnan,Debopriyo Banerjee,Georgi Georgiev,Rushil Thareja,Hachem Madmoun,Jinyan Su,Aaryamonvikram Singh,Yuxia Wang,Rui Xing,Fajri Koto,Haonan Li,Ivan Koychev,Tanmoy Chakraborty,Salem Lahlou,Veselin Stoyanov,Preslav Nakov
机构: MBZUAI(穆巴达拉科学技术研究院); FMI, Sofia University(数学与信息学院,索非亚大学); Quantsquare(量化方); Cornell University(康奈尔大学); IIT Delhi(印度理工学院德里分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 8 figures, 2 tables

点击查看摘要

Abstract:Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
zh
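
FinChain 的"参数化模板 + 可执行 Python 追踪"思路可以用一个自拟的复利模板来示意:同一模板在不同随机种子下生成不同题目,且每一步推理都是可程序化校验的。以下示例为笔者构造,并非数据集中的真实模板:

```python
import random

def compound_interest_template(seed: int):
    """One parameterised instance plus an executable step-by-step trace."""
    rng = random.Random(seed)
    principal = rng.randint(1, 50) * 1000
    rate = rng.choice([0.03, 0.04, 0.05])
    years = rng.randint(2, 10)

    question = (f"An account holds ${principal} at {rate:.0%} annual compound "
                f"interest. What is the balance after {years} years?")
    trace, balance = [], float(principal)
    for y in range(1, years + 1):
        balance *= 1 + rate
        trace.append(f"Year {y}: balance = {balance:.2f}")  # checkable step
    return question, trace, round(balance, 2)

q, steps, answer = compound_interest_template(seed=7)
print(q)
print("\n".join(steps))
print("Final answer:", answer)
```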

[NLP-71] M3FinMeeting: A Multilingual Multi-Sector and Multi-Task Financial Meeting Understanding Evaluation Dataset ACL-2025

【速读】: 该论文试图解决现有金融领域基准测试难以捕捉金融会议真实动态的问题,当前的金融基准多依赖新闻文章、财报或公告,无法全面反映实际金融会议的复杂性。其解决方案的关键是提出一个名为M³FinMeeting的新型基准,该基准具有多语言(支持英语、中文和日语)、多行业(覆盖全球行业分类标准GICS定义的多个行业)和多任务(包括摘要生成、问答对提取和问答任务)的特点,从而为金融会议理解提供更真实和全面的评估方式。

链接: https://arxiv.org/abs/2506.02510
作者: Jie Zhu,Junhui Li,Yalong Wen,Xiandong Li,Lifan Guo,Feng Chen
机构: Soochow University (苏州大学); Alibaba Cloud Computing (阿里云计算); Nanjing University (南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL-2025

点击查看摘要

Abstract:Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called M³FinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M³FinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M³FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M³FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.
zh

[NLP-72] KARE-RAG : Knowledge-Aware Refinement and Enhancement for RAG

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中由于检索文档中的噪声导致的事实不一致问题。其关键解决方案在于提升生成模型处理噪声内容的能力,具体通过三个核心创新实现:结构化知识表示以促进训练过程中的错误检测、基于密集直接偏好优化(Dense Direct Preference Optimization, DDPO)的改进训练目标以优先修正关键错误,以及一种保持语义一致性的同时修正事实性错误的对比数据生成流程。这些方法显著提升了RAG系统的性能,且在不同模型规模和任务场景下均表现出色。

链接: https://arxiv.org/abs/2506.02503
作者: Yongjian Li,HaoCheng Chu,Yukun Yan,Zhenghao Liu,Shi Yu,Zheni Zeng,Ruobing Wang,Sen Song,Zhiyuan Liu,Maosong Sun
机构: Tsinghua(清华大学); Xiamen University(厦门大学); Tsinghua University(清华大学); Northeastern University(东北大学); Chinese Academy of Sciences(中国科学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents-even with advanced retrieval methods. We demonstrate that enhancing generative models’ capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO)-a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on Github.
zh

[NLP-73] Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

【速读】: 该论文旨在解决多模态生成任务中评估能力不足的问题,特别是针对文本到图像(T2I)生成任务的评估能力缺失以及大规模人类评估数据的缺乏。其解决方案的关键在于构建一个大规模的多模态评估数据集Minos-Corpus,该数据集融合了人类和GPT的评估数据,并基于此数据集提出数据选择与平衡、Mix-SFT训练方法以及DPO训练策略,从而开发出性能优越的多模态评估模型Minos。

链接: https://arxiv.org/abs/2506.02494
作者: Junzhe Zhang,Huixuan Zhang,Xinyu Hu,Li Lin,Mingqi Gao,Shi Qiu,Xiaojun Wan
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学); School of Physics, Peking University (物理学院,北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Evaluation is important for multimodal generation tasks. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing work overlooks two aspects: (1) the development of evaluation capabilities for the text-to-image (T2I) generation task, and (2) the incorporation of large-scale human evaluation data. In this paper, we introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both human and GPT. The corpus contains evaluation data across both image-to-text (I2T) and T2I generation tasks. Based on this corpus, we propose Data Selection and Balance, Mix-SFT training methods, and apply DPO to develop Minos, a multimodal evaluation model built upon a 7B backbone. Minos achieves state-of-the-art (SoTA) performance on average across all tasks among open-source evaluation models of similar scale, and outperforms all open-source and closed-source models on evaluation of the T2I generation task. Extensive experiments demonstrate the importance of leveraging high-quality human evaluation data and jointly training on evaluation data from both I2T and T2I generation tasks.
zh

[NLP-74] Enhancing Large Language Models with Neurosymbolic Reasoning for Multilingual Tasks

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在长上下文场景中进行多目标推理时遇到的挑战,即相关信息分散在大量文档中,导致模型难以有效提取和整合。解决方案的关键在于提出神经符号增强推理(NeuroSymbolic Augmented Reasoning, NSAR),该方法在推理过程中结合了神经推理与符号推理的优势,通过显式提取文本中的符号事实并生成可执行的Python代码来处理复杂的推理步骤,从而提升多语言环境下推理的鲁棒性、可解释性和可扩展性。

链接: https://arxiv.org/abs/2506.02483
作者: Sina Bagheri Nezhad,Ameeta Agrawal
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025)

点击查看摘要

Abstract:Large language models (LLMs) often struggle to perform multi-target reasoning in long-context scenarios where relevant information is scattered across extensive documents. To address this challenge, we introduce NeuroSymbolic Augmented Reasoning (NSAR), which combines the benefits of neural and symbolic reasoning during inference. NSAR explicitly extracts symbolic facts from text and generates executable Python code to handle complex reasoning steps. Through extensive experiments across seven languages and diverse context lengths, we demonstrate that NSAR significantly outperforms both a vanilla RAG baseline and advanced prompting strategies in accurately identifying and synthesizing multiple pieces of information. Our results highlight the effectiveness of combining explicit symbolic operations with neural inference for robust, interpretable, and scalable reasoning in multilingual settings.
zh
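
NSAR 的"显式抽取符号事实并生成可执行 Python 代码"可用如下玩具示例说明;真实系统中事实抽取与代码生成均由 LLM 完成,这里用正则与字符串拼接代替,问题与数据皆为虚构:

```python
import re

TEXT = ("Alice owns 3 warehouses. Bob owns 5 warehouses. "
        "Chen owns 2 warehouses. Who owns the most?")

def extract_facts(text: str) -> dict:
    """In NSAR an LLM extracts the symbolic facts; a regex stands in here."""
    return {name: int(n)
            for name, n in re.findall(r"(\w+) owns (\d+) warehouses", text)}

def generate_program(facts: dict) -> str:
    """Emit executable Python for the symbolic reasoning step."""
    return "\n".join([
        f"facts = {facts!r}",
        "owner, count = max(facts.items(), key=lambda kv: kv[1])",
        "result = owner",
    ])

namespace = {}
exec(generate_program(extract_facts(TEXT)), namespace)  # symbolic execution
print(namespace["result"])  # -> Bob
```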

[NLP-75] Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths

【速读】: 该论文试图解决生成式 AI (Generative AI) 在伦理风险和价值倾向评估中,短形式问卷与长形式输出之间价值偏好一致性不足的问题。其解决方案的关键在于通过对比短形式反应与长形式回应中的价值偏好,并分析不同论点数量对用户表达的影响,从而揭示两者之间存在的弱相关性以及长形式生成设置间的一致性局限。研究结果表明,当前方法在确保跨应用场景的价值表达一致性方面仍存在显著不足。

链接: https://arxiv.org/abs/2506.02481
作者: Inderjeet Nair,Lu Wang
机构: University of Michigan, Ann Arbor, MI (密歇根大学,安娜堡分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluations of LLMs’ ethical risks and value inclinations often rely on short-form surveys and psychometric tests, yet real-world use involves long-form, open-ended responses – leaving value-related risks and preferences in practical settings largely underexplored. In this work, we ask: Do value preferences inferred from short-form tests align with those expressed in long-form outputs? To address this question, we compare value preferences elicited from short-form reactions and long-form responses, varying the number of arguments in the latter to capture users’ differing verbosity preferences. Analyzing five LLMs (llama3-8b, gemma2-9b, mistral-7b, qwen2-7b, and olmo-7b), we find (1) a weak correlation between value preferences inferred from short-form and long-form responses across varying argument counts, and (2) similarly weak correlation between preferences derived from any two distinct long-form generation settings. (3) Alignment yields only modest gains in the consistency of value expression. Further, we examine how long-form generation attributes relate to value preferences, finding that argument specificity negatively correlates with preference strength, while representation across scenarios shows a positive correlation. Our findings underscore the need for more robust methods to ensure consistent value expression across diverse applications.
zh

[NLP-76] ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities

【速读】: 该论文旨在解决如何高效生成高质量提示(prompt)以提升大型语言模型(Large Language Models, LLMs)在复杂任务中的性能问题。现有方法在提示优化方面存在计算开销大或依赖模型自身强优化能力的局限性。该论文提出的解决方案是ORPP(Optimized Role-Playing Prompt),其关键在于将提示搜索空间限定在角色扮演场景中,通过精心设计的高质量角色扮演提示充分激活模型的内在能力,从而实现性能提升。

链接: https://arxiv.org/abs/2506.02480
作者: Yifan Duan,Yihong Tang,Kehai Chen,Liqiang Nie,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High-quality prompts are crucial for eliciting outstanding performance from large language models (LLMs) on complex tasks. Existing research has explored model-driven strategies for prompt optimization. However, these methods often suffer from high computational overhead or require strong optimization capabilities from the model itself, which limits their broad applicability. To address these challenges, we propose ORPP (Optimized Role-Playing Prompt), a framework that enhances model performance by optimizing and generating role-playing prompts. The core idea of ORPP is to confine the prompt search space to role-playing scenarios, thereby fully activating the model's intrinsic capabilities through carefully crafted, high-quality role-playing prompts. Specifically, ORPP first performs iterative optimization on a small subset of training samples to generate high-quality role-playing prompts. Then, leveraging the model's few-shot learning capability, it transfers the optimization experience to efficiently generate suitable prompts for the remaining samples. Extensive experimental results show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance. Notably, ORPP demonstrates superior "plug-and-play" capability. In most cases, it can be integrated with various other prompt methods and further enhance their effectiveness.
zh

[NLP-77] BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)方面的脆弱性问题,即如何有效规避已对齐LLMs的安全机制以生成有害和不安全内容。论文提出的解决方案关键在于开发了一种新型的黑盒越狱攻击方法,称为BitBypass,其核心思想是利用连字符分隔的比特流伪装(hyphen-separated bitstream camouflage),通过操纵数据的基本信息表示形式(即连续比特)实现对安全对齐的突破,而非依赖于提示工程或对抗性扰动。

链接: https://arxiv.org/abs/2506.02479
作者: Kalyan Nakka,Nitesh Saxena
机构: Texas A&M University (德克萨斯A&M大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 24 pages, 24 figures, and 7 tables

点击查看摘要

Abstract:The inherent risk of generating harmful and unsafe content by Large Language Models (LLMs) has highlighted the need for their safety alignment. Various techniques like supervised fine-tuning, reinforcement learning from human feedback, and red-teaming were developed for ensuring the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking by exploiting the fundamental information representation of data as continuous bits, rather than leveraging prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, from an adversarial perspective revealed the capabilities of BitBypass in bypassing their safety alignment and tricking them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlight the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.
zh

[NLP-78] FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging

【速读】: 该论文试图解决在参数高效微调场景中,传统模型融合方法在合并全微调模型时遇到的任务干扰问题。解决方案的关键在于提出一种称为FroM的自适应融合方法,该方法通过直接使用Frobenius范数测量模型参数,而无需任何训练数据,从而有效缓解任务干扰问题,并在多种微调场景中优于基线方法。

链接: https://arxiv.org/abs/2506.02478
作者: Zijian Li,Xiaocheng Feng,Huixin Liu,Yichong Huang,Ting Liu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL)
备注: 12 pages, 11 figures

点击查看摘要

Abstract:With the development of large language models, fine-tuning has emerged as an effective method to enhance performance in specific scenarios by injecting domain-specific knowledge. In this context, model merging techniques provide a solution for fusing knowledge from multiple fine-tuning models by combining their parameters. However, traditional methods often encounter task interference when merging full fine-tuning models, and this problem becomes even more evident in parameter-efficient fine-tuning scenarios. In this paper, we introduce an improvement to the RegMean method, which indirectly leverages the training data to approximate the outputs of the linear layers before and after merging. We propose an adaptive merging method called FroM, which directly measures the model parameters using the Frobenius norm, without any training data. By introducing an additional hyperparameter for control, FroM outperforms baseline methods across various fine-tuning scenarios, alleviating the task interference problem.
zh
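
下面给出一个无数据(data-free)合并的示意草图:对每个参数张量按其 Frobenius 范数加权平均。这只是对"直接用 Frobenius 范数度量模型参数"的一种假设性读法,省略了论文中额外的控制超参数,不代表 FroM 的确切公式:

```python
import torch

def from_like_merge(state_dicts, eps=1e-8):
    """Weight each fine-tuned model's tensor by its Frobenius norm and
    average (an assumed reading; FroM's control hyperparameter is omitted)."""
    merged = {}
    for key in state_dicts[0]:
        tensors = [sd[key].float() for sd in state_dicts]
        norms = torch.tensor([t.norm(p="fro") for t in tensors]) + eps
        weights = norms / norms.sum()
        merged[key] = sum(w * t for w, t in zip(weights, tensors))
    return merged

a = {"linear.weight": torch.randn(4, 4)}
b = {"linear.weight": torch.randn(4, 4)}
print(from_like_merge([a, b])["linear.weight"].shape)
```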

[NLP-79] Comba: Improving Nonlinear RNNs with Closed-loop Control

【速读】: 该论文旨在解决传统序列建模方法在递归记忆管理上的效率与性能瓶颈问题,特别是针对现有模型如Mamba和GLA在递归状态与键向量之间缺乏有效交互的局限性。其解决方案的关键在于提出一种基于闭环控制理论的新型非线性循环神经网络(Nonlinear RNN)变体Comba,该模型采用标量加低秩的状态转移机制,并引入状态反馈与输出反馈校正,从而增强了模型的非线性递归结构和计算效率。

链接: https://arxiv.org/abs/2506.02475
作者: Jiaxi Hu,Yongqi Pan,Jusen Du,Disen Lan,Xiaqiang Tang,Qingsong Wen,Yuxuan Liang,Weigao Sun
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, resulting in a nonlinear recursive structure. In this paper, we first introduce the concept of Nonlinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Nonlinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates its superior performance and computation efficiency in both language and vision modeling.
zh
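
为直观理解"标量加低秩状态转移",下面给出一个假设性的逐步递推草图:S_t = α_t·S_{t-1} + β_t·k_t v_tᵀ,输出 y_t = S_tᵀ q_t。真实的 Comba 还包含输出反馈校正与分块并行的 Triton 核,这里仅保留最简形式:

```python
import torch

def scalar_plus_low_rank_scan(q, k, v, alpha, beta):
    """S_t = alpha[t] * S_{t-1} + beta[t] * outer(k_t, v_t); y_t = S_t^T q_t.
    A sequential reference; the real model uses a chunk-parallel kernel."""
    T, d = q.shape
    S = torch.zeros(d, v.shape[1])
    ys = []
    for t in range(T):
        S = alpha[t] * S + beta[t] * torch.outer(k[t], v[t])
        ys.append(S.T @ q[t])
    return torch.stack(ys)

T, d = 16, 8
out = scalar_plus_low_rank_scan(
    torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
    torch.sigmoid(torch.randn(T)),   # decay gate in (0, 1)
    torch.sigmoid(torch.randn(T)))   # write gate in (0, 1)
print(out.shape)  # (16, 8)
```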

[NLP-80] XToM: Exploring the Multilingual Theory of Mind for Large Language Models

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在跨语言情境下表现出的Theory of Mind (ToM)能力不足的问题,即LLMs是否能够在多种语言环境中推理他人的心理状态。解决方案的关键在于提出XToM,这是一个经过严格验证的多语言基准测试,涵盖了五种语言以及多样化的、情境丰富的任务场景,从而系统评估LLMs的多语言ToM能力。

链接: https://arxiv.org/abs/2506.02461
作者: Chunkit Chan,Yauwai Yim,Hongchuan Zeng,Zhiying Zou,Xinyuan Cheng,Zhifan Sun,Zheye Deng,Kawai Chung,Yuzhuo Ao,Yixiang Fan,Cheng Jiayang,Ercong Nie,Ginny Y. Wong,Helmut Schmid,Hinrich Schütze,Simon See,Yangqiu Song
机构: HKUST, Hong Kong; NVIDIA AI Technology Center (NVAITC), USA; Shanghai Jiao Tong University, China; Technische Universtität Darmstadt, Germany; LMU Munich, Germany; Munich Center for Machine Learning, Germany
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs’ ability to replicate human-like mentalizing across linguistic contexts.
zh

[NLP-81] MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在应用中如何在保持有用性的同时提升安全性的问题。现有方法通过安全约束的在线或离线偏好优化来解决该问题,但在线方法常因过度安全而降低有用性,离线方法则难以自适应地平衡安全性和有用性。论文提出的解决方案关键在于设计了一种基于专家混合(Mixture of Experts, MoE)的双偏好优化框架MidPO,通过将基础模型微调为两个独立的专家(安全专家和有用性专家),并在MoE框架中引入动态路由机制,实现对安全性和有用性的自适应平衡。

链接: https://arxiv.org/abs/2506.02460
作者: Yupeng Qi,Ziyu Lyu,Min Yang,Yanlin Wang,Lu Bai,Lixin Cui
机构: Sun Yat-sen University (中山大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Beijing Normal University (北京师范大学); Central University of Finance and Economics (中央财经大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies solve this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which might reduce helpfulness, while the safety-constrained offline methods perform poorly in adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a Mixture of Experts (MoE) framework for safety-helpfulness dual Preference Optimization. Firstly, MidPO devises a single-preference enhanced direct preference optimization approach to transform the base model into two independent experts, termed safety and helpfulness experts, and fine-tunes the two independent experts for optimal safety or helpfulness performance. Secondly, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism to allocate contributions from each expert adaptively. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.
zh

[NLP-82] Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agent ic Framework

【速读】: 该论文试图解决现有深度研究框架主要生成纯文本内容,而缺乏对文本与可视化内容混合生成的自动化研究问题(the automated generation of interleaved texts and visualizations)。解决方案的关键在于提出一种结构化的图表文本表示方法——形式化可视化描述(Formal Description of Visualization, FDV),使大型语言模型能够学习并生成多样且高质量的可视化内容。在此基础上,构建了多模态深度研究框架Multimodal DeepResearcher,通过四个阶段的任务分解实现多模态报告的生成。

链接: https://arxiv.org/abs/2506.02454
作者: Zhaorui Yang,Bo Pan,Han Wang,Yiyao Wang,Xingyu Liu,Minfeng Zhu,Bo Zhang,Wei Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 47 pages

点击查看摘要

Abstract:Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite its progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics served as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
zh

[NLP-83] IP-Dialog: Evaluating Implicit Personalization in Dialogue Systems with Synthetic Data

【速读】: 该论文试图解决现代对话系统中用户背景隐式推理与个性化服务的能力评估与提升所面临的高质量数据稀缺问题(high-quality data scarcity),以及传统数据集构建方法在人力、资源消耗和隐私保护方面的不足。解决方案的关键在于提出一种自动合成数据生成方法,并引入Implicit Personalized Dialogue (IP-Dialog)基准及其训练数据集,涵盖10个任务和12种用户属性类型,同时构建了包含四个指标的系统性评估框架,以衡量模型的属性感知与推理能力。

链接: https://arxiv.org/abs/2506.02449
作者: Bo Peng,Zhiheng Wang,Heyang Gong,Chaochao Lu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Sicore Ladder Tech Co. Ltd. (西科 ladder 科技有限公司)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In modern dialogue systems, the ability to implicitly infer user backgrounds from conversations and leverage this information for personalized assistance is crucial. However, the scarcity of high-quality data remains a fundamental challenge to evaluating and improving this capability. Traditional dataset construction methods are labor-intensive, resource-demanding, and raise privacy concerns. To address these issues, we propose a novel approach for automatic synthetic data generation and introduce the Implicit Personalized Dialogue (IP-Dialog) benchmark along with a training dataset, covering 10 tasks and 12 user attribute types. Additionally, we develop a systematic evaluation framework with four metrics to assess both attribute awareness and reasoning capabilities. We further propose five causal graphs to elucidate models’ reasoning pathways during implicit personalization. Extensive experiments yield insightful observations and prove the reliability of our dataset.
zh

[NLP-84] Should LLM Safety Be More Than Refusing Harmful Instructions?

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在长尾分布(加密)文本中的行为及其安全问题,特别是模型在面对有害混淆指令时的拒绝能力(instruction refusal)和生成安全性(generation safety)方面的表现。解决方案的关键在于提出一个二维框架来评估LLM的安全性,并通过全面实验揭示具备解密能力的模型可能面临不匹配泛化攻击(mismatched-generalization attacks),即其安全机制在至少一个安全维度上失效,导致不安全响应或过度拒绝。基于此,论文评估了多种预LLM和后LLM阶段的安全措施,并探讨了其优缺点。

链接: https://arxiv.org/abs/2506.02442
作者: Utsav Maskey,Mark Dras,Usman Naseem
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLMs in long-tail text scenarios and provides directions for developing robust safety mechanisms.
zh

[NLP-85] From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在扮演特定国籍角色时是否表现出情感刻板印象的问题。其解决方案的关键在于通过分析预训练LLMs中情感归因的国家差异,以及这些归因是否符合文化规范,从而揭示模型输出中可能存在的简化和偏见性刻板印象。

链接: https://arxiv.org/abs/2506.02431
作者: Mahammed Kamruzzaman,Abdullah Al Monsur,Gene Louis Kim,Anshuman Chhabra
机构: University of South Florida (南佛罗里达大学); North South University (南南大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.
zh

[NLP-86] Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

【速读】: 该论文旨在解决实体关系分类在信息抽取中的挑战性问题,尤其是在标注数据有限和关系结构复杂的情境下。其解决方案的关键在于探索三种不同的AI代理架构,包括反思自我评估、分层任务分解以及一种新颖的多代理动态示例生成机制,其中动态示例生成方法引入了实时协作与对抗性提示,以提升关系分类的性能。实验结果表明,多代理协调策略在多个领域和模型后端中均优于标准的小样本提示方法,并接近微调模型的性能。

链接: https://arxiv.org/abs/2506.02426
作者: Maryam Berijanian,Kuldeep Singh,Amin Sehati
机构: Michigan State University (密歇根州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source codes and dataset are available at this https URL.
zh

[NLP-87] Gender Inequality in English Textbooks Around the World: an NLP Approach

【速读】: 该论文试图解决跨文化背景下教科书中性别不平等问题的量化分析问题,旨在揭示不同文化圈层中英语教科书在角色数量、首次提及性别以及词频-逆文档频率(TF-IDF)词关联等方面的性别差异。解决方案的关键在于应用自然语言处理方法,包括字符计数、首现性别分析、TF-IDF词关联分析、专有名词性别模式识别、大型语言模型对性别化词表的区分能力测试以及GloVe嵌入技术对关键词与性别关联紧密程度的评估。

链接: https://arxiv.org/abs/2506.02425
作者: Tairan Liu
机构: 未知
类目: Computation and Language (cs.CL); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Textbooks play a critical role in shaping children’s understanding of the world. While previous studies have identified gender inequality in individual countries’ textbooks, few have examined the issue cross-culturally. This study applies natural language processing methods to quantify gender inequality in English textbooks from 22 countries across 7 cultural spheres. Metrics include character count, firstness (which gender is mentioned first), and TF-IDF word associations by gender. The analysis also identifies gender patterns in proper names appearing in TF-IDF word lists, tests whether large language models can distinguish between gendered word lists, and uses GloVe embeddings to examine how closely keywords associate with each gender. Results show consistent overrepresentation of male characters in terms of count, firstness, and named entities. All regions exhibit gender inequality, with the Latin cultural sphere showing the least disparity.
zh

[NLP-88] StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion INTERSPEECH2025

【速读】: 该论文旨在解决语音转换(Voice Conversion, VC)中如何有效保留语言内容同时准确迁移说话人特征的问题。传统方法通常直接从语音中提取说话人信息,而忽视了对语言内容的显式建模。解决方案的关键在于提出StarVC,这是一个统一的自回归语音转换框架,其核心在于首先预测文本标记(text tokens),再合成声学特征,从而实现语言内容与说话人特征的解耦,提升转换效果。

链接: https://arxiv.org/abs/2506.02414
作者: Fengjin Li,Jie Wang,Yadong Niu,Yongqing Wang,Meng Meng,Jian Luan,Zhiyong Wu
机构: Shenzhen International Graduate School (深圳国际研究生院)
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figures, Accepted by Interspeech 2025, Demo: this https URL

点击查看摘要

Abstract:Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. The experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS). Audio demo can be found at: this https URL.
zh

[NLP-89] SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning ACL2025

【速读】: 该论文旨在解决多语言和文化背景下生成式人工智能在儿童语言学习中的性能一致性与适应性问题,以及如何设计适合儿童的友好交互界面以提升学习效果。其解决方案的关键在于构建一个名为SingaKids的对话式辅导系统,该系统整合了密集图像描述、多语言对话交互、语音理解和生动语音生成技术,并通过多语言预训练、任务特定调优和支架优化来提升系统的跨语言适应性和教学有效性。

链接: https://arxiv.org/abs/2506.02412
作者: Zhengyuan Liu,Geyu Lin,Hui Li Tan,Huayun Zhang,Yanfeng Lu,Xiaoxue Gao,Stella Xin Yin,He Sun,Hock Huan Goh,Lung Hsiang Wong,Nancy F. Chen
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore; Nanyang Technological University, Singapore; National Institute of Education (NIE), Singapore
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Industry Track

点击查看摘要

Abstract:The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.
zh

[NLP-90] GraphRAG -Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation

【速读】: 该论文试图解决当前GraphRAG模型评估体系存在的局限性,即现有评估主要依赖传统问答数据集,无法全面衡量GraphRAG在复杂推理能力上的提升。解决方案的关键是提出GraphRAG-Bench,一个大规模、领域特定的基准测试平台,其核心优势包括:设计具有挑战性的多跳推理问题、覆盖多样化的任务类型以及提供端到端的评估框架,从而全面评估图结构化对模型推理能力的增强效果。

链接: https://arxiv.org/abs/2506.02404
作者: Yilin Xiao,Junnan Dong,Chuang Zhou,Su Dong,Qianwen Zhang,Di Yin,Xing Sun,Xiao Huang
机构: The Hong Kong Polytechnic University (香港理工大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key superiorities: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in twenty core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.
zh

[NLP-91] Consultant Decoding: Yet Another Synergistic Mechanism ACL2025

【速读】: 该论文试图解决基于推测解码(Speculative Decoding, SD)的机制在大语言模型(Large Language Models, LLMs)推理过程中因高拒绝率导致需要多次调用LLMs验证候选标记,从而降低整体效率的问题。其解决方案的关键在于提出一种新的协同机制——顾问解码(Consultant Decoding, CD),该机制通过仅使用LLM计算的逐标记似然性来验证候选标记,而非依赖于重要性采样衍生的度量标准,从而显著提升了推理速度并降低了对大型目标模型的调用频率。

链接: https://arxiv.org/abs/2506.02391
作者: Chuanghao Ding,Jiaping Wang,Ziqing Yang,Xiaoliang Wang,Dahua Lin,Cam-Tu Nguyen,Fei Tan
机构: Nanjing University (南京大学); East China Normal University (华东师范大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 findings

点击查看摘要

Abstract:The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates require repeated LLM calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergetic mechanism, Consultant Decoding (CD). Unlike SD, which relies on a metric derived from importance sampling for verification, CD verifies candidate drafts using token-level likelihoods computed solely by the LLM. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (around 100% of the target model's performance). Interestingly, this is achieved by combining models whose parameter sizes differ by two orders of magnitude. In addition, CD reduces the call frequency of the large target model to below 10%, particularly in more demanding tasks. CD's performance was even found to surpass that of the large target model, which theoretically represents the upper bound for speculative decoding.
zh

[NLP-92] Exploring Explanations Improves the Robustness of In-Context Learning ACL2025

【速读】: 该论文旨在解决传统上下文学习(In-context learning, ICL)在面对分布外数据时泛化能力不足的问题。其解决方案的关键在于,在基于解释的上下文学习(Explanation-based In-context Learning, X-ICL)的基础上引入X²-ICL框架,系统性地探索所有可能标签的解释,从而提升模型决策的全面性和鲁棒性。

链接: https://arxiv.org/abs/2506.02378
作者: Ukyo Honda,Tatsushi Oka
机构: CyberAgent(赛博集团); Keio University(庆应大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2025 (Main Conference)

点击查看摘要

Abstract:In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X²-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X²-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.
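As a rough illustration, the sketch below assembles an X²-ICL-style prompt in which every demonstration carries an explanation for each candidate label rather than only the gold one; the field layout and wording are assumptions, since the paper's exact template is not reproduced here.

```python
def build_x2_icl_prompt(demos, query, labels):
    """demos: list of (text, gold_label, {label: explanation}) triples (assumed schema)."""
    parts = []
    for text, gold, explanations in demos:
        parts.append(f"Input: {text}")
        for label in labels:
            # explore the reasoning for every possible label, not just the gold one
            parts.append(f"If the label were '{label}': {explanations[label]}")
        parts.append(f"Answer: {gold}\n")
    parts.append(f"Input: {query}\nAnswer:")
    return "\n".join(parts)
```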
zh

[NLP-93] AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output

【速读】: 该论文旨在解决日本大型语言模型(Large Language Model, LLM)输出的安全性和适当性问题,通过构建一个专门针对日本社会文化背景的问答数据集——AnswerCarefully,来促进模型输出的安全性。该数据集包含1,800对问题与参考答案,覆盖了先前英文数据集中定义的多种风险类别,但其数据样本是人工创建的,以反映日本地区LLM使用的社会文化情境。解决方案的关键在于利用该数据集对日本LLM进行微调,从而在不损害通用响应实用性的情况下提升输出安全性。

链接: https://arxiv.org/abs/2506.02372
作者: Hisami Suzuki,Satoru Katsumata,Takashi Kodama,Tetsuro Takahashi,Kouta Nakayama,Satoshi Sekine
机构: NII-LLMC; Retrieva, Inc.; Kagoshima University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper we present AnswerCarefully, a dataset for promoting the safety and appropriateness of Japanese LLM outputs. The dataset consists of 1,800 pairs of questions and reference answers, where the questions require special attention in answering. It covers a wide range of risk categories established in prior English-language datasets, but the data samples are original in that they are manually created to reflect the socio-cultural context of LLM usage in Japan. We show that using this dataset for instruction to fine-tune a Japanese LLM led to improved output safety without compromising the utility of general responses. We also report the results of a safety evaluation of 12 Japanese LLMs using this dataset as a benchmark. Finally, we describe the latest update on the dataset which provides English translations and annotations of the questions, aimed at facilitating the derivation of similar datasets in different languages and regions.
zh

[NLP-94] DIAMOND: An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization ACL2025

【速读】: 该论文试图解决传统方法在棒球比赛关键时刻识别上的不足,如无法捕捉战略深度、节奏变化和故事线发展,同时克服人工标注成本高且难以扩展的问题。解决方案的关键在于提出DIAMOND,一个基于大语言模型(LLM)的上下文感知棒球精彩瞬间摘要生成代理,其核心是将结构化体育分析与自然语言推理相结合,利用棒球统计指标(如胜率期望、WPA和关键性指数)量化比赛重要性,并通过LLM模块增强基于情境叙事价值的选择,从而实现定量严谨性与定性丰富性的统一。

链接: https://arxiv.org/abs/2506.02351
作者: Jeonghun Kang,Soonmok Kwon,Joonseok Lee,Byung-Hak Kim
机构: TVING; Seoul National University (首尔大学); CJ Corporation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in the First REALM (Research on Agent Language Models) workshop at ACL 2025

点击查看摘要

Abstract:Traditional approaches – such as Win Probability Added (WPA)-based ranking or computer vision-driven event detection – can identify scoring plays but often miss strategic depth, momentum shifts, and storyline progression. Manual curation remains the gold standard but is resource-intensive and not scalable. We introduce DIAMOND, an LLM-driven agent for context-aware baseball highlight summarization that integrates structured sports analytics with natural language reasoning. DIAMOND leverages sabermetric features – Win Expectancy, WPA, and Leverage Index – to quantify play importance, while an LLM module enhances selection based on contextual narrative value. This hybrid approach ensures both quantitative rigor and qualitative richness, surpassing the limitations of purely statistical or vision-based systems. Evaluated on five diverse Korean Baseball Organization League games, DIAMOND improves F1-score from 42.9% (WPA-only) to 84.8%, outperforming both commercial and statistical baselines. Though limited in scale, our results highlight the potential of modular, interpretable agent-based frameworks for event-level summarization in sports and beyond.
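A toy sketch of the hybrid scoring idea follows: a sabermetric importance score is blended with an LLM's narrative rating. The blending weight `alpha`, the 0-1 normalization, the play schema, and the prompt are all assumptions; `llm` stands for any callable returning the model's text output.

```python
def rank_highlights(plays, llm, alpha=0.5):
    """plays: dicts with 'wpa', 'leverage_index', and 'desc' fields (assumed schema)."""
    def stat_score(p):
        # crude 0-1 proxy built from Win Probability Added and Leverage Index
        return min(1.0, max(abs(p["wpa"]), p["leverage_index"] / 10.0))

    scored = []
    for p in plays:
        narrative = float(llm(f"Rate from 0 to 1 the narrative value of this play: {p['desc']}"))
        scored.append((alpha * stat_score(p) + (1 - alpha) * narrative, p))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored]
```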
zh

[NLP-95] Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection

【速读】: 该论文试图解决虚假信息检测模型依赖表面线索(即“快捷方式”)的问题,这些线索在训练数据中与虚假信息相关,但无法适应现实世界中虚假信息的多样性和演变性。这一问题因大型语言模型(Large Language Models, LLMs)能够通过简单提示生成具有说服力的虚假信息而变得更加严峻。解决方案的关键在于提出SMF框架,该框架通过改写、事实摘要和情感归一化等方法增强数据多样性,从而减少模型对快捷方式的依赖,提升模型对深层语义的理解能力。

链接: https://arxiv.org/abs/2506.02350
作者: Herun Wan,Jiaying Wu,Minnan Luo,Zhi Zeng,Zhixiong Su
机构: Xi’an Jiaotong University (西安交通大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Misinformation detection models often rely on superficial cues (i.e., shortcuts) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation through simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose SMF, an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, encouraging models to rely on deeper semantic understanding rather than shortcut cues. To promote the development of misinformation detectors, we have published the resources publicly at this https URL.
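SMF's three augmentation operations lend themselves to a simple prompting loop; the sketch below assumes an `llm` callable and illustrative prompt wording, not the paper's actual prompts.

```python
def smf_augment(example, llm):
    """Produce shortcut-mitigating views of one training example (sketch)."""
    prompts = {
        "paraphrase": f"Paraphrase the text, preserving every factual claim:\n{example}",
        "factual_summary": f"Summarize only the verifiable factual claims in:\n{example}",
        "sentiment_normalized": f"Rewrite the text in neutral, unemotional language:\n{example}",
    }
    return {name: llm(prompt) for name, prompt in prompts.items()}
```

Training a detector on these rewritten views alongside the originals is what discourages it from latching onto surface wording or sentiment as a shortcut.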
zh

[NLP-96] STORYTELLER: An Enhanced Plot-Planning Framework for Coherent and Cohesive Story Generation

【速读】: 该论文旨在解决自动故事生成中叙事连贯性和逻辑一致性不足的问题(narrative coherence and logical consistency),这一问题限制了生成故事的质量和用户体验。其解决方案的关键在于提出一种名为Storyteller的新方法,该方法通过引入基于语言学基础的主谓宾(SVO)三元组的剧情节点结构,以及集成动态模块STORYLINE和叙事实体知识图谱(NEKG),实现了对故事生成过程的持续交互,从而确保生成故事在结构、连贯性和沉浸感方面的提升。

链接: https://arxiv.org/abs/2506.02347
作者: Jiaming Li,Yukun Chen,Ziqiang Liu,Minghuan Tan,Lei Zhang,Yunshui Li,Run Luo,Longze Chen,Jing Luo,Ahmadreza Argha,Hamid Alinejad-Rokny,Wei Zhou,Min Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stories are central to human culture, serving to share ideas, preserve traditions, and foster connections. Automatic story generation, a key advancement in artificial intelligence (AI), offers new possibilities for creating personalized content, exploring creative ideas, and enhancing interactive experiences. However, existing methods struggle to maintain narrative coherence and logical consistency. This disconnect compromises the overall storytelling experience, underscoring the need for substantial improvements. Inspired by human cognitive processes, we introduce Storyteller, a novel approach that systematically improves the coherence and consistency of automatically generated stories. Storyteller introduces a plot node structure based on linguistically grounded subject-verb-object (SVO) triplets, which capture essential story events and ensure a consistent logical flow. Unlike previous methods, Storyteller integrates two dynamic modules, the STORYLINE and narrative entity knowledge graph (NEKG), that continuously interact with the story generation process. This integration produces structurally sound, cohesive and immersive narratives. Extensive experiments demonstrate that Storyteller significantly outperforms existing approaches, achieving an 84.33% average win rate through human preference evaluation. At the same time, it is also far ahead in other aspects including creativity, coherence, engagement, and relevance.
zh

[NLP-97] One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL ACL2025

【速读】: 该论文试图解决当前大型推理模型(Large Reasoning Model, LRM)依赖已有模型(如R1)进行训练所导致的局限性问题,旨在推动独立LRM的发展。其解决方案的关键在于构建一个不依赖于推理时扩展的大型语言模型(Large Language Model, LLM)生成的长链式思维(Long Chain-of-Thought, CoT)数据集,并通过一种管道机制将新颖的推理策略引入短CoT LLM中,从而增强其推理能力并实现对思维预算的可控性。

链接: https://arxiv.org/abs/2506.02338
作者: Hyungjoo Chae,Dongjin Kang,Jihyuk Kim,Beong-woo Kwak,Sunghyun Park,Haeju Park,Jinyoung Yeo,Moontae Lee,Kyungjae Lee
机构: Yonsei University (延世大学); LG AI Research (LG人工智能研究); University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Industry

点击查看摘要

Abstract:With the release of R1, a publicly available large reasoning model (LRM), researchers commonly train new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on the existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to, or slightly below, R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning: models initialized on our data achieve 2-3x larger gains with RLVR.
zh

[NLP-98] Something Just Like TRuST: Toxicity Recognition of Span and Target

【速读】: 该论文试图解决在线内容(包括由语言模型生成的内容)中的毒性问题,该问题因其潜在的负面心理和社会影响而成为关键关注点。解决方案的关键在于引入TRuST数据集,这是一个综合性的数据集,融合了现有数据集,并包含毒性、目标社会群体及有毒片段的标签,涵盖了如种族、性别、宗教、残疾和政治等多样化的目标群体,同时包含人工/机器标注和人工机器生成的数据,以提升毒性检测的准确性。

链接: https://arxiv.org/abs/2506.02326
作者: Berk Atil,Namrata Sureddy,Rebecca J. Passonneau
机构: Penn State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Toxicity in online content, including content generated by language models, has become a critical concern due to its potential for negative psychological and social impact. This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection that merges existing datasets, and has labels for toxicity, target social group, and toxic spans. It includes a diverse range of target groups such as ethnicity, gender, religion, disability, and politics, with both human/machine-annotated and human/machine-generated data. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups. Further, reasoning capabilities do not significantly improve performance, indicating that LLMs have weak social reasoning skills.
zh

[NLP-99] Quantifying Misattribution Unfairness in Authorship Attribution

【速读】: 该论文试图解决作者归属错误(authorship misattribution)在实际应用中可能带来的不公平问题,特别是某些作者在候选池中面临更高的被错误归因风险。解决方案的关键在于引入一个名为Misattribution Unfairness Index (MAUIk) 的度量方法,该方法通过评估作者在其未撰写文本中被排名靠前的频率来量化不公平性。研究发现,模型在潜在搜索空间中的作者向量嵌入方式与不公平性密切相关,尤其靠近嵌入作者中心的作者具有更高的被错误归因风险。

链接: https://arxiv.org/abs/2506.02321
作者: Pegah Alipoormolabashi,Ajay Patel,Niranjan Balasubramanian
机构: Stony Brook University (斯托尼布鲁克大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Authorship misattribution can have profound consequences in real life. In forensic settings simply being considered as one of the potential authors of an evidential piece of text or communication can result in undesirable scrutiny. This raises a fairness question: Is every author in the candidate pool at equal risk of misattribution? Standard evaluation measures for authorship attribution systems do not explicitly account for this notion of fairness. We introduce a simple measure, Misattribution Unfairness Index (MAUIk), which is based on how often authors are ranked in the top k for texts they did not write. Using this measure we quantify the unfairness of five models on two different datasets. All models exhibit high levels of unfairness with increased risks for some authors. Furthermore, we find that this unfairness relates to how the models embed the authors as vectors in the latent search space. In particular, we observe that the risk of misattribution is higher for authors closer to the centroid (or center) of the embedded authors in the haystack. These results indicate the potential for harm and the need for communicating with and calibrating end users on misattribution risk when building and providing such models for downstream use.
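The per-author misattribution rate behind MAUIk can be computed directly from an attribution score matrix. The sketch below returns each author's top-k misattribution rate; the final aggregation of these rates into a single index (e.g., their dispersion or maximum) is an assumption.

```python
import numpy as np

def misattribution_rates(scores, true_authors, k=5):
    """scores: (n_texts, n_authors) attribution scores; true_authors: author index
    per text. Returns, per author, the fraction of other people's texts for which
    that author is ranked in the top k."""
    n_texts, n_authors = scores.shape
    topk = np.argsort(-scores, axis=1)[:, :k]      # top-k candidate authors per text
    hits = np.zeros(n_authors)
    chances = np.zeros(n_authors)
    for t in range(n_texts):
        for a in range(n_authors):
            if a == true_authors[t]:
                continue                           # only count texts the author did not write
            chances[a] += 1
            if a in topk[t]:
                hits[a] += 1
    return hits / np.maximum(chances, 1)
```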
zh

[NLP-100] ResearchCodeBench: Benchmarking LLM s on Implementing Novel Machine Learning Research Code

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在将近期研究论文中的新想法转化为可执行代码方面的能力不足问题。其解决方案的关键是引入ResearchCodeBench,这是一个包含212个编码挑战的基准测试集,用于评估LLMs将顶级2024-2025年机器学习研究论文中的前沿成果转化为可运行代码的能力。通过这一基准测试,研究人员能够系统地评估和比较不同LLMs的代码生成性能,并揭示其在实现准确性和错误模式方面的表现。

链接: https://arxiv.org/abs/2506.02314
作者: Tianyu Hua,Harper Hua,Violet Xiang,Benjamin Klieger,Sang T. Truong,Weixin Liang,Fan-Yun Sun,Nick Haber
机构: Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement novel ideas from recent research papers-ideas unseen during pretraining-remains unclear. We introduce ResearchCodeBench, a benchmark of 212 coding challenges that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We find Gemini-2.5-Pro-Preview to perform best at 37.3% success rate, with O3 (High) and O4-mini (High) following behind at 32.3% and 30.8% respectively. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous and community-driven evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.
zh

[NLP-101] Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)虽然能够解释语法规则,但在判断句子可接受性时却难以应用这些规则的问题。解决方案的关键在于提出“语法提示”(grammar prompting)方法,即首先让LLM生成相关句法现象的简洁解释,随后将该解释作为额外上下文反馈给目标模型(无论是LLM还是小型语言模型,SLM),以辅助其判断最小对句中哪个句子是符合语法规则的。这一方法在多个语言的基准测试中显著提升了性能,尤其在缩小LLM与SLM之间的准确率差距方面效果显著。

链接: https://arxiv.org/abs/2506.02302
作者: Russell Scheinberg,Ameeta Agrawal,Amber Shore,So Young Lee
机构: Portland State University (波特兰州立大学); Miami University (迈阿密大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025 Findings

点击查看摘要

Abstract:Large language models (LLMs) can explain grammatical rules, yet they often fail to apply those rules when judging sentence acceptability. We present "grammar prompting", an explain-then-process paradigm: a large LLM first produces a concise explanation of the relevant syntactic phenomenon, then that explanation is fed back as additional context to the target model – either an LLM or a smaller language model (SLM) – before deciding which sentence of a minimal pair is grammatical. On the English BLiMP, Chinese SLING, and Russian RuBLiMP benchmarks, this simple prompt design yields substantial improvements over strong baselines across many syntactic phenomena. Feeding an LLM's metalinguistic explanation back to the target model bridges the gap between knowing a rule and using it. On SLMs, grammar prompting alone trims the average LLM-SLM accuracy gap by about 20%, and when paired with chain-of-thought, by 56% (from 13.0 pp to 5.8 pp), all at negligible cost. The lightweight, language-agnostic cue lets low-cost SLMs approach frontier-LLM performance in multilingual settings.
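The explain-then-process recipe is a two-call pipeline, sketched below with `large_lm` and `target_lm` as assumed callables that map a prompt to generated text; the prompt wording is illustrative.

```python
def grammar_prompt_judge(large_lm, target_lm, sentence_a, sentence_b, phenomenon):
    """Stage 1: the large LLM explains the syntactic rule.
    Stage 2: the target model judges the minimal pair with that explanation as context."""
    explanation = large_lm(
        f"In two sentences, explain the syntactic rule governing {phenomenon}."
    )
    return target_lm(
        f"Rule: {explanation}\n"
        f"Which sentence is grammatical?\n(A) {sentence_a}\n(B) {sentence_b}\n"
        "Answer with A or B."
    )
```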
zh

[NLP-102] LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

【速读】: 该论文旨在解决大型行动模型(Large Action Models, LAMs)在训练过程中对高质量数据的依赖问题,尤其是在涉及多步骤任务(如规划、执行工具调用和响应反馈)时的数据获取难题。其解决方案的关键在于提出LAM SIMULATOR,这是一个用于在线探索代理任务的综合性框架,具备动态任务查询生成器、丰富的工具集以及交互式环境,使大型语言模型(Large Language Model, LLM)代理能够自主调用工具并接收实时反馈,从而生成高质量的行动轨迹数据,用于LAMs的训练。

链接: https://arxiv.org/abs/2506.02298
作者: Thai Hoang,Kung-Hsiang Huang,Shirley Kokane,Jianguo Zhang,Zuxin Liu,Ming Zhu,Jake Grigsby,Tian Lan,Michael S Ryoo,Chien-Sheng Wu,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong,Juan Carlos Niebles
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: LAM Simulator framework for agentic data generation

点击查看摘要

Abstract:Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting LAM SIMULATOR's efficiency and effectiveness in speeding up development of AI agents.
zh

[NLP-103] Sounding Like a Winner? Prosodic Differences in Post-Match Interviews INTERSPEECH2025

【速读】: 该论文试图解决如何通过运动员赛后采访中的语音特征来判断其比赛胜负的问题,其核心是利用韵律特征(prosodic features)和自监督学习(SSL)表示进行比赛结果分类。解决方案的关键在于提取传统声学特征和深度语音表示,并结合机器学习分类器,其中SSL模型如Wav2Vec 2.0和HuBERT在捕捉与情绪状态相关的细微语音模式方面表现出色,同时韵律线索如音高变化仍然是判断胜利的重要指标。

链接: https://arxiv.org/abs/2506.02283
作者: Sofoklis Kakouros,Haoyu Chen
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:This study examines the prosodic characteristics associated with winning and losing in post-match tennis interviews. Additionally, this research explores the potential to classify match outcomes solely based on post-match interview recordings using prosodic features and self-supervised learning (SSL) representations. By analyzing prosodic elements such as pitch and intensity, alongside SSL models like Wav2Vec 2.0 and HuBERT, the aim is to determine whether an athlete has won or lost their match. Traditional acoustic features and deep speech representations are extracted from the data, and machine learning classifiers are employed to distinguish between winning and losing players. Results indicate that SSL representations effectively differentiate between winning and losing outcomes, capturing subtle speech patterns linked to emotional states. At the same time, prosodic cues – such as pitch variability – remain strong indicators of victory.
zh

[NLP-104] ImpRAG: Retrieval-Augmented Generation with Implicit Queries

【速读】: 该论文试图解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统中检索与生成过程分离所带来的任务泛化能力受限的问题。其关键解决方案是提出一种无需显式查询的RAG系统ImpRAG,通过将检索与生成整合为统一模型,使模型能够隐式表达信息需求。ImpRAG通过将预训练解码器-only语言模型划分为专门的层组,同时优化检索与生成任务,并采用两阶段推理流程,在相同模型参数和前向传播下实现检索与生成,从而减少检索器与语言模型之间的差异。

链接: https://arxiv.org/abs/2506.02279
作者: Wenzheng Zhang,Xi Victoria Lin,Karl Stratos,Wen-tau Yih,Mingda Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems traditionally treat retrieval and generation as separate processes, requiring explicit textual queries to connect them. This separation can limit the ability of models to generalize across diverse tasks. In this work, we propose a query-free RAG system, named ImpRAG, which integrates retrieval and generation into a unified model. ImpRAG allows models to implicitly express their information needs, eliminating the need for human-specified queries. By dividing pretrained decoder-only language models into specialized layer groups, ImpRAG optimizes retrieval and generation tasks simultaneously. Our approach employs a two-stage inference process, using the same model parameters and forward pass for both retrieval and generation, thereby minimizing the disparity between retrievers and language models. Experiments on 8 knowledge-intensive tasks demonstrate that ImpRAG achieves 3.6-11.5 point improvements in exact match scores on unseen tasks with diverse formats, highlighting its effectiveness in enabling models to articulate their own information needs and generalize across tasks. Our analysis underscores the importance of balancing retrieval and generation parameters and leveraging generation perplexities as retrieval training objectives for enhanced performance.
zh

[NLP-105] CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment

【速读】: 该论文试图解决在对话系统中高效教授专业且未见过的任务的问题,这一过程因专家知识成本、训练数据需求及技术难度而面临挑战。解决方案的关键在于提出一种名为CoDial(Code for Dialogue)的框架,该框架通过将专家知识表示为结构化的异构图,并将其转换为可执行的对话逻辑,从而实现任务导向型对话系统的可解释、可修改且真正的零样本规范。

链接: https://arxiv.org/abs/2506.02264
作者: Radin Shayanfar,Chu Fei Luo,Rohan Bhambhoria,Samuel Dahan,Xiaodan Zhu
机构: Queen’s University(皇后大学); Cornell Law School(康奈尔法学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:It is often challenging to teach specialized, unseen tasks to dialogue systems due to the high cost of expert knowledge, training data, and high technical difficulty. To support domain-specific applications - such as law, medicine, or finance - it is essential to build frameworks that enable non-technical experts to define, test, and refine system behaviour with minimal effort. Achieving this requires cross-disciplinary collaboration between developers and domain specialists. In this work, we introduce a novel framework, CoDial (Code for Dialogue), that converts expert knowledge, represented as a novel structured heterogeneous graph, into executable conversation logic. CoDial can be easily implemented in existing guardrailing languages, such as Colang, to enable interpretable, modifiable, and true zero-shot specification of task-oriented dialogue systems. Empirically, CoDial achieves state-of-the-art performance on the STAR dataset for inference-based models and is competitive with similar baselines on the well-known MultiWOZ dataset. We also demonstrate CoDial’s iterative improvement via manual and LLM-aided feedback, making it a practical tool for expert-guided alignment of LLMs in high-stakes domains.
zh

[NLP-106] Investigating the Impact of Word Informativeness on Speech Emotion Recognition INTERSPEECH2025

【速读】: 该论文试图解决在语音情绪识别中如何准确识别携带最相关声学变化的语音信号段这一关键问题(key challenge)。传统方法通过对整句话或更长的语音片段计算能量和F0等特征的功能量,可能忽略了长时统计中的细微变化。该研究提出的解决方案的关键在于利用预训练语言模型生成的词信息量(word informativeness)来识别语义重要的语音段,并仅在这些段落上计算声学特征,从而提升情绪识别的准确性。

链接: https://arxiv.org/abs/2506.02239
作者: Sofoklis Kakouros
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.
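Word informativeness from a pretrained language model is commonly operationalized as surprisal (negative log-probability); the sketch below uses GPT-2 via Hugging Face as a stand-in, which is an assumption rather than the paper's exact model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_surprisal(sentence, model_name="gpt2"):
    """Per-token surprisal in nats; higher values mark more informative tokens."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # predict token t+1 from t
    targets = ids[:, 1:]
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    return list(zip(tok.convert_ids_to_tokens(targets[0]), surprisal.tolist()))
```

Acoustic functionals would then be computed only over the speech segments time-aligned with the highest-surprisal words, rather than over the whole utterance.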
zh

[NLP-107] VLCD: Vision-Language Contrastive Distillation for Accurate and Efficient Automatic Placenta Analysis ALT AAAI

【速读】: 该论文试图解决医疗领域中基于视觉-语言对比学习(Vision-Language Contrastive Learning, VLC)的自动化方法计算复杂度高、部署受限的问题。其关键解决方案是提出两种改进策略:一是文本锚定的视觉-语言对比知识蒸馏(VLCD),作为一种用于医学VLC预训练的新知识蒸馏方法;二是利用大规模自然图像数据集进行无监督预蒸馏,以提升模型初始化效果。这些方法使得高效的神经网络能够在保持或超越教师模型性能的同时实现模型压缩和加速,从而提高医学VLC方法的效率与可部署性。

链接: https://arxiv.org/abs/2506.02229
作者: Manas Mehta,Yimu Pan,Kelly Gallagher,Alison D. Gernand,Jeffery A. Goldstein,Delia Mwinyelle,Leena Mithal,James Z. Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Proceedings of the 9th International Workshop on Health Intelligence, in conjunction with the Annual AAAI Conference on Artificial Intelligence, Philadelphia, Pennsylvania, March 2025

点击查看摘要

Abstract:Pathological examination of the placenta is an effective method for detecting and mitigating health risks associated with childbirth. Recent advancements in AI have enabled the use of photographs of the placenta and pathology reports for detecting and classifying signs of childbirth-related pathologies. However, existing automated methods are computationally expensive, which limits their deployability. We propose two modifications to vision-language contrastive learning (VLC) frameworks to enhance their accuracy and efficiency: (1) text-anchored vision-language contrastive knowledge distillation (VLCD), a new knowledge distillation strategy for medical VLC pretraining, and (2) unsupervised predistillation using a large natural images dataset for improved initialization. Our approach distills efficient neural networks that match or surpass the teacher model in performance while achieving model compression and acceleration. Our results showcase the value of unsupervised predistillation in improving the performance and robustness of our approach, specifically for lower-quality images. VLCD serves as an effective way to improve the efficiency and deployability of medical VLC approaches, making AI-based healthcare solutions more accessible, especially in resource-constrained environments.
zh

[NLP-108] Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics

【速读】: 该论文试图解决如何将自然语言处理(Natural Language Processing, NLP)方法有效应用于生物序列数据的分析问题,以提升对基因组学、转录组学和蛋白质组学等领域的理解。其解决方案的关键在于利用从人类语言中发展出来的技术,如词嵌入(word2vec)、Transformer模型以及Hyena算子等,通过适配这些模型以处理DNA、RNA和蛋白质序列,同时探索不同的分词策略和模型架构,以优化其在不同生物任务中的性能。此外,论文还强调了语言模型在结构预测、基因表达分析和进化研究等方面的潜力,展示了其在解析大规模基因组数据中的价值。

链接: https://arxiv.org/abs/2506.02212
作者: Ella Rannon,David Burstein
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.
zh

[NLP-109] KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)后训练过程中推理能力提升的效率与泛化性之间的矛盾问题。具体而言,强化学习(Reinforcement Learning, RL)虽然能够促进复杂推理行为的出现,但在初始策略难以探索高奖励轨迹时存在样本效率低的问题;而知识蒸馏(Knowledge Distillation, KD)虽能提高学习效率,但泛化能力较差。解决方案的关键在于提出一种统一的后训练框架KDRL,该框架通过结合教师监督(KD)与自我探索(RL)来联合优化推理模型,利用策略梯度优化同时最小化学生模型与教师模型分布之间的反向Kullback-Leibler散度(RKL)并最大化基于规则的预期奖励,从而在性能与推理令牌效率之间取得良好平衡。

链接: https://arxiv.org/abs/2506.02208
作者: Hongling Xu,Qi Zhu,Heyuan Deng,Jinpeng Li,Lu Hou,Yasheng Wang,Lifeng Shang,Ruifeng Xu,Fei Mi
机构: Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected rule-based rewards. We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance. Empirical results on multiple reasoning benchmarks demonstrate that KDRL outperforms GRPO and various KD baselines while achieving a favorable balance between performance and reasoning token efficiency. These findings indicate that integrating KD and RL serves as an effective and efficient strategy to train reasoning LLMs.
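A minimal per-token version of the combined objective might look like the following, where the KL coefficient `beta` and this particular exact reverse-KL computation are assumptions; the paper systematically compares several KL approximations and coefficients.

```python
import torch
import torch.nn.functional as F

def kdrl_loss(student_logits, teacher_logits, actions, advantages, beta=0.1):
    """Policy-gradient term plus reverse KL(student || teacher), per token.
    student_logits, teacher_logits: [N, V]; actions, advantages: [N]."""
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)
    action_logp = logp_s.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * action_logp).mean()                # maximize expected reward
    rkl = (logp_s.exp() * (logp_s - logp_t)).sum(-1).mean()     # reverse KL to the teacher
    return pg_loss + beta * rkl
```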
zh

[NLP-110] BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models ACL2025

【速读】: 该论文试图解决语言模型(Language Model, LM)评估中难以找到具有意义且可泛化的差异示例的问题,从而理解不同模型在性能上的优劣。其解决方案的关键在于提出一种自动化比较语言模型的方法——BehaviorBox,该方法利用感知性能的上下文嵌入来识别文本中细微的特征,这些特征能够体现两个模型在生成难度上的差异,例如“条件句中的‘were’”或“情感陈述后的感叹号”等特定语境下的表现差异。

链接: https://arxiv.org/abs/2506.02204
作者: Lindia Tjuatja,Graham Neubig
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Main Conference

点击查看摘要

Abstract:Language model evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choice of benchmarks is endless. Finding examples that show meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. Specifically, BehaviorBox finds features that describe groups of words in fine-grained contexts, such as "conditional 'were' in the phrase 'if you were'" and "exclamation marks after emotional statements", where one model outperforms another within a particular dataset. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.
zh

[NLP-111] Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution INTERSPEECH2025

【速读】: 该论文试图解决当前自动语音识别(ASR)模型依赖的特定声学线索尚不明确的问题,尤其是现有研究仅限于少量音素和过时模型。其解决方案的关键在于应用特征归因技术,以识别现代基于Conformer的ASR系统所依赖的相关声学线索,并通过分析爆破音、摩擦音和元音在时域和频域中的特性,评估这些线索与人类语音感知的关联性。

链接: https://arxiv.org/abs/2506.02181
作者: Dennis Fucci,Marco Gaido,Matteo Negri,Mauro Cettolo,Luisa Bentivogli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels’ full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.
zh

[NLP-112] Cocktail-Party Audio-Visual Speech Recognition INTERSPEECH2025

【速读】: 该论文旨在解决音频-视觉语音识别(Audio-Visual Speech Recognition, AVSR)在复杂环境下的性能瓶颈,特别是在鸡尾酒会场景中,现有模型通常假设说话人持续发声,而忽略了现实环境中存在的说话和无声面部片段的混合情况。解决方案的关键在于构建一个新颖的音频-视觉鸡尾酒会数据集,并提供一个包含1526小时的AVSR数据集,其中包含说话面部和无声面部片段,从而显著提升模型在极端噪声条件下的识别性能,相对当前最先进方法将词错误率(WER)降低了67%,从119%降至39.2%。

链接: https://arxiv.org/abs/2506.02178
作者: Thai-Binh Nguyen,Ngoc-Quan Pham,Alexander Waibel
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Audio-Visual Speech Recognition (AVSR) offers a robust solution for speech recognition in challenging environments, such as cocktail-party scenarios, where relying solely on audio proves insufficient. However, current AVSR models are often optimized for idealized scenarios with consistently active speakers, overlooking the complexities of real-world settings that include both speaking and silent facial segments. This study addresses this gap by introducing a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems and highlight the limitations of prior approaches in realistic noisy conditions. Additionally, we contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER by 67% relative to the state-of-the-art, reducing WER from 119% to 39.2% in extreme noise, without relying on explicit segmentation cues.
zh

[NLP-113] AI Debate Aids Assessment of Controversial Claims

【速读】: 该论文试图解决人工智能在信息准确性方面可能加剧虚假信息传播和加深社会分歧的问题,特别是在公共健康等关键领域,事实准确性直接影响个体福祉。其解决方案的关键在于利用AI辩论(AI debate)机制,通过让两个AI系统就有争议的新冠疫情事实性主张进行对立论证,引导持有不同信念和偏见的人类裁判更接近真相。研究发现,AI辩论相较于单一AI顾问系统能显著提升判断准确性和信心校准,尤其在主流观点持有者中效果更为明显,同时也能帮助持怀疑态度的裁判向准确观点靠拢。此外,具有拟人化特征的AI裁判在准确性上优于人类裁判和无个性特征的默认AI裁判,表明其在监督前沿AI模型方面的潜力。

链接: https://arxiv.org/abs/2506.02175
作者: Salman Rahman,Sheriff Issaka,Ashima Suvarna,Genglin Liu,James Shiffer,Jaeyoung Lee,Md Rizwan Parvez,Hamid Palangi,Shi Feng,Nanyun Peng,Yejin Choi,Julian Michael,Liwei Jiang,Saadia Gabriel
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides-especially on consequential topics like public health where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI truthfulness by enabling humans to supervise systems that may exceed human capabilities–yet humans themselves hold different beliefs and biases that impair their judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial COVID-19 factuality claims where people hold strong prior beliefs. We conduct two studies: one with human judges holding either mainstream or skeptical beliefs evaluating factuality claims through AI-assisted debate or consultancy protocols, and a second examining the same problem with personalized AI judges designed to mimic these different human belief systems. In our human study, we find that debate-where two AI advisor systems present opposing evidence-based arguments-consistently improves judgment accuracy and confidence calibration, outperforming consultancy with a single-advisor system by 10% overall. The improvement is most significant for judges with mainstream beliefs (+15.2% accuracy), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In our AI judge study, we find that AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight–leveraging both diverse human and AI judgments to move closer to truth in contested domains.
zh

[NLP-114] Different Speech Translation Models Encode and Translate Speaker Gender Differently ACL2025

【速读】: 该论文试图解决语音翻译(Speech Translation, ST)模型是否能够编码说话者性别信息的问题,以及这种编码对翻译结果中性别分配的影响。其解决方案的关键在于从可解释性角度出发,采用探测方法(probing methods)评估不同ST模型中的性别编码能力,进而分析新型架构在性别编码方面的不足及其导致的男性化翻译偏差。

链接: https://arxiv.org/abs/2506.02172
作者: Dennis Fucci,Marco Gaido,Matteo Negri,Luisa Bentivogli,Andre Martins,Giuseppe Attanasio
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯塞尔基金会); Unbabel (Unbabel); Instituto Superior Técnico (里斯本理工学院); Instituto de Telecomunicações (电信研究所)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025

点击查看摘要

Abstract:Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker’s gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures – integrating a speech encoder with a machine translation system via adapters – do not. We also demonstrate that low gender encoding capabilities result in systems’ tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.
zh

[NLP-115] A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering

【速读】: 该论文旨在解决异构生物医学数据集中Common Data Elements (CDEs)的协调问题,具体挑战包括语义异质性、结构变异性和上下文依赖性,以实现数据整合的标准化、提升互操作性并加速科学发现。其解决方案的关键在于利用Large Language Models (LLMs)生成上下文感知的文本嵌入,将CDE转换为捕捉语义关系的密集向量,并通过Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)进行无监督聚类,结合LLM摘要实现自动化标签,并通过监督学习训练分类器对新或未聚类的CDE进行分类。

链接: https://arxiv.org/abs/2506.02160
作者: Madan Krishnamurthy,Daniel Korn,Melissa A Haendel,Christopher J Mungall,Anne E Thessen
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM summarization, and (4) supervised learning to train a classifier assigning new or unclustered CDEs to labeled clusters. Evaluated on the NIH NLM CDE Repository with over 24,000 CDEs, the system identified 118 meaningful clusters at an optimized minimum cluster size of 20. The classifier achieved 90.46 percent overall accuracy, performing best in larger categories. External validation against Gravity Projects Social Determinants of Health domains showed strong agreement (Adjusted Rand Index 0.52, Normalized Mutual Information 0.78), indicating that embeddings effectively capture cluster characteristics. This adaptable and scalable approach offers a practical solution to CDE harmonization, improving selection efficiency and supporting ongoing data interoperability.
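Steps (1)-(2) of the pipeline reduce to a few lines with off-the-shelf libraries. The sketch below swaps the paper's LLM embeddings for a small sentence-transformer purely for illustration; the minimum cluster size of 20 matches the reported optimum.

```python
from sentence_transformers import SentenceTransformer
import hdbscan

def cluster_cdes(cde_texts, min_cluster_size=20):
    """Embed CDE descriptions and group them with HDBSCAN; -1 marks unclustered CDEs."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the LLM embedder
    embeddings = encoder.encode(cde_texts, normalize_embeddings=True)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(embeddings)
```

In the full framework, the resulting clusters would then be labeled by LLM summarization and used as training targets for the supervised classifier that routes new CDEs.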
zh

[NLP-116] HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation

【速读】: 该论文旨在解决神经转导器(Neural Transducer, NT)在语音翻译(Speech Translation, ST)任务中面临的词序重构困难和性能下降问题,以及现有NT方法在联合建模自动语音识别(ASR)与ST时计算成本高的问题。其解决方案的关键在于提出HENT-SRT(Hierarchical Efficient Neural Transducer for Speech Recognition and Translation)框架,通过因子分解ASR与翻译任务以更好地处理词序重构,并采用自蒸馏结合CTC一致性正则化来保证ST的鲁棒性同时保持ASR性能;此外,通过引入下采样分层编码器、无状态预测器和剪枝的转导损失提升计算效率,并在解码过程中引入空白惩罚以减少误删并提升翻译质量。

链接: https://arxiv.org/abs/2506.02157
作者: Amir Hussein,Cihan Xiao,Matthew Wiesner,Dan Povey,Leibny Paola Garcia,Sanjeev Khudanpur
机构: JHU (约翰霍普金斯大学); CLSP (计算语言学与语音处理中心); HLTCOE (人机语言技术研究中心)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing approaches struggle with word reordering and performance degradation when jointly modeling ASR and ST, resulting in a gap with attention-based encoder-decoder (AED) models. Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. To ensure robust ST while preserving ASR performance, we use self-distillation with CTC consistency regularization. Moreover, we improve computational efficiency by incorporating best practices from ASR transducers, including a down-sampled hierarchical encoder, a stateless predictor, and a pruned transducer loss to reduce training complexity. Finally, we introduce a blank penalty during decoding, reducing deletions and improving translation quality. Our approach is evaluated on three conversational datasets (Arabic, Spanish, and Mandarin), achieving new state-of-the-art performance among NT models and substantially narrowing the gap with AED-based systems.
zh

[NLP-117] BabyLMs First Constructions: Causal interventions provide a signal of learning

【速读】: 该论文试图解决的问题是:尽管生成式 AI (Generative AI) 模型在语言处理任务中表现出对构造(construction)的敏感性,但这些模型通常训练于远超人类儿童发展过程中所能接触的数据量,这引发了其在模拟人类语言学习方面的适用性争议。该研究的关键解决方案是采用 Rozner 等人(2025)的方法,评估 2024 年 BabyLM 挑战中模型的构造学习能力,并证明即使在使用发展上合理的数据量训练下,模型仍能表征多样化的构造,包括表面难以区分的复杂构造。此外,研究还发现构造表征与模型在 BabyLM 基准测试中的表现之间存在相关性,表明构造学习可能在功能上具有重要性。

链接: https://arxiv.org/abs/2506.02147
作者: Joshua Rozner,Leonie Weissweiler,Cory Shain
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Construction grammar posits that children acquire constructions (form-meaning pairings) from the statistics of their environment. Recent work supports this hypothesis by showing sensitivity to constructions in pretrained language models (PLMs), including one recent study (Rozner et al., 2025) demonstrating that constructions shape the PLM’s output distribution. However, models under study have generally been trained on developmentally implausible amounts of data, casting doubt on their relevance to human language learning. Here we use Rozner et al.'s methods to evaluate constructional learning in models from the 2024 BabyLM challenge. Our results show that even when trained on developmentally plausible quantities of data, models represent diverse constructions, even hard cases that are superficially indistinguishable. We further find correlational evidence that constructional performance may be functionally relevant: models that better represent constructions perform better on the BabyLM benchmarks.
zh

[NLP-118] Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

【速读】: 该论文试图解决当前大型变压器语言模型(Large Transformer-based Language Models)在编码语言信息方面的机制理解不足的问题,特别是针对现代模型如何表示词法身份和屈折形态的差异。其解决方案的关键在于通过在层级激活上训练线性和非线性分类器,以预测词干和屈折特征,从而分析不同模型在不同层次中对语言信息的表征方式。研究发现,模型在早期层中以线性方式集中词法信息,而在后期层中逐渐呈现非线性特征,而屈折信息则在整个层次中保持均匀可访问且线性可分。

链接: https://arxiv.org/abs/2506.02132
作者: Michael Li,Nishant Subramani
机构: Carnegie Mellon University - Language Technologies Institute (卡内基梅隆大学-语言技术研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how both classical architectures (BERT, DeBERTa, GPT-2) and contemporary large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) represent lexical identity and inflectional morphology. We train linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. We discover that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout the layers. Further analysis reveals that these models encode inflectional morphology through generalizable abstractions, but rely predominantly on memorization to encode lexical identity. Remarkably, these patterns emerge across all 16 models we test, despite differences in architecture, size, and training regime (including pretrained and instruction-tuned variants). This consistency suggests that, despite substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties could be fundamental for next token prediction and are learned early during pretraining. Our code is available at this https URL.
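The layer-wise probing setup is straightforward to reproduce in outline: fit a classifier per layer and read its accuracy as how accessible the property is at that depth. A minimal sketch, assuming activations have already been extracted:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(activations, labels):
    """activations: (n_tokens, hidden_dim) for one layer; labels: lemma or
    inflectional-feature ids. Returns mean cross-validated probe accuracy."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, activations, labels, cv=5).mean()
```

Comparing this linear probe against a small nonlinear probe (e.g., an MLP), layer by layer, yields the linear-versus-nonlinear pattern the paper reports.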
zh

[NLP-119] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

【速读】: 该论文试图解决大型语言模型在复杂任务中内部推理过程的质量与透明度问题,特别是在医学和数学领域中的逐步推理能力。其解决方案的关键在于引入一个细粒度的评估框架,该框架将思维轨迹显式分解为知识和推理两部分,并通过知识指数(Knowledge Index, KI)和信息增益(Information Gain, InfoGain)分别衡量知识的正确性和推理的质量。基于此框架,研究分析了经过监督微调(SFT)和/或强化学习(RL)训练的R1蒸馏模型和基础Qwen模型在不同领域的表现,揭示了模型在知识传递、推理质量及领域适应性方面的关键问题与改进路径。

链接: https://arxiv.org/abs/2506.02126
作者: Juncheng Wu,Sheng Liu,Haoqin Tu,Hang Yu,Xiaoke Huang,James Zou,Cihang Xie,Yuyin Zhou
机构: UC Santa Cruz (加州大学圣克鲁兹分校); Stanford University (斯坦福大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, preprint

点击查看摘要

Abstract:Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
zh

[NLP-120] SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

【速读】: 该论文试图解决如何通过合成强化学习(Reinforcement Learning, RL)数据来进一步提升基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Reward, RLVR)的效果问题。其解决方案的关键在于提出了一种名为SynthRL的可扩展且保证质量的数据生成流水线,该流水线包含三个核心阶段:选择具有适当分布的种子问题、在保持原始答案不变的前提下将种子问题增强为更具挑战性的变体,以及一个保证验证阶段以确保近似完美的正确性和难度提升。

链接: https://arxiv.org/abs/2506.02096
作者: Zijian Wu,Jinjie Ni,Xiangyan Liu,Zichen Liu,Hang Yan,Michael Qizhe Shieh
机构: National University of Singapore (新加坡国立大学); The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose SynthRL, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.
zh

[NLP-121] Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025

【速读】: 该论文旨在解决在自然、非结构化语音环境下进行情绪识别(Speech Emotion Recognition, SER)的挑战,特别是由于情感表达的细微性和真实音频的不可预测性。其解决方案的关键在于结合先进的音频模型与通过韵律和频谱线索增强的文本特征,同时探索基频(Fundamental Frequency, F0)量化和预训练音频标记模型的有效性,并采用集成模型以提高系统的鲁棒性。

链接: https://arxiv.org/abs/2506.02088
作者: Alef Iury Siqueira Ferreira,Lucas Rafael Gris,Alexandre Ferro Filho,Lucas Ólives,Daniel Ribeiro,Luiz Fernando,Fernanda Lustosa,Rodrigo Tanaka,Frederico Santos de Oliveira,Arlindo Galvão Filho
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training SER models in natural, spontaneous speech is especially challenging due to the subtle expression of emotions and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of Fundamental Frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ an ensemble model to improve robustness. On the official test set, our system achieved a Macro F1-score of 39.79% (42.20% on validation). Our results underscore the potential of these methods, and analysis of fusion techniques confirmed the effectiveness of Graph Attention Networks. Our source code is publicly available.
zh

[NLP-122] Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion INTERSPEECH2025

【速读】: 该论文旨在解决音频深度伪造(audio deepfakes)中源系统追踪(source system tracing)的问题,即识别音频信号的生成设备或算法来源。其解决方案的关键在于结合了深度度量多类N-对损失(deep metric multi-class N-pair loss)、真实强调与虚假分散框架(Real Emphasis and Fake Dispersion framework)、Conformer分类网络以及集成评分嵌入融合(ensemble score-embedding fusion)。其中,N-对损失提升了模型的判别能力,而真实强调与虚假分散框架通过区分真实与虚假语音模式增强了系统的鲁棒性;Conformer网络则有效捕捉音频信号中的全局与局部依赖关系,从而提升源追踪性能。

链接: https://arxiv.org/abs/2506.02085
作者: Ajinkya Kulkarni,Sandipana Dowerah,Tanel Alumae,Mathew Magimai.-Doss
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025, Netherlands

点击查看摘要

Abstract:Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining deep metric multi-class N-pair loss with Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using Frechet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
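A multi-class N-pair-style objective can be sketched as a softmax over pairwise similarities, pulling same-source embeddings together and pushing the rest apart; the temperature and normalization choices here are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def npair_style_loss(embeddings, source_labels, temperature=0.1):
    """embeddings: [N, D]; source_labels: [N] generator-system ids."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature
    sim.fill_diagonal_(float("-inf"))                       # exclude self-similarity
    same = (source_labels.unsqueeze(0) == source_labels.unsqueeze(1)).float()
    same.fill_diagonal_(0.0)
    log_prob = F.log_softmax(sim, dim=1)
    # average log-probability assigned to same-source pairs, per anchor
    masked = torch.where(same.bool(), log_prob, torch.zeros_like(log_prob))
    per_anchor = masked.sum(1) / same.sum(1).clamp(min=1)
    return -per_anchor.mean()
```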
zh

[NLP-123] Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition ACL2025

【Quick Read】: This paper addresses a weakness of existing joint optimization methods for compressing Large Language Models (LLMs): they struggle to balance quantization against low-rank approximation. During iterative optimization, conventional methods tend to favor one component at the other's expense, producing suboptimal decompositions. The key to the solution is Outlier-Driven Low-Rank Initialization (ODLRI), which assigns the low-rank component the dedicated role of capturing activation-sensitive weights, thereby mitigating the impact of outliers on quantization and achieving a better balance between the two components.

Link: https://arxiv.org/abs/2506.02077
Authors: Yoonjun Cho, Soeun Kim, Dongjae Jeon, Kyelim Lee, Beomsoo Lee, Albert No
Affiliations: Yonsei University; Hongik University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2025

Abstract:Decomposing weight matrices into quantization and low-rank components ($\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component’s unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers’ negative impact on quantization, enabling more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
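
The abstract only names the idea, so the following is an illustrative sketch of an activation-aware W ≈ Q + L·R split in the spirit of ODLRI, not the paper's algorithm: weight columns are scaled by per-channel activation magnitudes so the SVD concentrates on activation-sensitive directions, and the residual is quantized.

```python
import torch

def activation_aware_decompose(W, act_scale, rank=64, n_bits=4):
    """Split W (out x in) into quantized Q and low-rank factors L, R,
    letting L @ R absorb activation-sensitive (outlier-heavy) directions."""
    W_scaled = W * act_scale.unsqueeze(0)          # emphasize sensitive columns
    U, S, Vh = torch.linalg.svd(W_scaled, full_matrices=False)
    L = U[:, :rank] * S[:rank]                     # (out, rank)
    R = Vh[:rank] / act_scale.unsqueeze(0)         # undo the scaling: L @ R ≈ W
    residual = W - L @ R
    # Plain symmetric round-to-nearest quantization of the residual.
    qmax = 2 ** (n_bits - 1) - 1
    scale = residual.abs().max() / qmax
    Q = torch.round(residual / scale).clamp(-qmax - 1, qmax) * scale
    return Q, L, R
```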

[NLP-124] Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition INTERSPEECH2025

【Quick Read】: This paper addresses the challenge that Speech Emotion Recognition (SER) faces in Low-Resource Languages (LRLs), mainly due to the scarcity of annotated data. The key to the solution is unsupervised learning, in particular self-supervised techniques such as contrastive learning and Bootstrap Your Own Latent (BYOL), to improve cross-lingual generalization. These methods yield F1-score gains of 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla, confirming their effectiveness for LRLs.

Link: https://arxiv.org/abs/2506.02059
Authors: Ziwei Gong, Pengyuan Shi, Kaan Donbekci, Lin Ai, Run Chen, David Sasu, Zehui Wu, Julia Hirschberg
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2025

Abstract:Speech Emotion Recognition (SER) has seen significant progress with deep learning, yet remains challenging for Low-Resource Languages (LRLs) due to the scarcity of annotated data. In this work, we explore unsupervised learning to improve SER in low-resource settings. Specifically, we investigate contrastive learning (CL) and Bootstrap Your Own Latent (BYOL) as self-supervised approaches to enhance cross-lingual generalization. Our methods achieve notable F1 score improvements of 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla, demonstrating their effectiveness in LRLs. Additionally, we analyze model behavior to provide insights on key factors influencing performance across languages, and also highlighting challenges in low-resource SER. This work provides a foundation for developing more inclusive, explainable, and robust emotion recognition systems for underrepresented languages.
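
For readers unfamiliar with BYOL, its two ingredients are a regression loss between an online prediction and a stop-gradient target projection, plus an exponential-moving-average update of the target network. A generic sketch follows; applying it to speech-emotion encoders and the tau value are our assumptions.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """2 - 2 * cosine similarity; gradients never flow into the target."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """Slowly track the online network with the target network."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_((1 - tau) * o)
```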

[NLP-125] Evaluating the Unseen Capabilities: How Many Theorems Do LLM s Know?

【Quick Read】: This paper addresses an inconsistency in current evaluations of large language models (LLMs): existing evaluation methods fail to reflect what the models can actually do. The key to the solution is KnowSum, a statistical framework that provides a more comprehensive assessment by quantifying the unseen knowledge in a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances, revealing the substantial amount of knowledge that is missed when relying solely on observed model performance.

Link: https://arxiv.org/abs/2506.02058
Authors: Xiang Li, Jiayi Xin, Qi Long, Weijie J. Su
Affiliations: University of Pennsylvania
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments:

Abstract:Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often inconsistently reflect the actual capacities of these models. In this paper, we demonstrate that one of many contributing factors to this “evaluation crisis” is the oversight of unseen knowledge – information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information retrieval effectiveness, and measuring output diversity. Our experiments reveal that a substantial volume of knowledge is omitted when relying solely on observed LLM performance. Importantly, KnowSum yields significantly different comparative rankings for several common LLMs based on their internal knowledge.
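
The abstract does not spell out KnowSum's estimator, but extrapolating unseen mass from appearance frequencies is a classical statistical move. As a hedged illustration only, a Chao1-style richness estimate uses the counts of items seen exactly once (f1) and twice (f2):

```python
from collections import Counter

def chao1_estimate(observed_items):
    """Estimate total #distinct items as observed + f1^2 / (2 * f2)."""
    freq = Counter(observed_items)
    f1 = sum(1 for c in freq.values() if c == 1)   # singletons
    f2 = sum(1 for c in freq.values() if c == 2)   # doubletons
    unseen = f1 * (f1 - 1) / 2 if f2 == 0 else f1 * f1 / (2 * f2)
    return len(freq) + unseen

# e.g. theorem names extracted from many sampled generations of an LLM:
samples = ["pythagoras", "pythagoras", "rolle", "rolle", "bayes", "fermat"]
print(chao1_estimate(samples))   # 4 observed kinds + estimated unseen mass
```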

[NLP-126] Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody INTERSPEECH2025

【Quick Read】: This paper addresses intent ambiguity when robots interpret and execute spoken instructions. Traditional pipelines transcribe speech to text via speech recognition, often discarding the prosodic cues that are crucial for disambiguating intent. The key to the solution is leveraging speech prosody directly to infer and resolve instruction intent, and integrating the predicted intents into large language models via in-context learning to disambiguate between candidate task plans and select the appropriate one.

Link: https://arxiv.org/abs/2506.02057
Authors: David Sasu, Kweku Andoh Yamoah, Benedict Quartey, Natalie Schluter
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to Interspeech 2025

Abstract:Enabling robots to accurately interpret and execute spoken language instructions is essential for effective human-robot collaboration. Traditional methods rely on speech recognition to transcribe speech into text, often discarding crucial prosodic cues needed for disambiguating intent. We propose a novel approach that directly leverages speech prosody to infer and resolve instruction intent. Predicted intents are integrated into large language models via in-context learning to disambiguate and select appropriate task plans. Additionally, we present the first ambiguous speech dataset for robotics, designed to advance research in speech disambiguation. Our method achieves 95.79% accuracy in detecting referent intents within an utterance and determines the intended task plan of ambiguous instructions with 71.96% accuracy, demonstrating its potential to significantly improve human-robot communication.

[NLP-127] Enhancing Multimodal Continual Instruction Tuning with BranchLoRA ACL2025

【Quick Read】: This paper addresses the performance degradation caused by Catastrophic Forgetting (CF) in Multimodal Continual Instruction Tuning (MCIT). Existing approaches fine-tune within a Mixture-of-Experts (MoE) LoRA framework but naively aggregate all LoRA blocks by summation, which makes it hard for the model to stay aligned with earlier tasks. The key to the proposed BranchLoRA framework is an asymmetric structure with a flexible tuning-freezing mechanism, letting individual branches specialize in intra-task knowledge while fostering inter-task collaboration, together with incrementally added task-specific routers that optimize branch allocation over time and thereby mitigate CF.

Link: https://arxiv.org/abs/2506.02041
Authors: Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, Jinfeng Bai
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Institute of Automation, Chinese Academy of Sciences; Kyoto University; Tencent AI Lab; Tomorrow Advancing Life
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2025 Main Conference

Abstract:Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.
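
To see why routing differs from the plain summation the paper criticizes, compare the two aggregation rules in the schematic below. Dimensions, soft routing, and initialization are our own choices; BranchLoRA's actual asymmetric design and tuning-freezing mechanism are more involved than this.

```python
import torch
import torch.nn as nn

class RoutedLoRA(nn.Module):
    """LoRA branches combined by a router instead of summed uniformly."""
    def __init__(self, d_in, d_out, rank=8, n_branches=4):
        super().__init__()
        self.A = nn.ParameterList([nn.Parameter(torch.randn(d_in, rank) * 0.01)
                                   for _ in range(n_branches)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(rank, d_out))
                                   for _ in range(n_branches)])
        self.router = nn.Linear(d_in, n_branches)

    def forward(self, x):
        # MoELoRA-style summation would be: sum(x @ A @ B for A, B in ...).
        gate = self.router(x).softmax(dim=-1)                 # (..., n_branches)
        outs = torch.stack([x @ A @ B
                            for A, B in zip(self.A, self.B)], dim=-1)
        return (outs * gate.unsqueeze(-2)).sum(dim=-1)        # routed mix
```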

[NLP-128] FinS-Pilot: A Benchmark for Online Financial System

【Quick Read】: This paper addresses the lack of benchmarks for evaluating retrieval-augmented generation (RAG) systems in finance, where data confidentiality and dynamic data integration impose particular constraints. The key to the solution is FinS-Pilot, a new benchmark for RAG systems in online financial applications. Built from real financial-assistant interactions, it incorporates both real-time API data and structured text sources, organized through an intent classification framework covering key financial domains, enabling a comprehensive evaluation of financial assistants' ability to handle both static knowledge and time-sensitive market information.

Link: https://arxiv.org/abs/2506.02037
Authors: Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various professional domains, with their performance typically evaluated through standardized benchmarks. However, the development of financial RAG benchmarks has been constrained by data confidentiality issues and the lack of dynamic data integration. To address this issue, we introduce FinS-Pilot, a novel benchmark for evaluating RAG systems in online financial applications. Constructed from real-world financial assistant interactions, our benchmark incorporates both real-time API data and structured text sources, organized through an intent classification framework covering critical financial domains such as equity analysis and macroeconomic forecasting. The benchmark enables comprehensive evaluation of financial assistants’ capabilities in handling both static knowledge and time-sensitive market information. Through systematic experiments with multiple Chinese leading LLMs, we demonstrate FinS-Pilot’s effectiveness in identifying models suitable for financial applications while addressing the current gap in specialized evaluation tools for the financial domain. Our work contributes both a practical evaluation framework and a curated dataset to advance research in financial NLP systems. The code and dataset are accessible on GitHub (this https URL).

[NLP-129] ChatCFD: an End-to-End CFD Agent with Domain-specific Structured Thinking

【Quick Read】: This paper aims to overcome the operational complexity and deep expertise that Computational Fluid Dynamics (CFD) normally requires. The key to the solution is ChatCFD, a large-language-model-driven pipeline that automates CFD workflows within the OpenFOAM framework, letting users configure and execute complex simulations from natural language prompts or published literature. Its core innovations are a structured approach to database construction, configuration validation, and error feedback, combining CFD and OpenFOAM knowledge with general-purpose language models to improve accuracy and adaptability.

Link: https://arxiv.org/abs/2506.02019
Authors: E Fan, Weizong Wang, Tianhan Zhang
Affiliations: Southern University of Science and Technology; Beihang University; Ministry of Education; Beijing Key Laboratory of High Efficiency Spacecraft Propulsion Technology
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 8 figures

Abstract:Computational Fluid Dynamics (CFD) is essential for scientific and engineering advancements but is limited by operational complexity and the need for extensive expertise. This paper presents ChatCFD, a large language model-driven pipeline that automates CFD workflows within the OpenFOAM framework. It enables users to configure and execute complex simulations from natural language prompts or published literature with minimal expertise. The innovation is its structured approach to database construction, configuration validation, and error reflection, integrating CFD and OpenFOAM knowledge with general language models to improve accuracy and adaptability. Validation shows ChatCFD can autonomously reproduce published CFD results, handling complex, unseen configurations beyond basic examples, a task challenging for general language models.

[NLP-130] Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

【Quick Read】: This paper addresses the misalignment between existing paraphrase-type generation methods and human preferences, which stems from reliance on automated metrics and limited human-annotated training data and obscures key aspects of semantic fidelity and linguistic transformation. The key to the solution is leveraging a human-ranked paraphrase-type dataset and applying Direct Preference Optimization (DPO) so that model outputs align directly with human judgments, improving both paraphrase-generation accuracy and human preference ratings.

Link: https://arxiv.org/abs/2506.02018
Authors: Christopher Lee Lübbers
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 11 figures. Master's thesis, University of Goettingen, December 2025. Code: this https URL. Models: this https URL

Abstract:Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
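
The DPO objective used here is the standard one: it raises the policy's margin for the human-preferred paraphrase over the rejected one, relative to a frozen reference model. A minimal sketch (beta and the summed-log-prob inputs are conventional choices, not details taken from the thesis):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Inputs are summed sequence log-probs under the policy (pi) and the
    frozen reference model for chosen/rejected paraphrases."""
    chosen_margin = pi_chosen_logp - ref_chosen_logp
    rejected_margin = pi_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```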

[NLP-131] Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT

【Quick Read】: This paper addresses the persistent challenge that figurative language poses for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. The key to the solution is a hybrid model that combines pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier, made more efficient through a gradient-based attention-head pruning strategy. The approach reaches 78% accuracy on metaphor classification, and the same pruning strategy extends to an idiom classification task at 83% accuracy.

Link: https://arxiv.org/abs/2506.02005
Authors: Timothy Do, Pranav Saran, Harshita Poojary, Pranav Prabhu, Sean O'Brien, Vasu Sharma, Kevin Zhu
Affiliations: Algoverse AI Research
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures

Abstract:In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model’s efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also applied our pruning approach to expand on an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.
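
The described architecture (mBERT encoder, then a BiLSTM, then a linear classifier) is simple enough to sketch directly. The hidden size, pooling on the first token position, and the exact checkpoint name are our assumptions:

```python
import torch.nn as nn
from transformers import AutoModel

class MBertBiLstmClassifier(nn.Module):
    """mBERT token embeddings -> BiLSTM -> linear head for idiom/metaphor labels."""
    def __init__(self, n_labels=2, lstm_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden, n_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(states)
        return self.head(lstm_out[:, 0])   # classify from the first position
```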

[NLP-132] NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

【Quick Read】: This paper addresses the poor performance of current large language models (LLMs) on long-context question answering that requires multi-hop reasoning, especially when contexts run to tens of thousands of tokens. The key to the solution is NovelHopQA, the first benchmark to jointly vary context length and reasoning depth in natural narrative settings: a keyword-guided pipeline builds hop-separated reasoning chains grounded in coherent storylines, providing a controlled diagnostic setting for evaluating and stress-testing multi-hop reasoning.

Link: https://arxiv.org/abs/2506.02000
Authors: Abhay Gupta, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
Affiliations: Algoverse AI Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate k=1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate six state-of-the-art (SOTA) models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We noticed consistent accuracy drops with increased hops and context length, even in frontier models, revealing that sheer scale does not guarantee robust reasoning. Our failure mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to stress-test multi-hop reasoning at scale.

[NLP-133] Inter(sectional) Alia(s): Ambiguity in Voice Agent Identity via Intersectional Japanese Self-Referents

【Quick Read】: This paper examines the ethical questions raised when conversational agents are anthropomorphized with human social identity cues, and the contested assumption of identity neutrality in humanlike agents. The key to the study is probing how non-pronominal self-referents (NPSR) and voice, as a socially expressive medium, shape perceptions of agent identity. A crowdsourcing experiment finds strong voice gendering, while certain intersectional self-referents (such as boku and watakushi) can evade gendering through neutrality and elusiveness, offering a more nuanced and culturally sensitive way to express identity.

Link: https://arxiv.org/abs/2506.01998
Authors: Takao Fujii, Katie Seaborn, Madeleine Steeds, Jun Kato
Affiliations: Institute of Science Tokyo; University College Dublin; AIST
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: CHI '25

Abstract:Conversational agents that mimic people have raised questions about the ethics of anthropomorphizing machines with human social identity cues. Critics have also questioned assumptions of identity neutrality in humanlike agents. Recent work has revealed that intersectional Japanese pronouns can elicit complex and sometimes evasive impressions of agent identity. Yet, the role of other “neutral” non-pronominal self-referents (NPSR) and voice as a socially expressive medium remains unexplored. In a crowdsourcing study, Japanese participants (N = 204) evaluated three ChatGPT voices (Juniper, Breeze, and Ember) using seven self-referents. We found strong evidence of voice gendering alongside the potential of intersectional self-referents to evade gendering, i.e., ambiguity through neutrality and elusiveness. Notably, perceptions of age and formality intersected with gendering as per sociolinguistic theories, especially boku and watakushi. This work provides a nuanced take on agent identity perceptions and champions intersectional and culturally-sensitive work on voice agents.

[NLP-134] No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success NEURIPS2025

【Quick Read】: This paper revisits the practicality of deep active learning (AL) now that large language models (LLMs) can produce general-purpose representations. The key to the solution is using frozen LLM embeddings to avoid the computational cost of iteratively fine-tuning large backbones, while systematically studying how embedding quality affects query strategies. Using several top-performing embedding models across diverse text classification tasks, the study shows that the optimal query strategy is sensitive to embedding quality and stresses the importance of evaluating AL strategies per task.

Link: https://arxiv.org/abs/2506.01992
Authors: Lukas Rauch, Moritz Wirth, Denis Huseljic, Marek Herde, Bernhard Sick, Matthias Aßenmacher
Affiliations: University of Kassel, Germany; LMU Munich, Germany
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: under review @NeurIPS2025

Abstract:The advent of large language models (LLMs) capable of producing general-purpose representations lets us revisit the practicality of deep active learning (AL): By leveraging frozen LLM embeddings, we can mitigate the computational costs of iteratively fine-tuning large backbones. This study establishes a benchmark and systematically investigates the influence of LLM embedding quality on query strategies in deep AL. We employ five top-performing models from the massive text embedding benchmark (MTEB) leaderboard and two baselines for ten diverse text classification tasks. Our findings reveal key insights: First, initializing the labeled pool using diversity-based sampling synergizes with high-quality embeddings, boosting performance in early AL iterations. Second, the choice of the optimal query strategy is sensitive to embedding quality. While the computationally inexpensive Margin sampling can achieve performance spikes on specific datasets, we find that strategies like Badge exhibit greater robustness across tasks. Importantly, their effectiveness is often enhanced when paired with higher-quality embeddings. Our results emphasize the need for context-specific evaluation of AL strategies, as performance heavily depends on embedding quality and the target task.
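
Margin sampling, the inexpensive strategy highlighted above, only needs class probabilities from a cheap classifier fit on the frozen embeddings. A sketch:

```python
import numpy as np

def margin_sampling(probs, k):
    """Return indices of the k most ambiguous points: smallest gap
    between the top-2 class probabilities. probs: (n, n_classes)."""
    sorted_probs = np.sort(probs, axis=1)              # ascending per row
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margin)[:k]
```

Badge, by contrast, clusters gradient embeddings with a k-means++-style seeding, which is why it tends to be more robust across tasks but costlier per round.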

[NLP-135] urning LLM Activations Quantization-Friendly

【Quick Read】: This paper addresses the quantization error introduced by significant outliers in the parameters and activations of Large Language Models (LLMs). The key to the solution is analyzing how these outliers affect layer-wise quantization error and how smoothing and rotation transform the observed values, then proposing a new channel-magnitude-based metric to measure and visualize quantization difficulty, together with a hybrid approach that applies channel-wise scaling before rotation, supported by a mathematical formulation of its benefits.

Link: https://arxiv.org/abs/2506.01967
Authors: Patrik Czakó, Gábor Kertész, Sándor Szénási
Affiliations: Obuda University; John von Neumann Faculty of Informatics
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 5 figures. Accepted to SACI 2025 conference proceedings

Abstract:Quantization effectively reduces the serving costs of Large Language Models (LLMs) by speeding up data movement through compressed parameters and enabling faster operations via integer arithmetic. However, activating integer arithmetic requires quantizing both weights and activations, which poses challenges due to the significant outliers in LLMs that increase quantization error. In this work, we investigate these outliers with an emphasis on their effect on layer-wise quantization error, then examine how smoothing and rotation transform the observed values. Our primary contributions include introducing a new metric to measure and visualize quantization difficulty based on channel magnitudes, as well as proposing a hybrid approach that applies channel-wise scaling before rotation, supported by a mathematical formulation of its benefits.
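
The "channel-wise scaling before rotation" composition can be sketched generically; the paper's specific metric and transform details are not reproduced here. Smoothing migrates activation outliers into the weights, and a random orthogonal rotation then spreads the remaining outliers across channels. For exactness, the activations feeding this layer must be transformed inversely (x -> (x / s) Q):

```python
import torch

def scale_then_rotate(W, act_scale, seed=0):
    """W: (out, in); act_scale: per-input-channel smoothing factors."""
    W_s = W * act_scale.unsqueeze(0)            # weights absorb the scale
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(W.shape[1], W.shape[1], generator=g))
    return W_s @ Q                              # rotate the input dimension
```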

[NLP-136] Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

【Quick Read】: This paper targets the computational and memory inefficiency of large language models (LLMs) over extremely long context windows, in particular the quadratic cost of self-attention in Transformer architectures. The key to the solution is abandoning attention in favor of a set of complementary components: State Space blocks whose continuous-time convolution kernels scale near-linearly with sequence length, Multi-Resolution Convolution layers that capture local context at different dilation levels, a lightweight Recurrent Supervisor that maintains a global hidden state across sequence chunks, and a Retrieval-Augmented External Memory that stores and retrieves high-level chunk embeddings without reintroducing quadratic operations.

Link: https://arxiv.org/abs/2506.01963
Authors: Andrew Kiruluta, Preethi Raju, Priscilla Burity
Affiliations: UC Berkeley, CA
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:We present a novel non attention based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely. Instead, it combines the following complementary components: State Space blocks (inspired by S4) that learn continuous time convolution kernels and scale near linearly with sequence length, Multi Resolution Convolution layers that capture local context at different dilation levels, a lightweight Recurrent Supervisor to maintain a global hidden state across sequential chunks, and Retrieval Augmented External Memory that stores and retrieves high-level chunk embeddings without reintroducing quadratic operations.

[NLP-137] Research on Medical Named Entity Identification Based On Prompt-Biomrc Model and Its Application in Intelligent Consultation System

【Quick Read】: This paper aims to improve the precision and efficiency of Named Entity Recognition (NER) in the medical domain. The key to the solution is the Prompt-bioMRC model, which combines hard-template and soft-prompt designs to sharpen medical entity recognition. Experiments across multiple medical datasets show the approach outperforming traditional models, providing reliable technical support for applications such as intelligent diagnosis systems.

Link: https://arxiv.org/abs/2506.01961
Authors: Jinzhu Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study is dedicated to exploring the application of prompt learning methods to advance Named Entity Recognition (NER) within the medical domain. In recent years, the emergence of large-scale models has driven significant progress in NER tasks, particularly with the introduction of the BioBERT language model, which has greatly enhanced NER capabilities in medical texts. Our research introduces the Prompt-bioMRC model, which integrates both hard template and soft prompt designs aimed at refining the precision and efficiency of medical entity recognition. Through extensive experimentation across diverse medical datasets, our findings consistently demonstrate that our approach surpasses traditional models. This enhancement not only validates the efficacy of our methodology but also highlights its potential to provide reliable technological support for applications like intelligent diagnosis systems. By leveraging advanced NER techniques, this study contributes to advancing automated medical data processing, facilitating more accurate medical information extraction, and supporting efficient healthcare decision-making processes.

[NLP-138] Generate Not Recommend: Personalized Multimodal Content Generation

【Quick Read】: This paper addresses information overload: recommender systems filter massive web content into personalized results, but because they can only select from existing items and cannot generate new concepts, they fall short of fully satisfying user needs and preferences. The paper proposes a new paradigm that directly generates multimodal personalized items (such as images) tailored to individual users. The key to the solution is leveraging any-to-any Large Multimodal Models (LMMs), trained with both supervised fine-tuning and an online reinforcement learning strategy, so they can produce tailored next items for users.

Link: https://arxiv.org/abs/2506.01704
Authors: Jiongnan Liu, Zhicheng Dou, Ning Hu, Chenyan Xiong
Affiliations: Carnegie Mellon University; Renmin University of China; Serendipity One Inc.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:To address the challenge of information overload from massive web contents, recommender systems are widely applied to retrieve and present personalized results for users. However, recommendation tasks are inherently constrained to filtering existing items and lack the ability to generate novel concepts, limiting their capacity to fully satisfy user demands and preferences. In this paper, we propose a new paradigm that goes beyond content filtering and selecting: directly generating personalized items in a multimodal form, such as images, tailored to individual users. To accomplish this, we leverage any-to-any Large Multimodal Models (LMMs) and train them in both supervised fine-tuning and online reinforcement learning strategy to equip them with the ability to yield tailored next items for users. Experiments on two benchmark datasets and user study confirm the efficacy of the proposed method. Notably, the generated images not only align well with users’ historical preferences but also exhibit relevance to their potential future interests.

[NLP-139] DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition INTERSPEECH2025

【Quick Read】: This paper addresses contextual biasing (CB) for rare and unseen phrases in automatic speech recognition while also improving inference speed. Existing dynamic-vocabulary approaches raise CB accuracy but suffer from slow inference. The key to the proposed DYNAC (Dynamic Vocabulary-based NAR Contextualization) is a self-conditioned CTC method that integrates the dynamic vocabulary into intermediate layers: conditioning the encoder on the dynamic vocabulary captures dependencies between static and dynamic tokens, substantially reducing the real-time factor (RTF) while preserving recognition performance.

Link: https://arxiv.org/abs/2506.00422
Authors: Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Abstract:Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but with slow inference speed. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. Conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.

[NLP-140] IMPersona: Evaluating Individual Level LM Impersonation

【Quick Read】: This paper asks to what extent language models (LMs) can imitate a specific individual's writing style and personal knowledge. The key to the solution is the IMPersona framework, which combines supervised fine-tuning with a hierarchical, memory-inspired retrieval system, enabling even modestly sized open-source models such as Llama-3.1-8B-Instruct to reach concerning levels of impersonation. In blind conversation experiments, the fine-tuned, memory-integrated models were (mis)identified as human in 44.44% of interactions, well above the best prompting-based approach.

Link: https://arxiv.org/abs/2504.04332
Authors: Quan Shi, Carlos E. Jimenez, Stephen Dong, Brian Seo, Caden Yao, Adam Kelch, Karthik Narasimhan
Affiliations: Princeton Language and Intelligence, Princeton University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 25 pages, 9 pages main

Abstract:As language models achieve increasingly human-like capabilities in conversational text generation, a critical question emerges: to what extent can these systems simulate the characteristics of specific individuals? To evaluate this, we introduce IMPersona, a framework for evaluating LMs at impersonating specific individuals’ writing style and personal knowledge. Using supervised fine-tuning and a hierarchical memory-inspired retrieval system, we demonstrate that even modestly sized open-source models, such as Llama-3.1-8B-Instruct, can achieve impersonation abilities at concerning levels. In blind conversation experiments, participants (mis)identified our fine-tuned models with memory integration as human in 44.44% of interactions, compared to just 25.00% for the best prompting-based approach. We analyze these results to propose detection methods and defense strategies against such impersonation attempts. Our findings raise important questions about both the potential applications and risks of personalized language models, particularly regarding privacy, security, and the ethical deployment of such technologies in real-world contexts.

[NLP-141] An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models

【Quick Read】: This paper asks whether noise-like input can elicit structured responses from a language model in the absence of any explicit symbolic encoding. Rather than assuming that extraterrestrial signals must be decoded, it treats structured model output as a sign of underlying regularity in the input. The key to the solution is a composite metric, Semantic Induction Potential (SIP), which combines entropy, syntax coherence, compression gain, and a repetition penalty to score how strongly different acoustic inputs provoke structured output. Experiments show that whale and bird vocalizations elicit higher SIP scores than white noise, suggesting language models may detect latent structure even in data without conventional semantics.

Link: https://arxiv.org/abs/2506.02730
Authors: Po-Chieh Yu
Affiliations: Taiwan Astronomical Research Alliance (TARA), Taiwan; Institute of Astronomy and Astrophysics, Academia Sinica, Taipei, 10617, Taiwan
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation and Language (cs.CL)
Comments: submitted to the International Journal of Astrobiology

Abstract:We present an exploratory framework to test whether noise-like input can induce structured responses in language models. Instead of assuming that extraterrestrial signals must be decoded, we evaluate whether inputs can trigger linguistic behavior in generative systems. This shifts the focus from decoding to viewing structured output as a sign of underlying regularity in the input. We tested GPT-2 small, a 117M-parameter model trained on English text, using four types of acoustic input: human speech, humpback whale vocalizations, Phylloscopus trochilus birdsong, and algorithmically generated white noise. All inputs were treated as noise-like, without any assumed symbolic encoding. To assess reactivity, we defined a composite score called Semantic Induction Potential (SIP), combining entropy, syntax coherence, compression gain, and repetition penalty. Results showed that whale and bird vocalizations had higher SIP scores than white noise, while human speech triggered only moderate responses. This suggests that language models may detect latent structure even in data without conventional semantics. We propose that this approach could complement traditional SETI methods, especially in cases where communicative intent is unknown. Generative reactivity may offer a different way to identify data worth closer attention.
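
The abstract does not publish the exact SIP formula, so the following toy recombination of its four named ingredients is purely illustrative: token entropy, a stubbed syntax-coherence term, zlib compression gain, and a bigram repetition penalty, linearly combined with assumed weights.

```python
import math, zlib
from collections import Counter

def sip_score(text, weights=(1.0, 1.0, 1.0, 1.0)):
    """Toy composite in the spirit of Semantic Induction Potential."""
    tokens = text.split()
    total = len(tokens) or 1
    counts = Counter(tokens)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    raw = text.encode("utf-8")
    compression_gain = 1 - len(zlib.compress(raw)) / max(len(raw), 1)
    bigrams = list(zip(tokens, tokens[1:]))
    repetition = sum(c - 1 for c in Counter(bigrams).values()) / max(len(bigrams), 1)
    syntax_coherence = 0.0   # placeholder; would come from a parser in practice
    parts = (entropy, syntax_coherence, compression_gain, -repetition)
    return sum(w * p for w, p in zip(weights, parts))
```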

Computer Vision

[CV-0] IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

【Quick Read】: This paper addresses the lack of explicit geometric cues for controlling scene lighting and visual appearance across frames in diffusion-based video generation. The key to the solution is IllumiCraft, an end-to-end diffusion framework that accepts three complementary inputs: high-dynamic-range (HDR) video maps for detailed lighting control; synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) as appearance cues; and 3D point tracks that capture precise 3D geometry. By integrating lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts.

Link: https://arxiv.org/abs/2506.03150
Authors: Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Ronald Clark, Ming-Hsuan Yang
Affiliations: University of Oxford; UC Merced; NEC Labs America; Atmanity Inc.; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Tech Report

Abstract:Although diffusion-based models can generate high-quality and high-resolution video sequences from textual or image inputs, they lack explicit integration of geometric cues when controlling scene lighting and visual appearance across frames. To address this limitation, we propose IllumiCraft, an end-to-end diffusion framework accepting three complementary inputs: (1) high-dynamic-range (HDR) video maps for detailed lighting control; (2) synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) to provide appearance cues; and (3) 3D point tracks that capture precise 3D geometry information. By integrating the lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts. It supports background-conditioned and text-conditioned video relighting and provides better fidelity than existing controllable video generation methods. Project Page: this https URL

[CV-1] Self-Supervised Spatial Correspondence Across Modalities CVPR2025 WWW

【Quick Read】: This paper tackles cross-modal space-time correspondence: given two images from different visual modalities (e.g., an RGB image and a depth map), find pixel pairs that correspond to the same physical points in the scene. The key to the solution is extending the contrastive random walk framework to jointly learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting method needs no explicit photo-consistency assumptions and can be trained entirely on unlabeled data, without any spatially aligned multimodal image pairs.

Link: https://arxiv.org/abs/2506.03148
Authors: Ayush Shrivastava, Andrew Owens
Affiliations: University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project link: this https URL. Code: this https URL

Abstract:We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.

[CV-2] Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval

【Quick Read】: This paper addresses the weak scene-consistent memory of interactive long video generation: existing methods make limited use of historical context and therefore struggle to stay coherent over long horizons. The key to the solution is Context-as-Memory, which treats historical context as memory for video generation through two simple designs: storing context in frame format without extra post-processing, and conditioning by concatenating context with the frames to be predicted along the frame dimension at the input, with no external control modules. To curb the cost of attending to all history, a Memory Retrieval module selects the truly relevant context frames by computing FOV (Field of View) overlap between camera poses, sharply reducing the candidate set without significant information loss.

Link: https://arxiv.org/abs/2506.03141
Authors: Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Affiliations: The University of Hong Kong; Zhejiang University; Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. The link of our project page is this https URL.
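
The Memory Retrieval idea, shortlisting context frames by FOV overlap between camera poses, can be approximated with a simple angular test. The paper's exact overlap computation is not given here; this proxy treats two cameras as overlapping when their viewing directions are within one field of view of each other (camera-to-world 4x4 poses with OpenGL-style -Z forward are assumed):

```python
import numpy as np

def fov_overlap_score(pose_a, pose_b, fov_deg=90.0):
    """1.0 for aligned view directions, 0.0 once the gap exceeds the FOV."""
    dir_a, dir_b = -pose_a[:3, 2], -pose_b[:3, 2]   # optical axes
    cos = np.clip(dir_a @ dir_b /
                  (np.linalg.norm(dir_a) * np.linalg.norm(dir_b)), -1.0, 1.0)
    angle = np.degrees(np.arccos(cos))
    return max(0.0, 1.0 - angle / fov_deg)

def retrieve_memory(history_poses, current_pose, k=4):
    scores = [fov_overlap_score(p, current_pose) for p in history_poses]
    return np.argsort(scores)[::-1][:k]             # top-k context frames
```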

[CV-3] CamCloneMaster: Enabling Reference-based Camera Control for Video Generation

【Quick Read】: This paper addresses the burden of conventional camera control, which requires users to hand-construct explicit sequences of camera parameters, an awkward process for intricate camera movements. The key to the solution is the CamCloneMaster framework, which lets users replicate camera movements from reference videos without supplying camera parameters or fine-tuning at test time, giving a more intuitive form of camera control.

Link: https://arxiv.org/abs/2506.03140
Authors: Yawen Luo, Jianhong Bai, Xiaoyu Shi, Menghan Xia, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Tianfan Xue
Affiliations: The Chinese University of Hong Kong; Zhejiang University; Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Camera control is crucial for generating expressive and cinematic videos. Existing methods rely on explicit sequences of camera parameters as control conditions, which can be cumbersome for users to construct, particularly for intricate camera movements. To provide a more intuitive camera control method, we propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both Image-to-Video and Video-to-Video tasks within a unified framework. Furthermore, we present the Camera Clone Dataset, a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. Extensive experiments and user studies demonstrate that CamCloneMaster outperforms existing methods in terms of both camera controllability and visual quality.

[CV-4] SVGenius: Benchmarking LLM s in SVG Understanding Editing and Generation

【Quick Read】: This paper addresses shortcomings of current SVG-processing benchmarks: limited real-world coverage, no complexity stratification, and fragmented evaluation paradigms. The key to the solution is SVGenius, a comprehensive benchmark of 2,377 queries across three progressive dimensions (understanding, editing, and generation), built on real-world data from 24 application domains with systematic complexity stratification, and evaluating models through 8 task categories and 18 metrics.

Link: https://arxiv.org/abs/2506.03139
Authors: Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, Yueting Zhuang
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 4 figures, Project page: this https URL, Code: this https URL

Abstract:Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at this https URL.

[CV-5] Native-Resolution Image Synthesis

【Quick Read】: This paper addresses the fixed-resolution, square-image constraint of conventional generative models: how to synthesize images at arbitrary resolutions and aspect ratios. The key to the solution is the Native-resolution diffusion Transformer (NiT), an architecture that natively handles variable-length visual tokens and explicitly models varying resolutions and aspect ratios within its denoising process, breaking free of traditional format constraints.

Link: https://arxiv.org/abs/2506.03131
Authors: Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
Affiliations: MMLab, CUHK; Shanghai AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves the state-of-the-art performance on both ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.

[CV-6] AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

【Quick Read】: This paper addresses the lack of consistent character guidance when generating coherent multi-shot animated videos: existing public datasets focus on real-world scenes with global descriptions and provide no reference images for character consistency. The key to the solution is the AnimeShooter dataset, which achieves strong cross-shot visual consistency through an automated pipeline and offers hierarchical annotations at both the story level and the shot level, plus a synchronized audio subset, AnimeShooter-audio. The paper also introduces AnimeShooterGen, which combines Multimodal Large Language Models (MLLMs) with a video diffusion model: the reference image and previously generated shots are processed by the MLLM to condition the diffusion model's decoding of each subsequent shot, improving cross-shot visual consistency and adherence to the reference guidance.

Link: https://arxiv.org/abs/2506.03126
Authors: Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
Affiliations: The University of Hong Kong; ARC Lab, Tencent PCG
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project released at: this https URL

Abstract:Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlight the value of our dataset for coherent animated video generation.

[CV-7] DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

【Quick Read】: This paper addresses the degradation of temporal consistency and loss of appearance detail that occur when consistency models are applied to video diffusion models. The key to the solution is a parameter-efficient Dual-Expert Consistency Model (DCM): a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine-detail refinement. A temporal coherence loss improves motion consistency for the semantic expert, and GAN and feature-matching losses improve the detail expert's synthesis quality, achieving state-of-the-art visual results with far fewer sampling steps.

Link: https://arxiv.org/abs/2506.03123
Authors: Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu
Affiliations: Nanjing University; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory; University of Chinese Academy of Sciences; S-Lab, Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at this https URL.

[CV-8] Controllable Human-centric Keyframe Interpolation with Generative Prior

【Quick Read】: This paper addresses the difficulty existing interpolation methods have, absent 3D geometric guidance, in producing plausible intermediate frames for complex, articulated human motion, and their limited control over the synthesized dynamics. The key to the solution is the PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), which injects 3D human guidance signals into the diffusion process: a novel SMPL-X encoder transforms 3D geometry and shape into the 2D latent conditioning space, and a fusion network integrates these 3D cues with 2D pose embeddings, providing rich spatial and structural cues that improve interpolation accuracy.

Link: https://arxiv.org/abs/2506.03119
Authors: Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
Affiliations: S-Lab, Nanyang Technological University; SenseTime Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

[CV-9] HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers SIGGRAPH2025

【Quick Read】: This paper addresses the challenge of reconstructing and animating humans from monocular or sparse images, where existing methods typically rely on elaborate dense-view capture and time-consuming per-subject optimization. The key to the solution is HumanRAM, a novel feed-forward approach that unifies human reconstruction and animation in one framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM), enabling both high-quality reconstruction and high-fidelity pose-controlled animation.

Link: https://arxiv.org/abs/2506.03118
Authors: Zhiyuan Yu, Zhe Li, Hujun Bao, Can Yang, Xiaowei Zhou
Affiliations: Hong Kong University of Science and Technology; Huawei; State Key Laboratory of CAD&CG, Zhejiang University; Zhejiang University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH 2025 (Conference Track). Project page: this https URL

Abstract:3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at this https URL.

[CV-10] argeted Forgetting of Image Subgroups in CLIP Models

【Quick Read】: This paper addresses the risk that foundation models (FMs) inherit harmful or unwanted knowledge during unsupervised pre-training, undermining their reliability in real-world applications, and asks how to achieve fine-grained forgetting without access to the pre-training data. The key to the solution is a three-stage approach: a forgetting stage that fine-tunes on the samples to be forgotten, a reminding stage that restores performance on retained samples, and a restoring stage that recovers zero-shot capabilities via model souping. Knowledge distillation additionally handles the distribution gap between forgotten samples, retained samples, and unseen pre-training data, enabling selective forgetting without hurting overall performance.

Link: https://arxiv.org/abs/2506.03117
Authors: Zeliang Zhang, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Chenliang Xu
Affiliations: University of Rochester; Cisco Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 figures, 5 pages. The project page is this https URL

Abstract:Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, compromising their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-trained datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class, without access to pre-trained data, while preserving the model’s overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. It consists of (1) a forgetting stage to fine-tune the CLIP on samples to be forgotten, (2) a reminding stage to restore performance on retained samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. Additionally, we introduce knowledge distillation to handle the distribution disparity between forgetting, retaining samples, and unseen pre-trained data. Extensive experiments on CIFAR-10, ImageNet-1K, and style datasets demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on semantically similar subgroups and other categories, significantly outperforming baseline unlearning methods, which lose effectiveness under the CLIP unlearning setting.
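
Model souping, used in the restoring stage, is plain parameter averaging of checkpoints. A minimal sketch follows; uniform weights are assumed, and how the paper actually blends the unlearned and original checkpoints is its own choice.

```python
import torch

def model_soup(state_dicts, weights=None):
    """Average several checkpoints' parameters key by key."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    return {key: sum(w * sd[key].float()
                     for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# e.g. blend the unlearned model with the original CLIP weights:
# souped = model_soup([unlearned.state_dict(), clip.state_dict()])
```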

[CV-11] Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery

【Quick Read】: This paper addresses large-scale delineation of individual trees from remote sensing imagery, which matters for ecological research as climate change and other environmental pressures rapidly reshape forests worldwide. Current RGB tree segmentation relies on training specialized machine learning models on labeled tree datasets, which are hard to scale. The key to the solution is evaluating a state-of-the-art segmentation model, the Segment Anything Model 2 (SAM2), zero-shot for individual tree detection and segmentation on two tasks: (1) zero-shot segmentation and (2) zero-shot transfer using predictions from an existing tree detection model as prompts. The results show SAM2 not only generalizes impressively but also forms a natural synergy with specialized methods trained on in-domain labeled data.

Link: https://arxiv.org/abs/2506.03114
Authors: Michelle Chen, David Russell, Amritha Pallavoor, Derek Young, Jane Wu
Affiliations: UC Berkeley; UC Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Large-scale delineation of individual trees from remote sensing imagery is crucial to the advancement of ecological research, particularly as climate change and other environmental factors rapidly transform forest landscapes across the world. Current RGB tree segmentation methods rely on training specialized machine learning models with labeled tree datasets. While these learning-based approaches can outperform manual data collection when accurate, the existing models still depend on training data that’s hard to scale. In this paper, we investigate the efficacy of using a state-of-the-art image segmentation model, Segment Anything Model 2 (SAM2), in a zero-shot manner for individual tree detection and segmentation. We evaluate a pretrained SAM2 model on two tasks in this domain: (1) zero-shot segmentation and (2) zero-shot transfer by using predictions from an existing tree detection model as prompts. Our results suggest that SAM2 not only has impressive generalization capabilities, but also can form a natural synergy with specialized methods trained on in-domain labeled data. We find that applying large pretrained models to problems in remote sensing is a promising avenue for future progress. We make our code available at: this https URL.
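
Zero-shot transfer here means feeding an existing detector's boxes to SAM2 as prompts. Assuming the sam2 package's image-predictor interface (the checkpoint name and box layout below are illustrative, not details from the paper), the loop looks roughly like:

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def segment_trees(image_rgb, detected_boxes):
    """image_rgb: HxWx3 uint8; detected_boxes: (N, 4) xyxy tree detections."""
    predictor.set_image(image_rgb)
    tree_masks = []
    for box in detected_boxes:
        masks, scores, _ = predictor.predict(box=np.asarray(box),
                                             multimask_output=False)
        tree_masks.append(masks[0])   # one binary mask per detected tree
    return tree_masks
```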

[CV-12] Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning ICML2025

Quick read: This paper tackles Cross-Domain Few-Shot Learning (CDFSL), i.e., improving the generalization of the Vision Transformer (ViT) when target-domain data is scarce. The key to the solution is revisiting the role of image-token continuity in model generalization: disrupting the continuity of image tokens reduces the model's reliance on large-scale spatial patterns and thereby mitigates the domain gap. Building on this finding, the authors propose a simple yet effective method that further disrupts image-token continuity, encouraging the model to attend to smaller local patterns and improving performance on target domains.

Link: https://arxiv.org/abs/2506.03110
Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025 (spotlight)

Abstract:Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applied to distant downstream domains with only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token order, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels transition non-smoothly across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch can be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at this https URL.
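One simple way to break image-token continuity in the spirit described above (a sketch under assumptions, not necessarily the authors' exact operation) is to cut an image into ViT-sized patches and shuffle them, so pixels no longer transition smoothly across patch borders:

```python
import torch

def shuffle_patches(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Randomly permute the non-overlapping patches of a (C, H, W) image."""
    c, h, w = img.shape
    gh, gw = h // patch, w // patch
    x = img[:, : gh * patch, : gw * patch]
    tiles = x.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, gh, gw, p, p)
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(gh * gw, c, patch, patch)
    tiles = tiles[torch.randperm(gh * gw)]                      # break continuity
    tiles = tiles.reshape(gh, gw, c, patch, patch).permute(2, 0, 3, 1, 4)
    return tiles.reshape(c, gh * patch, gw * patch)

print(shuffle_patches(torch.rand(3, 224, 224)).shape)  # torch.Size([3, 224, 224])
```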

[CV-13] ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Quick read: This paper addresses instruction-guided image editing that reflects non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, a problem that remains underexplored in computer vision. Existing methods and datasets focus mainly on static scenes or rigid transformations and struggle with expressive edits involving dynamic motion. The key to the solution is the ByteMorph framework, whose core components are a large-scale dataset, ByteMorph-6M, and a strong baseline model named ByteMorpher built on the Diffusion Transformer (DiT); motion-guided data generation, layered compositing techniques, and automated captioning ensure the dataset's diversity, realism, and semantic coherence.

Link: https://arxiv.org/abs/2506.03107
Authors: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL Dataset: this https URL Benchmark: this https URL Code: this https URL Demo: this https URL

Abstract:Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.

[CV-14] DyTact: Capturing Dynamic Contacts in Hand-Object Manipulation

Quick read: This paper targets dynamic hand-object contact reconstruction, which is essential for realistic manipulation in AI character animation, extended reality (XR), and robotics, yet remains challenging due to heavy occlusions, complex surface details, and the limitations of existing capture techniques. The key to the solution is DyTact, a markerless, non-intrusive capture method that models complex manipulations with a dynamic, articulated representation based on 2D Gaussian surfels. Binding these surfels to MANO meshes lets DyTact harness the inductive bias of template models to stabilize and accelerate optimization; a refinement module handles time-dependent high-frequency deformations, and a contact-guided adaptive sampling strategy copes with heavy occlusion.

Link: https://arxiv.org/abs/2506.03103
Authors: Xiaoyan Cong, Angela Xing, Chandradeep Pokhariya, Rao Fu, Srinath Sridhar
Affiliations: Brown University; Indian Institute of Technology Delhi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dynamic hand-object contacts is essential for realistic manipulation in AI character animation, XR, and robotics, yet it remains challenging due to heavy occlusions, complex surface details, and limitations in existing capture techniques. In this paper, we introduce DyTact, a markerless capture method for accurately capturing dynamic contact in hand-object manipulations in a non-intrusive manner. Our approach leverages a dynamic, articulated representation based on 2D Gaussian surfels to model complex manipulations. By binding these surfels to MANO meshes, DyTact harnesses the inductive bias of template models to stabilize and accelerate optimization. A refinement module addresses time-dependent high-frequency deformations, while a contact-guided adaptive sampling strategy selectively increases surfel density in contact regions to handle heavy occlusion. Extensive experiments demonstrate that DyTact not only achieves state-of-the-art dynamic contact estimation accuracy but also significantly improves novel view synthesis quality, all while operating with fast optimization and efficient memory usage. Project Page: this https URL .

[CV-15] EgoVLM: Policy Optimization for Egocentric Video Understanding

Quick read: This paper addresses robust reasoning over egocentric video streams, a requirement of embodied AI applications such as wearable cameras and autonomous agents. The key to the solution is EgoVLM, a vision-language model designed to integrate visual comprehension and spatio-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. The model also improves interpretability by explicitly generating reasoning traces, and it introduces a keyframe-based reward to strengthen temporally grounded egocentric reasoning.

Link: https://arxiv.org/abs/2506.03097
Authors: Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu
Affiliations: The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Our code can be found at this https URL

Abstract:Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero’s approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
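GRPO's core idea is to normalize rewards within a group of responses sampled for the same prompt, using the group statistics as the baseline. A minimal sketch of the advantage computation (the tensor layout and epsilon are assumptions; the paper's exact objective may differ):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled answers.
    Returns group-relative advantages used to weight the policy gradient."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each (e.g., keyframe-based rewards).
r = torch.tensor([[1.0, 0.0, 0.5, 1.0], [0.2, 0.8, 0.2, 0.2]])
print(grpo_advantages(r))
```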

[CV-16] FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Quick read: This paper addresses a limitation of contrastive language-image pre-training (CLIP) on multimodal inputs: it cannot natively encode an image and a text into a single feature vector. The key to the solution is FuseLIP, a new architecture for multimodal embedding that uses a single transformer model operating on an extended vocabulary of text and image tokens. This early-fusion approach lets the different modalities interact at every depth of encoding, yielding richer representations.

Link: https://arxiv.org/abs/2506.03096
Authors: Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
Affiliations: Tübingen AI Center; University of Tübingen; EPFL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code and models available at this https URL

Abstract:Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
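Early fusion here amounts to embedding discrete text and image tokens from one extended vocabulary and processing the concatenated sequence with a single transformer. A schematic sketch; the sizes, offset scheme, and mean pooling are assumptions for illustration, not FuseLIP's actual configuration:

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512, depth=6):
        super().__init__()
        # One shared table: text ids in [0, text_vocab), image ids shifted after.
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.text_vocab = text_vocab

    def forward(self, text_ids, image_ids):
        tokens = torch.cat([text_ids, image_ids + self.text_vocab], dim=1)
        h = self.encoder(self.embed(tokens))
        return h.mean(dim=1)  # one fused embedding per (text, image) pair

model = EarlyFusionEncoder()
z = model(torch.randint(0, 32000, (2, 16)), torch.randint(0, 8192, (2, 64)))
print(z.shape)  # torch.Size([2, 512])
```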

[CV-17] DPO Learning with LLM s-Judge Signal for Computer Use Agents

Quick read: This paper targets the privacy and resource-efficiency shortcomings of current computer use agents (CUA), whose reliance on cloud inference brings heavy compute demands, privacy risks, and poor scalability. The key to the solution is a lightweight vision-language model that runs entirely on local machines, trained with an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality reinforcement learning data without human annotation.

Link: https://arxiv.org/abs/2506.03095
Authors: Man Luo, David Cobbley, Xin Su, Shachar Rosenman, Vasudev Lal, Shao-Yen Tseng, Phillip Howard
Affiliations: Intel; Thoughtworks
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUA have made significant progress with the advent of large vision-language models (VLMs). However, these agents typically rely on cloud-based inference with substantial compute demands, raising critical privacy and scalability concerns, especially when operating on personal devices. In this work, we take a step toward privacy-preserving and resource-efficient agents by developing a lightweight vision-language model that runs entirely on local machines. To train this compact agent, we introduce an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning without human annotation. Experiments on the OS-World benchmark demonstrate that our fine-tuned local model outperforms existing baselines, highlighting a promising path toward private, efficient, and generalizable GUI agents.
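Once a judge has scored synthetic GUI trajectories, turning them into preference-style training pairs is mechanical. A hedged sketch of that filtering step; the data layout, score field, and margin threshold are assumptions, and the paper's exact recipe may differ:

```python
from itertools import combinations

def build_preference_pairs(trajectories, margin=2.0):
    """trajectories: dicts like {"task": str, "steps": [...], "judge_score": float}.
    Returns (chosen, rejected) pairs per task, keeping only clear wins."""
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t["task"], []).append(t)
    pairs = []
    for group in by_task.values():
        for a, b in combinations(group, 2):
            if abs(a["judge_score"] - b["judge_score"]) >= margin:
                hi, lo = (a, b) if a["judge_score"] > b["judge_score"] else (b, a)
                pairs.append((hi, lo))
    return pairs

demo = [
    {"task": "open settings", "steps": ["click start", "click gear"], "judge_score": 9.0},
    {"task": "open settings", "steps": ["click desktop"], "judge_score": 3.5},
]
print(len(build_preference_pairs(demo)))  # 1
```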

[CV-18] Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness

Quick read: This paper addresses the fragility of convolutional neural networks (CNNs) under visual perturbations and out-of-distribution images, where they lag behind biological vision in robustness. The key to the solution is Early Vision Networks (EVNets), a new class of hybrid CNNs that couple the primate-V1-mimicking VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with reported subcortical responses. This design improves V1 alignment on standard V1 benchmarks, better models extra-classical receptive field phenomena, and strengthens model robustness.

Link: https://arxiv.org/abs/2506.03089
Authors: Lucas Piper, Arlindo L. Oliveira, Tiago Marques
Affiliations: INESC-ID Lisboa; IST Técnico Lisboa, Universidade de Lisboa; Breast Unit, Champalimaud Clinical Center, Champalimaud Foundation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end block (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit a stronger emergent shape bias and outperform the base CNN architecture by 8.5% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 7.3% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.

[CV-19] InterMamba: Efficient Human-Human Interaction Generation with Adaptive Spatio-Temporal Mamba

Quick read: This paper addresses the scalability and efficiency of human-human interaction generation in motion synthesis; existing methods mostly rely on transformer-based architectures, which are limited in handling long-sequence dependencies and real-time feedback. The key to the solution is an adaptive spatio-temporal framework built on Mamba, which fuses the spatial and temporal features of motion sequences through two parallel state space model (SSM) branches with an adaptive mechanism. Two key modules, a self-adaptive and a cross-adaptive spatio-temporal Mamba module, further strengthen the model's ability to capture dependencies within individual motion sequences and interactions across different sequences.

Link: https://arxiv.org/abs/2506.03084
Authors: Zizhao Wu, Yingying Sun, Yiming Chen, Xiaoling Gu, Ruyu Liu, Jiazhou Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human-human interaction generation has garnered significant attention in motion synthesis due to its vital role in understanding humans as social beings. However, existing methods typically rely on transformer-based architectures, which often face challenges related to scalability and efficiency. To address these issues, we propose a novel, efficient human-human interaction generation method based on the Mamba framework, designed to meet the demands of effectively capturing long-sequence dependencies while providing real-time feedback. Specifically, we introduce an adaptive spatio-temporal Mamba framework that utilizes two parallel SSM branches with an adaptive mechanism to integrate the spatial and temporal features of motion sequences. To further enhance the model's ability to capture dependencies within individual motion sequences and the interactions between different individual sequences, we develop two key modules: the self-adaptive spatio-temporal Mamba module and the cross-adaptive spatio-temporal Mamba module, enabling efficient feature learning. Extensive experiments demonstrate that our method achieves state-of-the-art results on two interaction datasets with remarkable quality and efficiency. Compared to the baseline method InterGen, our approach not only improves accuracy but also requires a minimal parameter size of just 66M, only 36% of InterGen's, while achieving an average inference speed of 0.57 seconds, which is 46% of InterGen's execution time.

[CV-20] SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis

Quick read: This paper addresses the shortcomings of conventional surgical simulation tools in photorealism and anatomical variability, as well as the lack of fine-grained human control in existing generative simulators. The key to the solution is SG2VID, the first diffusion-based video generation model that leverages Scene Graphs for both precise video synthesis and fine-grained human control, enabling accurate manipulation of tool size and movement, the entrance of new tools, and the overall scene layout in surgical simulation.

Link: https://arxiv.org/abs/2506.03082
Authors: Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short in providing the necessary photorealism and the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting the fine-grained human control aspect. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID’s capabilities across three public datasets featuring cataract and cholecystectomy surgery. While SG2VID outperforms previous methods both qualitatively and quantitatively, it also enables precise synthesis, providing accurate control over tool and anatomy’s size and movement, entrance of new tools, as well as the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID’s ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.

[CV-21] ORV: 4D Occupancy-centric Robot Video Generation

Quick read: This paper addresses the time- and labor-intensive nature of acquiring real-world robotic simulation data through teleoperation, as well as the limited control precision and poor generalization of existing action-driven generative models. The key to the solution is ORV, an Occupancy-centric Robot Video generation framework that uses 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation, producing photorealistic robot videos with high temporal consistency and precise controllability.

Link: https://arxiv.org/abs/2506.03079
Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
Affiliations: Beijing Academy of Artificial Intelligence; IIIS, Tsinghua University; Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo; The Chinese University of Hong Kong, Shenzhen; National University of Singapore; ByteDance; Megvii Technology; AIR, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL; Code: this https URL

Abstract:Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations - an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: this https URL

[CV-22] LEG-SLAM: Real-Time Language-Enhanced Gaussian Splatting for SLAM

Quick read: This paper addresses how to fuse semantic information into real-time 3D Gaussian Splatting while maintaining real-time performance, especially for SLAM (Simultaneous Localization and Mapping). The key to the solution is to combine an optimized Gaussian Splatting implementation with DINOv2-based visual-language feature extraction, followed by a learnable feature compressor driven by Principal Component Analysis (PCA), enabling online dense SLAM that simultaneously produces high-quality photorealistic images and semantically labeled scene maps.

Link: https://arxiv.org/abs/2506.03073
Authors: Roman Titkov, Egor Zubkov, Dmitry Yudin, Jaafar Mahmoud, Malik Mohrat, Gennady Sidorov
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern Gaussian Splatting methods have proven highly effective for real-time photorealistic rendering of 3D scenes. However, integrating semantic information into this representation remains a significant challenge, especially in maintaining real-time performance for SLAM (Simultaneous Localization and Mapping) applications. In this work, we introduce LEG-SLAM – a novel approach that fuses an optimized Gaussian Splatting implementation with visual-language feature extraction using DINOv2 followed by a learnable feature compressor based on Principal Component Analysis, while enabling an online dense SLAM. Our method simultaneously generates high-quality photorealistic images and semantically labeled scene maps, achieving real-time scene reconstruction with more than 10 fps on the Replica dataset and 18 fps on ScanNet. Experimental results show that our approach significantly outperforms state-of-the-art methods in reconstruction speed while achieving competitive rendering quality. The proposed system eliminates the need for prior data preparation such as camera’s ego motion or pre-computed static semantic maps. With its potential applications in autonomous robotics, augmented reality, and other interactive domains, LEG-SLAM represents a significant step forward in real-time semantic 3D Gaussian-based SLAM. Project page: this https URL
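The feature compressor reduces high-dimensional DINOv2 descriptors to a compact code cheap enough for real-time mapping. A sketch of the PCA-style projection; the target dimensionality and the use of scikit-learn are assumptions, and the paper's compressor is learnable rather than a fixed PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for per-pixel DINOv2 features gathered from a few keyframes: (N, 768).
feats = np.random.randn(10000, 768).astype(np.float32)

pca = PCA(n_components=16)             # fit once on collected features
pca.fit(feats)

codes = pca.transform(feats)           # compact 16-D codes stored per Gaussian
approx = pca.inverse_transform(codes)  # decode when semantic queries arrive
print(codes.shape, float(np.mean((feats - approx) ** 2)))
```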

[CV-23] EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models

Quick read: This paper addresses prompt inversion for text-to-image generation models, i.e., recovering the textual prompt that was used to generate a given image. The key to the solution is a prompt inversion technique, EDITOR, that initializes embeddings with a pre-trained image captioning model, refines them via reverse-engineering in the latent space, and converts them to text with an embedding-to-text model, achieving higher image similarity, textual alignment, prompt interpretability, and generalizability.

Link: https://arxiv.org/abs/2506.03067
Authors: Mingzhe Li, Gehao Zhang, Zhenting Wang, Shiqing Ma, Siqi Pan, Richard Cartwright, Juan Zhai
Affiliations: University of Massachusetts, Amherst; Rutgers University; Dolby Laboratories
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on the widely-used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.

[CV-24] Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

Quick read: This paper targets the inference latency of video Diffusion Transformers (DiTs) caused by the quadratic complexity of the attention mechanism. The key to the solution is to analyze attention maps in the video Diffusion Transformer (vDiT), identify three recurring sparsity patterns (diagonal, multi-diagonal, and vertical-stripe structures), and exploit them in Sparse-vDiT, a sparsity acceleration framework that pairs pattern-optimized sparse kernels for each identified pattern with an offline sparse diffusion search algorithm based on hardware-aware cost modeling.

Link: https://arxiv.org/abs/2506.03065
Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
Affiliations: Fudan University; StepFun; The Chinese University of Hong Kong; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long-sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in the Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, even 3-6% of attention heads can be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern; 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09x, 2.38x, and 1.67x theoretical FLOP reduction, and actual inference speedups of 1.76x, 1.85x, and 1.58x, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
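The three sparsity patterns can each be expressed as a boolean attention mask. A toy construction of diagonal, multi-diagonal, and vertical-stripe masks; the band widths, offsets, and stripe period below are made-up parameters for illustration, not values from the paper:

```python
import torch

def diagonal_mask(n, width=2):
    i = torch.arange(n)
    return (i[:, None] - i[None, :]).abs() <= width

def multi_diagonal_mask(n, offsets=(0, 64, 128), width=2):
    i = torch.arange(n)
    d = i[:, None] - i[None, :]
    return torch.stack([(d - o).abs() <= width for o in offsets]).any(0)

def vertical_stripe_mask(n, period=64, stripe=4):
    cols = (torch.arange(n) % period) < stripe  # a few "global" key columns
    return cols[None, :].expand(n, n)

n = 256
for m in (diagonal_mask(n), multi_diagonal_mask(n), vertical_stripe_mask(n)):
    print(f"density: {float(m.float().mean()):.3f}")  # fraction of attended pairs
```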

[CV-25] Smartflow: Enabling Scalable Spatiotemporal Geospatial Research

Quick read: This paper addresses geospatial model development and analysis over large geographic areas, long time scales, and massive imagery archives, with heavy-construction monitoring as the driving use case. The key to the solution is Smartflow, a framework built on open-source tools and technologies that takes STAC-compliant catalogs as a common input, converts heterogeneous geospatial data into standardized datacubes, and orchestrates workflow provisioning and execution with Kubernetes to support both horizontal and vertical scaling, making model training and analysis efficient and scalable.

Link: https://arxiv.org/abs/2506.03022
Authors: David McVicar, Brian Avant, Adrian Gould, Diego Torrejon, Charles Della Porta, Ryan Mukherjee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:BlackSky introduces Smartflow, a cloud-based framework enabling scalable spatiotemporal geospatial research built on open-source tools and technologies. Using STAC-compliant catalogs as a common input, heterogeneous geospatial data can be processed into standardized datacubes for analysis and model training. Model experimentation is managed using a combination of tools, including ClearML, Tensorboard, and Apache Superset. Underpinning Smartflow is Kubernetes, which orchestrates the provisioning and execution of workflows to support both horizontal and vertical scalability. This combination of features makes Smartflow well-suited for geospatial model development and analysis over large geographic areas, time scales, and expansive image archives. We also present a novel neural architecture, built using Smartflow, to monitor large geographic areas for heavy construction. Qualitative results based on data from the IARPA Space-based Machine Automated Recognition Technique (SMART) program are presented that show the model is capable of detecting heavy construction throughout all major phases of development.
Journal reference: IGARSS 2023. DOI: https://doi.org/10.1109/IGARSS52108.2023.10283095

[CV-26] DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models

Quick read: This paper addresses the difficulty current deepfake detection methods face against ever-evolving, increasingly realistic AI-generated content, in particular the limited generator and content diversity of existing datasets. The key to the solution is DFBench, a large-scale deepfake benchmark featuring broad diversity, up-to-date generative models, and bidirectional evaluation, together with MoA-DF (Mixture of Agents for DeepFake detection), which attains state-of-the-art detection through a combined probability strategy over multiple large multimodal models (LMMs).

Link: https://arxiv.org/abs/2506.03007
Authors: Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content; (ii) the latest models, with fake images generated by 12 state-of-the-art generation models; and (iii) bidirectional benchmarking that evaluates both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at this https URL.
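MoA-DF combines real/fake probabilities from several LMM judges. One simple combined-probability strategy is sketched below under assumptions (geometric-mean fusion; the paper's exact rule may differ):

```python
import math

def combine_fake_probs(probs, eps=1e-9):
    """probs: per-agent P(fake) values in [0, 1]. Geometric-mean fusion."""
    log_fake = sum(math.log(p + eps) for p in probs) / len(probs)
    log_real = sum(math.log(1 - p + eps) for p in probs) / len(probs)
    zf, zr = math.exp(log_fake), math.exp(log_real)
    return zf / (zf + zr)  # renormalized P(fake)

agents = [0.91, 0.75, 0.60]  # e.g., three LMM judges scoring one image
print(combine_fake_probs(agents))
```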

[CV-27] PartComposer: Learning and Composing Part-Level Concepts from Single-Image Examples

Quick read: This paper addresses part-level concept learning from single-image examples, aiming to let text-to-image diffusion models compose novel objects from meaningful components; existing methods either struggle to learn fine-grained concepts or require large datasets as input. The key to the solution is a dynamic data synthesis pipeline that generates diverse part compositions to counter one-shot data scarcity, together with a concept predictor that maximizes the mutual information between denoised latents and structured concept codes, enabling direct regulation of concept disentanglement and re-composition supervision.

Link: https://arxiv.org/abs/2506.03004
Authors: Junyu Liu, R. Kenny Jones, Daniel Ritchie
Affiliations: Brown University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present PartComposer: a framework for part-level concept learning from single-image examples that enables text-to-image diffusion models to compose novel objects from meaningful components. Existing methods either struggle with effectively learning fine-grained concepts or require a large dataset as input. We propose a dynamic data synthesis pipeline generating diverse part compositions to address one-shot data scarcity. Most importantly, we propose to maximize the mutual information between denoised latents and structured concept codes via a concept predictor, enabling direct regulation on concept disentanglement and re-composition supervision. Our method achieves strong disentanglement and controllable composition, outperforming subject and part-level baselines when mixing concepts from the same, or different, object categories.

[CV-28] Astrophotography turbulence mitigation via generative models

Quick read: This paper addresses the degraded imaging quality of astronomical images captured by ground-based telescopes due to atmospheric turbulence. The key to the solution is AstroDiff, a generative restoration method based on diffusion models that leverages both their high-quality generative priors and their restoration capabilities to effectively mitigate the effects of atmospheric turbulence.

Link: https://arxiv.org/abs/2506.02981
Authors: Joonyeoup Kim, Yu Yuan, Xingguang Zhang, Xijun Wang, Stanley Chan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Photography is the cornerstone of modern astronomical and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, resulting in degraded imaging quality. While multi-frame strategies like lucky imaging can mitigate some effects, they involve intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods in astronomical image turbulence mitigation, providing higher perceptual quality and better structural fidelity under severe turbulence conditions. Our code and additional results are available at this https URL

[CV-29] Deep Learning for Retinal Degeneration Assessment: A Comprehensive Analysis of the MARIO AMD Progression Challenge MICCAI

Quick read: This paper addresses the automated detection and monitoring of age-related macular degeneration (AMD) from optical coherence tomography (OCT) images, in particular the identification of changes in neovascular activity. The key to the solution lies in two tasks: classifying the evolution between two consecutive 2D OCT B-scans, and predicting AMD progression over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Through multi-modal datasets and the algorithms developed by the participating teams, the challenge establishes a benchmark for AMD monitoring and shows that artificial intelligence matches physicians in measuring AMD progression but still falls short at predicting future evolution.

Link: https://arxiv.org/abs/2506.02976
Authors: Rachid Zeghlache, Ikram Brahim, Pierre-Henri Conze, Mathieu Lamard, Mohammed El Amine Lazouni, Zineb Aziza Elaouaber, Leila Ryma Lazouni, Christopher Nielsen, Ahmad O. Ahsan, Matthias Wilms, Nils D. Forkert, Lovre Antonio Budimir, Ivana Matovinović, Donik Vršnak, Sven Lončarić, Philippe Zhang, Weili Jiang, Yihao Li, Yiding Hao, Markus Frohmann, Patrick Binder, Marcel Huber, Taha Emre, Teresa Finisterra Araújo, Marzieh Oghbaie, Hrvoje Bogunović, Amerens A. Bekkers, Nina M. van Liebergen, Hugo J. Kuijf, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Alberto J. Beltrán-Carrero, Juan J. Gómez-Valverde, Javier Torresano-Rodríquez, Álvaro Caballero-Sastre, María J. Ledesma Carbayo, Yosuke Yamagishi, Yi Ding, Robin Peretzke, Alexandra Ertl, Maximilian Fischer, Jessica Kächele, Sofiane Zehar, Karim Boukli Hacene, Thomas Monfort, Béatrice Cochener, Mostafa El Habib Daho, Anas-Alexis Benyoussef, Gwenolé Quellec
Affiliations: LaTIM UMR 1101, Inserm, Brest, France; University of Western Brittany, Brest, France; IMT Atlantique, Brest, France; Service d'Ophtalmologie, CHRU Brest, Brest, France; University of Tlemcen, Algeria; LAZOUNI Ophthalmology Clinic, Tlemcen, Algeria; Ophthalmology Department, CHRU Brest, Brest, France; INSERM U1227 Lymphocytes B et Autoimmunite (LBAI), Brest, France; Imperial College London, United Kingdom; Division of Radiology and Biomedical Engineering, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan; Evolucare Technologies, France; College of Computer Science, Sichuan University, China; Medical University of Vienna, Austria; TNO, The Hague, The Netherlands; Image Sciences Institute, UMC Utrecht, Utrecht, The Netherlands; Johannes Kepler University Linz, Austria; University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia; Department of Radiology, University of Calgary, Calgary, AB, Canada; Biomedical Engineering Graduate Program, University of Calgary, Calgary, AB, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary, AB, Canada; Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Pediatrics, University of Calgary, Calgary, AB, Canada; Department of Community Health Sciences, University of Calgary, Calgary, AB, Canada; Department of Clinical Neuroscience, University of Calgary, Calgary, AB, Canada; University of Calgary, Calgary, AB, Canada; German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany; Medical Faculty Heidelberg, Heidelberg University, Germany; German Cancer Consortium (DKTK), DKFZ, Germany; Biomedical Image Technologies (BIT), ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain; Centro de Investigación Biomédica en Red de Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Madrid, Spain; Ophthalmology Service of the Provincial Ophthalmic Institute, Hospital Universitario Gregorio Marañón, Madrid, Spain; University of Edinburgh, Scotland; National Heart and Lung Institute, Faculty of Medicine, Imperial College London, United Kingdom; Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom; United Imaging Healthcare, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MARIO-MICCAI-CHALLENGE 2024

Abstract:The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through the analysis of optical coherence tomography (OCT) images. Designed to evaluate algorithmic performance in detecting neovascular activity changes within AMD, the challenge incorporated unique multi-modal datasets. The primary dataset, sourced from Brest, France, was used by participating teams to train and test their models. The final ranking was determined based on performance on this dataset. An auxiliary dataset from Algeria was used post-challenge to evaluate population and device shifts from submitted solutions. Two tasks were involved in the MARIO challenge. The first one was the classification of evolution between two consecutive 2D OCT B-scans. The second one was the prediction of future AMD evolution over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, with the top 12 finalists presenting their methods. This paper outlines the challenge's structure, tasks, data characteristics, and winning methodologies, setting a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as the number of visits, age, gender, etc.). The results of this challenge indicate that artificial intelligence (AI) performs as well as a physician in measuring AMD progression (Task 1) but is not yet capable of predicting future evolution (Task 2).

[CV-30] HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

Quick read: This paper addresses the training-efficiency and cross-modal-compatibility challenges of unified multimodal understanding and generation models. The key to the solution is a multimodal warmup strategy that exploits prior knowledge, together with feature pre-scaling and multimodal AdaLN techniques that improve cross-modal alignment, yielding HaploOmni, an efficient single-transformer architecture.

Link: https://arxiv.org/abs/2506.02975
Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; ARC Lab, Tencent PCG; The University of Hong Kong; Xi'an JiaoTong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present the HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks over advanced unified models. All codes will be made public at this https URL.

[CV-31] FORLA:Federated Object-centric Representation Learning with Slot Attention

Quick read: This paper addresses learning efficient visual representations across heterogeneous unlabeled datasets, a central challenge in federated learning. The key to the solution is FORLA, a framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. Its core components are a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that bridges cross-domain learning by aligning object-level representations across clients; a two-branch student-teacher architecture optimizes the adapter, letting the model learn compact representations that generalize broadly.

Link: https://arxiv.org/abs/2506.02964
Authors: Guiqiu Liao, Matjaz Jogan, Eric Eaton, Daniel A. Hashimoto
Affiliations: University of Pennsylvania
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 6 figures

Abstract:Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.
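At the center of FORLA is slot attention, which iteratively lets a small set of slot vectors compete for input features. A compact sketch of one canonical slot-attention update, loosely following Locatello et al. (2020) with a deterministic learned initialization for brevity; the hyperparameters are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot attention with a deterministic learned slot init."""
    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):                 # feats: (B, N, dim) adapted features
        b, _, d = feats.shape
        x = self.norm_in(feats)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_init.expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)  # slots compete
            attn = attn / attn.sum(dim=-1, keepdim=True)  # normalize over inputs
            updates = attn @ v                            # (B, S, dim)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots                                      # object-centric slots

print(SlotAttention()(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 7, 64])
```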

[CV-32] Interaction Field Matching: Overcoming Limitations of Electrostatic Models

Quick read: This paper addresses accurate field modeling in generative data generation and transfer, in particular the limitation of electrostatic field matching (EFM) in handling the complex electric field outside the capacitor plates. The key to the solution is Interaction Field Matching (IFM), a more general framework that is not restricted to electrostatic fields and admits other kinds of interaction fields; inspired by the strong interaction between quarks and antiquarks in physics, the authors design a particular interaction field realization that resolves the modeling difficulties arising in EFM.

Link: https://arxiv.org/abs/2506.02950
Authors: Stepan I. Manukhov, Alexander Kolesov, Vladimir V. Palyulin, Alexander Korotin
Affiliations: Skolkovo Institute of Science and Technology; Artificial Intelligence Research Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Electrostatic field matching (EFM) has recently appeared as a novel physics-inspired paradigm for data generation and transfer using the idea of an electric capacitor. However, it requires modeling electrostatic fields using neural networks, which is non-trivial because of the necessity to take into account the complex field outside the capacitor plates. In this paper, we propose Interaction Field Matching (IFM), a generalization of EFM which allows using general interaction fields beyond the electrostatic one. Furthermore, inspired by strong interactions between quarks and antiquarks in physics, we design a particular interaction field realization which solves the problems which arise when modeling electrostatic fields in EFM. We show the performance on a series of toy and image data transfer problems.

[CV-33] MIND: Material Interface Generation from UDFs for Non-Manifold Surface Reconstruction

Quick read: This paper addresses the challenges of extracting non-manifold meshes from unsigned distance fields (UDFs): learned fields rarely attain exact zero distances, so conventional methods that reconstruct surfaces via local signed distance fields (SDFs) introduce topological artifacts or cannot handle non-manifold geometry at all. The key to the solution is MIND, an algorithm that derives a meaningful spatial partitioning from the UDF so that the target surface emerges as the interface between distinct regions, and that combines a multi-labeled global field with the input UDF to construct material interfaces supporting non-manifold mesh extraction.

Link: https://arxiv.org/abs/2506.02938
Authors: Xuhui Chen, Fei Hou, Wencheng Wang, Hong Qin, Ying He
Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Stony Brook University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsigned distance fields (UDFs) are widely used in 3D deep learning due to their ability to represent shapes with arbitrary topology. While prior work has largely focused on learning UDFs from point clouds or multi-view images, extracting meshes from UDFs remains challenging, as the learned fields rarely attain exact zero distances. A common workaround is to reconstruct signed distance fields (SDFs) locally from UDFs to enable surface extraction via Marching Cubes. However, this often introduces topological artifacts such as holes or spurious components. Moreover, local SDFs are inherently incapable of representing non-manifold geometry, leading to complete failure in such cases. To address this gap, we propose MIND (Material Interface from Non-manifold Distance fields), a novel algorithm for generating material interfaces directly from UDFs, enabling non-manifold mesh extraction from a global perspective. The core of our method lies in deriving a meaningful spatial partitioning from the UDF, where the target surface emerges as the interface between distinct regions. We begin by computing a two-signed local field to distinguish the two sides of manifold patches, and then extend this to a multi-labeled global field capable of separating all sides of a non-manifold structure. By combining this multi-labeled field with the input UDF, we construct material interfaces that support non-manifold mesh extraction via a multi-labeled Marching Cubes algorithm. Extensive experiments on UDFs generated from diverse data sources, including point cloud reconstruction, multi-view reconstruction, and medial axis transforms, demonstrate that our approach robustly handles complex non-manifold surfaces and significantly outperforms existing methods.

[CV-34] owards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection

Quick read: This paper addresses the difficulty of data annotation in real-world applications, where human annotators manually label data according to detailed, expert-crafted guidelines, a laborious, tedious, and costly process. The key to the solution is AnnoGuide, a new benchmark for auto-annotation from annotation guidelines, which evaluates methods that annotate data directly from expert-defined guidelines and thus removes the need for manual labeling. A conceptually straightforward pipeline uses open-source foundation models (FMs) for object detection and segmentation in RGB images, projects 2D detections into 3D using known camera poses, and clusters LiDAR points within the frustum of each 2D detection to generate 3D cuboids. Progressively refining key components lifts 3D detection mAP substantially, but the results also show that AnnoGuide remains an open and challenging problem, underscoring the urgent need for LiDAR-based foundation models.

Link: https://arxiv.org/abs/2506.02914
Authors: Yechi Ma, Wei Hua, Shu Kong
Affiliations: Zhejiang University; University of Macau; Institute of Collaborative Innovation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A crucial yet under-appreciated prerequisite in machine learning solutions for real applications is data annotation: human annotators are hired to manually label data according to detailed, expert-crafted guidelines. This is often a laborious, tedious, and costly process. To study methods for facilitating data annotation, we introduce a new benchmark AnnoGuide: Auto-Annotation from Annotation Guidelines. It aims to evaluate automated methods for data annotation directly from expert-defined annotation guidelines, eliminating the need for manual labeling. As a case study, we repurpose the well-established nuScenes dataset, commonly used in autonomous driving research, which provides comprehensive annotation guidelines for labeling LiDAR point clouds with 3D cuboids across 18 object classes. These guidelines include a few visual examples and textual descriptions, but no labeled 3D cuboids in LiDAR data, making this a novel task of multi-modal few-shot 3D detection without 3D annotations. The advances of powerful foundation models (FMs) make AnnoGuide especially timely, as FMs offer promising tools to tackle its challenges. We employ a conceptually straightforward pipeline that (1) utilizes open-source FMs for object detection and segmentation in RGB images, (2) projects 2D detections into 3D using known camera poses, and (3) clusters LiDAR points within the frustum of each 2D detection to generate a 3D cuboid. Starting with a non-learned solution that leverages off-the-shelf FMs, we progressively refine key components and achieve significant performance improvements, boosting 3D detection mAP from 12.1 to 21.9! Nevertheless, our results highlight that AnnoGuide remains an open and challenging problem, underscoring the urgent need for developing LiDAR-based FMs. We release our code and models at GitHub: this https URL
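Step (3) of the pipeline, clustering LiDAR points inside each 2D detection's frustum, can be sketched as follows. The pinhole camera model, DBSCAN parameters, and the axis-aligned cuboid fit are simplifying assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def box_frustum_points(pts_lidar, K, T_cam_from_lidar, box2d):
    """Keep LiDAR points whose camera projection falls in box2d=(x1, y1, x2, y2)."""
    p = (T_cam_from_lidar @ np.c_[pts_lidar, np.ones(len(pts_lidar))].T).T[:, :3]
    in_front = p[:, 2] > 0.5
    uv = (K @ p.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    x1, y1, x2, y2 = box2d
    inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    return pts_lidar[in_front & inside]

def fit_cuboid(pts):
    """Largest DBSCAN cluster -> axis-aligned cuboid (center, size)."""
    labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(pts)
    if (labels >= 0).sum() == 0:
        return None
    best = max(set(labels[labels >= 0]), key=lambda l: (labels == l).sum())
    c = pts[labels == best]
    lo, hi = c.min(axis=0), c.max(axis=0)
    return (lo + hi) / 2, hi - lo  # center, (dx, dy, dz)

# Toy demo: identity extrinsics, simple pinhole intrinsics, synthetic cloud.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = np.random.randn(2000, 3) * [2, 1, 1] + [0, 0, 8]
sel = box_frustum_points(pts, K, np.eye(4), (200, 140, 440, 340))
print(None if len(sel) < 10 else fit_cuboid(sel))
```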

[CV-35] FlySearch: Exploring how vision-language models explore

Quick read: This paper asks whether vision-language models (VLMs) can search for and navigate to targets effectively in complex, unstructured real-world environments. Using FlySearch, a 3D, outdoor, photorealistic environment, the study evaluates VLMs across scenarios of varying difficulty and finds that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance widening as task complexity grows. The key lies in identifying and mitigating the central causes of failure, including vision hallucination, context misunderstanding, and task planning failures, some of which can be addressed by fine-tuning.

Link: https://arxiv.org/abs/2506.02896
Authors: Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz, Marek Cygan, Bartosz Zieliński, Maciej Wołczyk
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.

[CV-36] VolTex: Food Volume Estimation using Text-Guided Segmentation and Neural Surface Reconstruction

Quick read: This paper addresses the lack of food-portion selection in food volume estimation: existing 3D food volume estimation methods compute volume accurately but fall short at selecting specific food objects. The key to the solution is VolTex, a framework in which the user specifies a target food item via text input for segmentation, enabling precise selection of specific food objects in real-world scenes; the segmented object is then reconstructed using a neural surface reconstruction method to produce a high-fidelity 3D mesh for volume computation.

Link: https://arxiv.org/abs/2506.02895
Authors: Ahmad AlMughrabi, Umair Haroon, Ricardo Marques, Petia Radeva
Affiliations: Universitat de Barcelona, Spain; Universitat Pompeu Fabra (UPF), Spain; IMUB & Institut de Neurociències, Barcelona
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate food volume estimation is crucial for dietary monitoring, medical nutrition management, and food intake analysis. Existing 3D food volume estimation methods compute the food volume accurately but lack support for selecting specific food portions. We present VolTex, a framework that improves the selection of food objects in food volume estimation. By allowing users to specify a target food item via text input, our method enables the precise selection and segmentation of specific food objects in real-world scenes. The segmented object is then reconstructed using the Neural Surface Reconstruction method to generate high-fidelity 3D meshes for volume computation. Extensive evaluations on the MetaFood3D dataset demonstrate the effectiveness of our approach in isolating and reconstructing food items for accurate volume estimation. The source code is accessible at this https URL.
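Once the food item is reconstructed as a watertight mesh, its volume follows from the divergence theorem: sum the signed volumes of the tetrahedra formed by each triangle and the origin. A small sketch, assuming a closed and consistently oriented mesh:

```python
import numpy as np

def mesh_volume(vertices: np.ndarray, faces: np.ndarray) -> float:
    """Volume of a closed triangle mesh via per-face signed tetrahedra."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    return float(np.abs(np.einsum("ij,ij->i", v0, np.cross(v1, v2)).sum()) / 6.0)

# Sanity check on a unit tetrahedron, whose volume is 1/6.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
print(mesh_volume(verts, faces))  # ~0.1667
```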

[CV-37] Dense Match Summarization for Faster Two-view Estimation CVPR

Quick read: This paper addresses obtaining robust two-view relative pose quickly from dense correspondences. Dense matchers can substantially improve the accuracy and robustness of pose estimation, but the large number of matches significantly increases the runtime of robust estimation in RANSAC. The key to the solution is an efficient match summarization scheme that delivers accuracy comparable to using the full set of dense matches while running 10-100x faster.

Link: https://arxiv.org/abs/2506.02893
Authors: Jonathan Astermark, Anders Heyden, Viktor Larsson
Affiliations: Lund University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Computer Vision and Pattern Recognition (CVPR) 2025

Abstract:In this paper, we speed up robust two-view relative pose from dense correspondences. Previous work has shown that dense matchers can significantly improve both accuracy and robustness in the resulting pose. However, the large number of matches comes with a significantly increased runtime during robust estimation in RANSAC. To avoid this, we propose an efficient match summarization scheme which provides comparable accuracy to using the full set of dense matches, while having 10-100x faster runtime. We validate our approach on standard benchmark datasets together with multiple state-of-the-art dense matchers.
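One natural family of summarization schemes bins dense matches on a spatial grid and keeps a single representative per cell before RANSAC. The sketch below is such a strawman, not necessarily the paper's scheme; the cell size and the matcher confidence field are assumptions:

```python
import numpy as np

def summarize_matches(kp0, kp1, conf, image_wh, cell=32):
    """Keep the highest-confidence match per grid cell of the first image.
    kp0, kp1: (N, 2) pixel coords; conf: (N,) matcher confidences."""
    w, h = image_wh
    cells = (kp0[:, 1] // cell).astype(int) * (w // cell + 1) \
        + (kp0[:, 0] // cell).astype(int)
    order = np.argsort(-conf)                  # best matches first
    _, first = np.unique(cells[order], return_index=True)
    keep = order[first]                        # one winner per occupied cell
    return kp0[keep], kp1[keep]

kp0 = np.random.rand(5000, 2) * [640, 480]
kp1 = kp0 + np.random.randn(5000, 2)
s0, s1 = summarize_matches(kp0, kp1, np.random.rand(5000), (640, 480))
print(len(s0), "matches fed to RANSAC instead of", len(kp0))
```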

[CV-38] OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis

Quick read: This paper addresses the demand for automatic facial behavior analysis systems in computing fields such as vision, multimodal interaction, robotics, and affective computing. The key to the solution is OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection, eye-gaze estimation, and facial emotion recognition. Its core innovation is a lightweight unified model for facial analysis, trained with a multi-task architecture across diverse populations, head poses, lighting conditions, video resolutions, and facial analysis tasks, improving prediction performance, inference speed, and memory efficiency.

Link: https://arxiv.org/abs/2506.02891
Authors: Jiewen Hu, Leena Mathur, Paul Pu Liang, Louis-Philippe Morency
Affiliations: Carnegie Mellon University; Massachusetts Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE FG 2025, © 2025 IEEE

Abstract:In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection, eye-gaze estimation, and facial emotion recognition. OpenFace 3.0 contributes a lightweight unified model for facial analysis, trained with a multi-task architecture across diverse populations, head poses, lighting conditions, video resolutions, and facial analysis tasks. By leveraging the benefits of parameter sharing through a unified model and training paradigm, OpenFace 3.0 exhibits improvements in prediction performance, inference speed, and memory efficiency over similar toolkits and rivals state-of-the-art models. OpenFace 3.0 can be installed and run with a single line of code and operate in real-time without specialized hardware. OpenFace 3.0 code for training models and running the system is freely available for research purposes and supports contributions from the community.

[CV-39] GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation

Quick read: This paper aims to improve the robustness of the Segment Anything Model (SAM) to input degradations, a prerequisite for deployment in high-stakes applications such as autonomous driving and robotics. The key to the solution is gated-rank adaptation (GaRA), a lightweight adapter mechanism inserted into intermediate layers of the frozen SAM: a learned gating module dynamically adjusts the effective rank of each adapter's weight matrix based on the input, enabling fine-grained, input-aware robustification while preserving SAM's generalization capability.

Link: https://arxiv.org/abs/2506.02882
Authors: Sohyun Lee, Yeho Kwon, Lukas Hoyer, Suha Kwak
Affiliations: POSTECH; Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Improving robustness of the Segment Anything Model (SAM) to input degradations is critical for its deployment in high-stakes applications such as autonomous driving and robotics. Our approach to this challenge prioritizes three key aspects: first, parameter efficiency to maintain the inherent generalization capability of SAM; second, fine-grained and input-aware robustification to precisely address the input corruption; and third, adherence to standard training protocols for ease of training. To this end, we propose gated-rank adaptation (GaRA). GaRA introduces lightweight adapters into intermediate layers of the frozen SAM, where each adapter dynamically adjusts the effective rank of its weight matrix based on the input by selectively activating (rank-1) components of the matrix using a learned gating module. This adjustment enables fine-grained and input-aware robustification without compromising the generalization capability of SAM. Our model, GaRA-SAM, significantly outperforms prior work on all robust segmentation benchmarks. In particular, it surpasses the previous best IoU score by up to 21.3%p on ACDC, a challenging real corrupted image dataset.
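A schematic of the gated-rank idea: a LoRA-style adapter whose rank-1 components are switched on or off by an input-conditioned gate. The dimensions, the sigmoid gate, and the residual placement are assumptions for illustration; see the paper for GaRA's exact design:

```python
import torch
import torch.nn as nn

class GatedRankAdapter(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # low-rank projection A
        self.up = nn.Linear(rank, dim, bias=False)    # low-rank projection B
        self.gate = nn.Linear(dim, rank)              # input-conditioned gate

    def forward(self, x):                             # x: (B, N, dim) tokens
        g = torch.sigmoid(self.gate(x.mean(dim=1)))   # (B, rank), soft on/off per component
        return x + self.up(self.down(x) * g[:, None, :])  # gated low-rank update

out = GatedRankAdapter(dim=256)(torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```

The gate effectively selects how many rank-1 components are active for each input, so the adapter's effective rank adapts to the corruption present in the image.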

[CV-40] NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results

Quick read: This paper addresses quality assessment for video and talking-head processing through the NTIRE 2025 XGC Quality Assessment Challenge, which is divided into three tracks, user-generated video, AI-generated video, and talking heads, evaluated on the FineVD-GC, Q-Eval-Video, and THQA-NTIRE datasets respectively. The key lies in the track-specific quality assessment methods designed by the participating teams and validated through the many submitted models and fact sheets; every finalist method outperforms its baseline, contributing to progress in all three areas.

Link: https://arxiv.org/abs/2506.02875
Authors: Xiaohong Liu, Xiongkuo Min, Qiang Hu, Xiaoyun Zhang, Jie Guo, Guangtao Zhai, Shushi Wang, Yingjie Zhou, Lu Liu, Jingxin Li, Liu Yang, Farong Wen, Li Xu, Yanwei Jiang, Xilei Zhu, Chunyi Li, Zicheng Zhang, Huiyu Duan, Xiele Wu, Yixuan Gao, Yuqin Cao, Jun Jia, Wei Sun, Jiezhang Cao, Radu Timofte, Baojun Li, Jiamian Huang, Dan Luo, Tao Liu, Weixia Zhang, Bingkun Zheng, Junlin Chen, Ruikai Zhou, Meiya Chen, Yu Wang, Hao Jiang, Xiantao Li, Yuxiang Jiang, Jun Tang, Yimeng Zhao, Bo Hu, Zelu Qi, Chaoyang Zhang, Fei Zhao, Ping Shi, Lingzhi Fu, Heng Cong, Shuai He, Rongyu Zhang, Jiarong He, Zongyao Hu, Wei Luo, Zihao Yu, Fengbin Guan, Yiting Lu, Xin Li, Zhibo Chen, Mengjing Su, Yi Wang, Tuo Chen, Chunxiao Li, Shuaiyu Zhao, Jiaxin Wen, Chuyi Lin, Sitong Liu, Ningxin Chu, Jing Wan, Yu Zhou, Baoying Chen, Jishen Zeng, Jiarui Liu, Xianjin Liu, Xin Chen, Lanzhi Zhou, Hangyu Li, You Han, Bibo Xiang, Zhenjie Liu, Jianzhang Lu, Jialin Gui, Renjie Lu, Shangfei Wang, Donghao Zhou, Jingyu Lin, Quanjian Song, Jiancheng Huang, Yufeng Yang, Changwei Wang, Shupeng Zhong, Yang Yang, Lihuo He, Jia Liu, Yuting Xing, Tida Fang, Yuchun Jin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NTIRE 2025 XGC Quality Assessment Challenge Report. arXiv admin note: text overlap with arXiv:2404.16687

Abstract:This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The challenge addresses a major problem in the field of video and talking head processing and is divided into three tracks: user-generated video, AI-generated video, and talking head. The user-generated video track uses the FineVD-GC dataset, which contains 6,284 user-generated videos, and attracted a total of 125 registered participants. A total of 242 submissions were received in the development phase and 136 in the test phase; finally, 5 participating teams submitted their models and fact sheets. The AI-generated video track uses the Q-Eval-Video dataset, which contains 34,029 AI-Generated Videos (AIGVs) produced by 11 popular Text-to-Video (T2V) models. A total of 133 participants registered in this track; 396 submissions were received in the development phase and 226 in the test phase, and finally 6 participating teams submitted their models and fact sheets. The talking head track uses the THQA-NTIRE dataset, which contains 12,247 2D and 3D talking heads. A total of 89 participants registered in this track; 225 submissions were received in the development phase and 118 in the test phase, and finally 8 participating teams submitted their models and fact sheets. Every participating team in each track proposed a method that outperforms the baseline, contributing to the development of all three fields.
zh

[CV-41] Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings

【速读】:该论文旨在解决在泛北极范围内利用亚米级卫星影像准确识别永久冻土地貌、融区扰动及人类建筑基础设施的问题,其核心挑战在于处理大规模图像数据时的计算效率与特征检测模型的鲁棒性。解决方案的关键在于引入基于视觉变压器(Vision Transformers, ViTs)的模型架构,通过自监督预训练缓解标注数据不足的问题,并结合地理空间位置嵌入以提升模型在不同区域的泛化能力。实验结果表明,融合位置嵌入的ViTs在两项任务中优于传统的卷积神经网络(CNN)模型,展示了具有空间感知能力的Transformer模型在北极遥感应用中的潜力。

链接: https://arxiv.org/abs/2506.02868
作者: Amal S. Perera,David Fernandez,Chandi Witharana,Elias Manos,Michael Pimenta,Anna K. Liljedahl,Ingmar Nitze,Yili Yang,Todd Nicholson,Chia-Yu Hsu,Wenwen Li,Guido Grosse
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 2 column IEEE format, 13 Figures

点击查看摘要

Abstract:Accurate mapping of permafrost landforms, thaw disturbances, and human-built infrastructure at pan-Arctic scale using sub-meter satellite imagery is increasingly critical. Handling petabyte-scale image data requires high-performance computing and robust feature detection models. While convolutional neural network (CNN)-based deep learning approaches are widely used for remote sensing (RS), Vision Transformers (ViTs), mirroring the success of transformer-based large language models, offer advantages in capturing long-range dependencies and global context via attention mechanisms. ViTs support pretraining via self-supervised learning, addressing the common limitation of labeled data in Arctic feature detection, and outperform CNNs on benchmark datasets. The Arctic also poses challenges for model generalization, especially when features with the same semantic class exhibit diverse spectral characteristics. To address these issues for Arctic feature detection, we integrate geospatial location embeddings into ViTs to improve adaptation across regions. This work investigates: (1) the suitability of pre-trained ViTs as feature extractors for high-resolution Arctic remote sensing tasks, and (2) the benefit of combining image and location embeddings. Using previously published datasets for Arctic feature detection, we evaluate our models on three tasks: detecting ice-wedge polygons (IWP), retrogressive thaw slumps (RTS), and human-built infrastructure. We empirically explore multiple configurations to fuse image embeddings and location embeddings. Results show that ViTs with location embeddings outperform prior CNN-based models on two of the three tasks, including an F1-score increase from 0.84 to 0.92 for RTS detection, demonstrating the potential of transformer-based models with spatial awareness for Arctic RS applications.
zh
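
下面的片段示意"图像嵌入 + 地理位置嵌入"的一种融合配置:对经纬度做多频率正弦编码后,与 ViT 图像特征拼接再分类。这仅是一种假设的实现方式,论文实际比较了多种融合配置:

```python
import math
import torch
import torch.nn as nn

def loc_embed(latlon: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """对 (batch, 2) 的经纬度做正弦编码,得到 (batch, dim) 的位置嵌入"""
    n = dim // 4
    freqs = torch.exp(torch.arange(n, dtype=torch.float32) * (-math.log(1e4) / n))
    x = latlon.unsqueeze(-1) * freqs                          # (batch, 2, n)
    return torch.cat([x.sin(), x.cos()], dim=-1).flatten(1)   # (batch, dim)

class LocAwareHead(nn.Module):
    """把图像特征(来自预训练 ViT)与位置嵌入拼接后送入分类头"""
    def __init__(self, feat_dim: int = 768, loc_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(feat_dim + loc_dim, num_classes)

    def forward(self, img_feat, latlon):
        return self.fc(torch.cat([img_feat, loc_embed(latlon)], dim=-1))

logits = LocAwareHead()(torch.randn(4, 768), torch.rand(4, 2))   # (4, 2)
```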

[CV-42] MVTD: A Benchmark Dataset for Maritime Visual Object Tracking

【速读】:该论文旨在解决海事环境中视觉目标跟踪(Maritime Visual Object Tracking, MVOT)面临的独特挑战,如镜面水反射、低对比度目标、动态背景变化和频繁遮挡等问题。现有通用目标跟踪算法在这些复杂场景下的性能显著下降,表明需要针对海事领域的专用数据集。论文提出的关键解决方案是构建一个名为Maritime Visual Tracking Dataset (MVTD) 的综合性公开基准数据集,包含182个高分辨率视频序列和约150,000帧,涵盖四种典型海事目标类别。通过在MVTD上对最新跟踪算法进行微调,模型性能得到显著提升,验证了领域自适应和迁移学习在专业跟踪任务中的有效性。

链接: https://arxiv.org/abs/2506.02866
作者: Ahsan Baidar Bakht,Muhayy Ud Din,Sajid Javed,Irfan Hussain
机构: Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University, United Arab Emirates.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submited to Nature Scientific Data

点击查看摘要

Abstract:Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. The dataset and source code can be accessed at this https URL.
zh

[CV-43] Enhancing Abnormality Identification: Robust Out-of-Distribution Strategies for Deepfake Detection

【速读】:该论文试图解决深度伪造(deepfake)检测在开放集场景下的泛化能力不足问题,即传统神经网络在封闭世界假设下训练,难以应对来自训练分布之外的生成模型数据。解决方案的关键在于提出两种新的异常数据(Out-Of-Distribution, OOD)检测方法:第一种方法通过重建输入图像来识别异常数据,第二种方法则引入注意力机制以增强对OOD的检测能力。实验结果表明,所提方法在深度伪造检测任务中表现出色,具有良好的鲁棒性和适应性。

链接: https://arxiv.org/abs/2506.02857
作者: Luca Maiano,Fabrizio Casadei,Irene Amerini
机构: Sapienza University of Rome (罗马第一大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting deepfakes has become a critical challenge in Computer Vision and Artificial Intelligence. Despite significant progress in detection techniques, generalizing them to open-set scenarios continues to be a persistent difficulty. Neural networks are often trained on the closed-world assumption, but with new generative models constantly evolving, it is inevitable to encounter data generated by models that are not part of the training distribution. To address these challenges, in this paper, we propose two novel Out-Of-Distribution (OOD) detection approaches. The first approach is trained to reconstruct the input image, while the second incorporates an attention mechanism for detecting OODs. Our experiments validate the effectiveness of the proposed approaches compared to existing state-of-the-art techniques. Our method achieves promising results in deepfake detection and ranks among the top-performing configurations on the benchmark, demonstrating its potential for robust, adaptable solutions in dynamic, real-world applications.
zh
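
文中第一种 OOD 检测思路(重建输入图像,以重建误差判定分布外样本)可以用如下示意代码理解。网络结构与阈值均为本文假设,并非论文原实现:

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """极简卷积自编码器:在分布内数据上训练后,OOD 样本通常重建误差更大"""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def ood_score(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """逐样本的像素级重建误差;判定阈值需在验证集上标定(此处仅为示意)"""
    return ((model(x) - x) ** 2).flatten(1).mean(dim=1)

model = TinyAE().eval()
scores = ood_score(model, torch.rand(8, 3, 64, 64))
is_ood = scores > 0.05        # 0.05 为假设的阈值
```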

[CV-44] Hierarchical Self-Prompting SAM: A Prompt-Free Medical Image Segmentation Framework

【速读】:该论文旨在解决Segment Anything Model (SAM)在医学图像分割中依赖手动提示(prompt)的问题,这一限制使其在医学成像领域的应用受到制约。论文提出的解决方案是Hierarchical Self-Prompting SAM (HSP-SAM),其关键在于引入了在自提示过程中学习抽象提示(abstract prompts)的机制,而非仅依赖位置提示(positional prompts)。这一创新使得模型能够在无提示的情况下实现强大的医学图像分割性能,并在多个任务和数据集上表现出优越的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2506.02854
作者: Mengmeng Zhang,Xingyuan Dai,Yicheng Sun,Jing Wang,Yueyang Yao,Xiaoyan Gong,Fuze Cong,Feiyue Wang,Yisheng Lv
机构: Institute of Automation, Chinese Academy of Science(中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Science(中国科学院大学人工智能学院); Department of Radiology, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences(北京协和医院放射科,北京协和医学院,中国医学科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although the Segment Anything Model (SAM) is highly effective in natural image segmentation, it depends on prompts, which limits its applicability to medical imaging where manual prompts are often unavailable. Existing efforts to fine-tune SAM for medical segmentation typically struggle to remove this dependency. We propose Hierarchical Self-Prompting SAM (HSP-SAM), a novel self-prompting framework that enables SAM to achieve strong performance in prompt-free medical image segmentation. Unlike previous self-prompting methods that remain limited to positional prompts similar to vanilla SAM, we are the first to introduce learning abstract prompts during the self-prompting process. This simple and intuitive self-prompting framework achieves superior performance on classic segmentation tasks such as polyp and skin lesion segmentation, while maintaining robustness across diverse medical imaging modalities. Furthermore, it exhibits strong generalization to unseen datasets, achieving improvements of up to 14.04% over previous state-of-the-art methods on some challenging benchmarks. These results suggest that abstract prompts encapsulate richer and higher-dimensional semantic information compared to positional prompts, thereby enhancing the model's robustness and generalization performance. All models and codes will be released upon acceptance.
zh

[CV-45] Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation

【速读】:该论文旨在解决在2D关节的空间约束下恢复3D姿态时,人体动作协调建模中的长程依赖关系建模问题。现有方法通过增加网络深度来学习非连接部位之间的依赖关系,但这种方法引入了不相关的噪声并增加了模型规模。该论文的关键解决方案是利用金字塔结构来更好地学习潜在的长程依赖关系,提出了一种新颖的金字塔图注意力(Pyramid Graph Attention, PGA)模块,通过跨尺度的方式捕捉关节和组之间的相关性,并结合图卷积模块构建了轻量级多尺度变换器架构——金字塔图变换器(Pyramid Graph Transformer, PGFormer)。

链接: https://arxiv.org/abs/2506.02853
作者: Mingjie Wei,Xuemei Xie,Yutong Zhong,Guangming Shi
机构: Xidian University (西安电子科技大学); Pazhou LAB (Huangpu) (琶洲实验室(黄埔)); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets. The code is available at this https URL.
zh
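
PGA 模块的要点是把多尺度(关节级与组级)特征拼成一条紧凑序列,再并行计算跨尺度相关性。下面给出一个简化示意(分组方式、组数等均为假设,实际应按人体拓扑划分子结构):

```python
import torch
import torch.nn as nn

class PyramidGraphAttention(nn.Module):
    """示意:关节级 token 与组级(池化)token 拼接后做自注意力,实现跨尺度交互"""
    def __init__(self, dim: int = 64, heads: int = 4, groups: int = 5):
        super().__init__()
        self.groups = groups
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, joints):                       # joints: (batch, J, dim)
        b, j, d = joints.shape
        # 将关节平均池化为 groups 个身体部位 token(假设 J 可整除且相邻关节同组)
        parts = joints.view(b, self.groups, j // self.groups, d).mean(dim=2)
        seq = torch.cat([joints, parts], dim=1)      # 多尺度信息拼成紧凑序列
        out, _ = self.attn(seq, seq, seq)            # 并行计算尺度内与尺度间相关性
        return out[:, :j]                            # 取回关节级特征

x = torch.randn(2, 15, 64)                  # 15 个关节,分成 5 组
y = PyramidGraphAttention()(x)              # (2, 15, 64)
```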

[CV-46] METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding

【速读】:该论文旨在解决视频大语言模型(Video Large Language Models, VLLMs)在处理长视频时面临的高计算需求和视觉数据冗余问题。其解决方案的关键在于提出METok,一个无需训练的多阶段基于事件的令牌压缩框架,通过三个关键阶段逐步消除冗余视觉令牌:(1)视觉编码阶段的事件感知压缩,(2)预填充阶段基于语义对齐和事件重要性的分层令牌剪枝,以及(3)解码阶段的键值缓存(KV Cache)优化,从而在保持准确性的同时显著提升推理效率。

链接: https://arxiv.org/abs/2506.02850
作者: Mengyue Wang,Shuo Chen,Kristian Kersting,Volker Tresp,Yunpu Ma
机构: Technical University of Munich (慕尼黑工业大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); DFKI SAINT (德国人工智能研究中心SAINT实验室); Hessian AI (黑森人工智能); TU Darmstadt (达姆施塔特工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs’ inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.
zh
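
METok 在预填充阶段依据"语义对齐 + 事件重要性"分层剪枝视觉令牌。下面以"按与文本的跨模态相似度打分并保留 top-k 令牌"给出一个无训练示意(打分方式与保留比例均为本文假设):

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(vis: torch.Tensor, txt: torch.Tensor, keep_ratio: float = 0.3):
    """vis: (T, D) 视觉令牌;txt: (L, D) 文本令牌。按跨模态相似度保留重要令牌。"""
    vis_n = F.normalize(vis, dim=-1)
    txt_n = F.normalize(txt, dim=-1)
    scores = (vis_n @ txt_n.T).max(dim=1).values     # 每个视觉令牌对文本的最大对齐度
    k = max(1, int(vis.size(0) * keep_ratio))
    idx = scores.topk(k).indices.sort().values       # 保留原有顺序,维持时间结构
    return vis[idx]

pruned = prune_visual_tokens(torch.randn(1024, 256), torch.randn(32, 256))
print(pruned.shape)    # 仅约 30% 的令牌进入预填充,从而降低 FLOPs 与 KV Cache 占用
```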

[CV-47] PBR-SR: Mesh PBR Texture Super Resolution from 2D Image Priors CEC

【速读】:该论文旨在解决物理基础渲染(PBR)纹理超分辨率(SR)的问题,即从低分辨率(LR)的PBR输入中生成高质量、高分辨率的PBR纹理。其解决方案的关键在于利用预训练的自然图像超分辨率模型,并通过迭代优化减少超分辨率先验与可微分渲染之间的偏差,同时在多视角渲染中应用2D先验约束以缓解视图不一致和光照敏感性问题,并在PBR纹理域中引入身份约束以保证上采样纹理对原始输入的忠实度。该方法无需额外训练或数据,完全依赖预训练图像先验,实现了高效的PBR纹理超分辨率。

链接: https://arxiv.org/abs/2506.02846
作者: Yujin Chen,Yinyu Nie,Benjamin Ummenhofer,Reiner Birkl,Michael Paulitsch,Matthias Nießner
机构: Technical University of Munich (慕尼黑工业大学); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL , Video: this https URL

点击查看摘要

Abstract:We present PBR-SR, a novel method for physically based rendering (PBR) texture super resolution (SR). It outputs high-resolution, high-quality PBR textures from low-resolution (LR) PBR input in a zero-shot manner. PBR-SR leverages an off-the-shelf super-resolution model trained on natural images, and iteratively minimizes the deviations between super-resolution priors and differentiable renderings. These enhancements are then back-projected into the PBR map space in a differentiable manner to produce refined, high-resolution textures. To mitigate view inconsistencies and lighting sensitivity, which is common in view-based super-resolution, our method applies 2D prior constraints across multi-view renderings, iteratively refining the shared, upscaled textures. In parallel, we incorporate identity constraints directly in the PBR texture domain to ensure the upscaled textures remain faithful to the LR input. PBR-SR operates without any additional training or data requirements, relying entirely on pretrained image priors. We demonstrate that our approach produces high-fidelity PBR textures for both artist-designed and AI-generated meshes, outperforming both direct SR models application and prior texture optimization methods. Our results show high-quality outputs in both PBR and rendering evaluations, supporting advanced applications such as relighting.
zh

[CV-48] Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments NEURIPS2025

【速读】:该论文试图解决在微重力环境下视频理解的领域鲁棒性问题,因为现有数据集主要针对地球重力条件,而微重力环境会改变人类运动、交互和视觉语义,导致现实世界视觉系统存在显著差距。解决方案的关键是引入MicroG-4M,这是首个针对微重力环境中人类活动时空与语义理解的基准数据集,其包含来自真实太空任务和电影模拟的数据,支持细粒度多标签动作识别、时间视频描述生成和视觉问答三个核心任务,以全面评估微重力场景下的空间定位与语义推理能力。

链接: https://arxiv.org/abs/2506.02845
作者: Di Wen,Lei Qi,Kunyu Peng,Kailun Yang,Fei Teng,Ao Luo,Jia Fu,Yufan Chen,Ruiping Liu,Yitian Shi,M. Saquib Sarfraz,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Hunan University (湖南大学); Waseda University (早稻田大学); KTH Royal Institute of Technology (瑞典皇家理工学院); RISE Research Institutes of Sweden (瑞典里瑟研究机构); Mercedes-Benz Tech Innovation (梅赛德斯-奔驰技术创新)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, submitted to NeurIPS 2025

点击查看摘要

Abstract:Despite substantial progress in video understanding, most existing datasets are limited to Earth’s gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at this https URL.
zh

[CV-49] Random Registers for Cross-Domain Few-Shot Learning ICML2025

【速读】:该论文旨在解决跨域小样本学习(Cross-domain Few-shot Learning, CDFSL)中模型迁移能力不足的问题,尤其是在源域数据充足而目标域数据稀缺的情况下,Vision Transformer (ViT) 的迁移性能仍存在较大提升空间。论文的关键解决方案是发现并利用随机寄存器(random registers)替代可学习提示(learnable prompts)来增强模型在目标域中的泛化能力,通过引入随机噪声扰动注意力机制,降低损失函数的尖锐度,从而提升模型的迁移性。这一方法在四个基准数据集上得到了验证,表现出优越的性能。

链接: https://arxiv.org/abs/2506.02843
作者: Shuai Yi,Yixiong Zou,Yuhua Li,Ruixuan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during the source-domain training, prompt tuning, as a common way to train ViT, could be harmful to the generalization of ViT in target domains, but setting the prompts to random noises (i.e., random registers) could consistently improve target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during the training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscapes. In contrast, random registers are essentially a novel way of perturbing attention for sharpness-aware minimization, which helps the model find a flattened minimum in loss landscapes, increasing the transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL to enhance the perturbation on attention maps by adding random registers on the semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Codes and models are available at this https URL.
zh
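
"随机寄存器"的做法可以概括为:在源域训练时向 ViT 的 token 序列拼接若干不可学习的随机噪声 token,以扰动注意力、避免可学习提示记住源域模式。下面是一个最小示意(寄存器数量与噪声尺度为假设;论文进一步将寄存器加在图像 token 的语义区域上):

```python
import torch

def add_random_registers(tokens: torch.Tensor, num_registers: int = 4, std: float = 1.0):
    """tokens: (B, N, D)。每个 batch 重新采样随机寄存器,仅扰动注意力、不携带域信息。"""
    b, _, d = tokens.shape
    regs = torch.randn(b, num_registers, d, device=tokens.device) * std
    return torch.cat([tokens, regs], dim=1)   # 送入 ViT;输出时丢弃寄存器位置即可

x = torch.randn(8, 197, 768)                  # ViT-B/16 的标准 token 序列(假设)
print(add_random_registers(x).shape)          # torch.Size([8, 201, 768])
```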

[CV-50] PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis KR

【速读】:该论文旨在解决物理感知动态新视角合成(Physics-aware Dynamic Novel View Synthesis, DyNVS)领域中缺乏高质量数据集的问题。现有数据集主要关注于照片级真实感重建,而未能有效支持物理规律驱动的动态场景建模。论文提出的PhysGaia数据集通过引入包含结构化物体和非结构化物理现象的复杂动态场景,以及多种物理材料(如液体、气体、粘弹性物质和纺织品)的交互行为,解决了这一问题。其关键在于严格遵循物理定律生成场景,并提供包括3D粒子轨迹和物理参数在内的真实标注信息,从而为物理建模的定量评估和先进DyNVS模型的集成应用提供了基础。

链接: https://arxiv.org/abs/2506.02794
作者: Mijeong Kim,Gunhee Kim,Jungyoon Choi,Wonjae Roh,Bohyung Han
机构: Seoul National University (首尔大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this http URL , Data: this https URL

点击查看摘要

Abstract:We introduce PhysGaia, a novel physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS), encompassing both structured objects and unstructured physical phenomena. Unlike existing datasets that primarily focus on photorealistic reconstruction, PhysGaia is created to actively support physics-aware dynamic scene modeling. Our dataset provides complex dynamic scenarios with rich interactions among multiple objects, where they realistically collide with each other and exchange forces. Furthermore, it contains a diverse range of physical materials, such as liquid, gas, viscoelastic substances, and textiles, moving beyond the rigid bodies prevalent in existing datasets. All scenes in PhysGaia are faithfully generated to strictly adhere to physical laws, leveraging carefully selected material-specific physics solvers. To enable quantitative evaluation of physical modeling, our dataset provides essential ground-truth information, including 3D particle trajectories and physics parameters, e.g., viscosity. To facilitate research adoption, we also provide essential integration pipelines for using state-of-the-art DyNVS models with our dataset and report their results. By addressing the critical lack of datasets for physics-aware modeling, PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation, ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes. Our datasets and code are available on the project website: this http URL.
zh

[CV-51] Automated Measurement of Optic Nerve Sheath Diameter Using Ocular Ultrasound Video

【速读】:该论文旨在解决传统手动测量视神经鞘直径(Optic Nerve Sheath Diameter, ONSD)过程中对操作者经验和技能的高度依赖问题,尤其是在从超声视频序列中选择最佳帧和进行ONSD测量时。其解决方案的关键在于采用核相关滤波(Kernel Correlation Filter, KCF)跟踪算法自动识别最佳帧,并结合简单线性迭代聚类(Simple Linear Iterative Clustering, SLIC)分割算法对视神经鞘进行定位与测量,同时利用高斯混合模型(Gaussian Mixture Model, GMM)与KL散度方法提升测量精度。

链接: https://arxiv.org/abs/2506.02789
作者: Renxing Li,Weiyi Tang,Peiqi Li,Qiming Huang,Jiayuan She,Shengkai Li,Haoran Xu,Yeyun Wan,Jing Liu,Hailong Fu,Xiang Li,Jiangang Chen
机构: Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University (上海市多维信息处理重点实验室,华东师范大学); Fudan University (复旦大学); Xi’an Jiaotong-Liverpool University (西交利物浦大学); School of Artificial Intelligence and Advanced Computing, Xi’an Jiaotong-Liverpool University (Taicang Campus) (人工智能与先进计算学院,西交利物浦大学(太仓校区)); Northeastern University (东北大学); School of Advanced Technology, Xi’an Jiaotong-Liverpool University (高级技术学院,西交利物浦大学); Capital Medical University (首都医科大学); Naval Medical University (海军医学大学); Beijing Obstetrics and Gynecology Hospital (北京妇产医院); Changzheng Hospital, Second Affiliated Hospital of Naval Medical University (长征医院,海军医学大学附属第二医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Objective. Elevated intracranial pressure (ICP) is recognized as a biomarker of secondary brain injury, with a significant linear correlation observed between optic nerve sheath diameter (ONSD) and ICP. Frequent monitoring of ONSD could effectively support dynamic evaluation of ICP. However, ONSD measurement is heavily reliant on the operator’s experience and skill, particularly in manually selecting the optimal frame from ultrasound sequences and measuring ONSD. Approach. This paper presents a novel method to automatically identify the optimal frame from video sequences for ONSD measurement by employing the Kernel Correlation Filter (KCF) tracking algorithm and Simple Linear Iterative Clustering (SLIC) segmentation algorithm. The optic nerve sheath is mapped and measured using a Gaussian Mixture Model (GMM) combined with a KL-divergence-based method. Results. When compared with the average measurements of two expert clinicians, the proposed method achieved a mean error, mean squared deviation, and intraclass correlation coefficient (ICC) of 0.04, 0.054, and 0.782, respectively. Significance. The findings suggest that this method provides highly accurate automated ONSD measurements, showing potential for clinical application.
zh

[CV-52] SAMJ: Fast Image Annotation on ImageJ/Fiji via Segment Anything Model

【速读】:该论文试图解决生物医学图像分析中掩码标注(mask annotation)的瓶颈问题,该过程因耗时耗力而限制了AI技术的广泛应用。其解决方案的关键在于引入SAMJ,这是一个基于Segment Anything Model (SAM) 的用户友好的ImageJ/Fiji插件,通过一键安装即可在普通计算机上实现无缝、交互式的标注,从而简化并加速标记图像数据集的创建。

链接: https://arxiv.org/abs/2506.02783
作者: Carlos Garcia-Lopez-de-Haro,Caterina Fuster-Barcelo,Curtis T. Rueden,Jonathan Heras,Vladimir Ulman,Daniel Franco-Barranco,Adrian Ines,Kevin W. Eliceiri,Jean-Christophe Olivo-Marin,Jean-Yves Tinevez,Daniel Sage,Arrate Munoz-Barrutia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mask annotation remains a significant bottleneck in AI-driven biomedical image analysis due to its labor-intensive nature. To address this challenge, we introduce SAMJ, a user-friendly ImageJ/Fiji plugin leveraging the Segment Anything Model (SAM). SAMJ enables seamless, interactive annotations with one-click installation on standard computers. Designed for real-time object delineation in large scientific images, SAMJ is an easy-to-use solution that simplifies and accelerates the creation of labeled image datasets.
zh

[CV-53] FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts CVPR2025

【速读】:该论文旨在解决3D室内场景生成中的可控性问题,现有方法要么仅提供粗粒度的语言控制,虽便捷但缺乏精细定制能力,要么依赖复杂的图结构控制,虽具备较高可控性但需要用户具备专业知识。其解决方案的关键在于提出FreeScene框架,该框架通过基于视觉语言模型(VLM)的Graph Designer将自由形式的用户输入(如文本描述和参考图像)转化为图表示,并引入MG-DiT(Mixed Graph Diffusion Transformer)模型,实现图感知的去噪过程,从而在单一模型中支持多种任务,如文本到场景、图到场景及重排等,显著提升了生成质量和可控性。

链接: https://arxiv.org/abs/2506.02781
作者: Tongyuan Bai,Wangyuanfan Bai,Dong Chen,Tieru Wu,Manyi Li,Rui Ma
机构: School of Artificial Intelligence, Jilin University (吉林大学人工智能学院); School of Software, Shandong University (山东大学软件学院); Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China (知识驱动人机智能工程研究中心,教育部,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs including text description and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability in a range of applications.
zh

[CV-54] A Dynamic Transformer Network for Vehicle Detection

【速读】:该论文旨在解决现有基于深度网络的车辆检测方法在不同光照和遮挡条件下性能受限的问题。其解决方案的关键在于提出一种动态Transformer网络(DTNet),通过动态卷积引导深度网络动态生成权重,以提高检测器的适应性;同时引入结合通道注意力和Transformer的混合注意力机制,以增强通道与像素之间的关系,提取更具判别性的特征;此外,还采用依赖空间位置信息的平移不变卷积来优化结构信息,从而提升车辆检测的准确性。

链接: https://arxiv.org/abs/2506.02765
作者: Chunwei Tian,Kai Liu,Bob Zhang,Zhixiang Huang,Chia-Wen Lin,David Zhang
机构: Harbin Institute of Technology(哈尔滨工业大学); University of Macau(澳门大学); Anhui University(安徽大学); National Tsing Hua University(台湾清华大学); The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures. This paper has been accepted for publication in IEEE Transactions on Consumer Electronics

点击查看摘要

Abstract:Stable consumer electronic systems can better assist traffic. Good traffic-oriented consumer electronic systems require collaborative work between traffic algorithms and hardware. However, the performance of popular traffic algorithms containing deep-network-based vehicle detection methods is limited, because these methods learn relations within the data rather than the differences caused by varying lighting and occlusions. In this paper, we present a dynamic Transformer network for vehicle detection (DTNet). DTNet utilizes a dynamic convolution to guide a deep network to dynamically generate weights, enhancing the adaptability of the obtained detector. Taking relations of different information into account, a mixed attention mechanism based on channel attention and a Transformer is exploited to strengthen the relations of channels and pixels and extract more salient information for vehicle detection. To account for differences within an image, a translation-variant convolution relies on spatial location information to refine the obtained structural information for vehicle detection. Experimental results illustrate that our DTNet is competitive for vehicle detection. Code of the proposed DTNet can be obtained at this https URL.
zh
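
DTNet 中的动态卷积依据输入动态生成卷积权重。下面用 CondConv 风格的"专家核加权组合"给出一个简化实现以说明这一思想(专家数量与路由结构均为假设,并非论文原结构):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """按输入生成 K 个专家卷积核的组合系数,再用组合后的核做逐样本卷积"""
    def __init__(self, cin: int, cout: int, k: int = 3, experts: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(experts, cout, cin, k, k) * 0.02)
        self.route = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(cin, experts), nn.Softmax(dim=-1))
        self.pad = k // 2

    def forward(self, x):                           # x: (B, cin, H, W)
        coeff = self.route(x)                       # (B, experts) 输入感知的组合系数
        w = torch.einsum('be,eoikl->boikl', coeff, self.weight)   # 每个样本一组核
        b, _, h, wd = x.shape
        out = F.conv2d(x.reshape(1, -1, h, wd), w.flatten(0, 1),
                       padding=self.pad, groups=b)  # 用分组卷积实现逐样本动态核
        return out.view(b, -1, h, wd)

y = DynamicConv2d(16, 32)(torch.randn(2, 16, 40, 40))   # (2, 32, 40, 40)
```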

[CV-55] Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations

【速读】:该论文试图解决自由观看(free-viewing)和任务驱动的视觉搜索(task-specific visual search)中人类注意力建模是否可以共享共同表示的问题。其解决方案的关键在于提出一种基于Human Attention transformer (HAT) 的神经网络架构,验证了两种场景下注意力机制可以共享同一表征,并通过实验表明,利用自由观看训练的模型在视觉搜索任务中仅出现3.86%的性能下降,同时显著降低了计算成本。

链接: https://arxiv.org/abs/2506.02764
作者: Fatma Youssef Mohammed,Kostas Alexis
机构: Norwegian University of Science and Technology (挪威科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to the 2025 IEEE International Conference on Development and Learning (ICDL)

点击查看摘要

Abstract:Computational human attention modeling in free-viewing and task-specific settings is often studied separately, with limited exploration of whether a common representation exists between them. This work investigates this question and proposes a neural network architecture that builds upon the Human Attention transformer (HAT) to test the hypothesis. Our results demonstrate that free-viewing and visual search can efficiently share a common representation, allowing a model trained on free-viewing attention to transfer its knowledge to task-driven visual search with a performance drop of only 3.86% in the predicted fixation scanpaths, measured by the semantic sequence score (SemSS) metric, which reflects the similarity between predicted and human scanpaths. This transfer reduces computational costs by 92.29% in terms of GFLOPs and 31.23% in terms of trainable parameters.
zh

[CV-56] RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS)在处理受瞬时物体影响的场景时,因高斯密度增加过程导致的渲染图像中出现伪影的问题。其关键解决方案是提出RobustSplat,包含两个核心设计:一是延迟高斯增长策略,优先优化静态场景结构后再进行高斯分裂/克隆,以减少对瞬时物体的过拟合;二是尺度级联掩码自举方法,通过低分辨率特征相似性监督实现可靠的初始瞬时掩码估计,并逐步过渡到高分辨率监督以提升掩码预测精度。

链接: https://arxiv.org/abs/2506.02751
作者: Chuanyu Fu,Yuqi Zhang,Kunbin Yao,Guanying Chen,Yuan Xiong,Chuan Huang,Shuguang Cui,Xiaochun Cao
机构: Sun Yat-sen University (中山大学); FNii-Shenzhen (深圳鹏城实验室); SSE, CUHKSZ (深圳技术大学理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method. Our project page is this https URL.
zh

[CV-57] VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians ICML2025

【速读】:该论文旨在解决从RGBD图像中联合估计相机位姿和场景映射时在大规模场景下的可扩展性问题,现有方法由于需要在有限的GPU内存中优化所有3D高斯 (3D Gaussians),导致无法有效处理极端大规模场景。其解决方案的关键在于提出了一种新的3D表示方式——视图绑定3D高斯 (view-tied 3D Gaussians),该表示通过将高斯与深度像素绑定,无需学习位置、旋转及多维方差,从而显著减少存储需求,并允许在有限内存中使用更多高斯来表达局部细节。同时,所提出的跟踪与映射策略无需在整个训练过程中保持所有高斯可学习,提升了渲染质量和定位精度。

链接: https://arxiv.org/abs/2506.02741
作者: Pengchong Hu,Zhizhong Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025

点击查看摘要

Abstract:Jointly estimating camera poses and mapping scenes from RGBD images is a fundamental task in simultaneous localization and mapping (SLAM). State-of-the-art methods employ 3D Gaussians to represent a scene, and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies that need to optimize all 3D Gaussians in the limited GPU memory throughout training to maintain geometry and color consistency with previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies to work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are a simplified kind of Gaussians tied to depth pixels, without the need to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details in the limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout training, while improving rendering quality and tracking accuracy. We justify the effectiveness of these designs, and report better performance over the latest methods on the widely used benchmarks in terms of rendering and tracking accuracy and scalability. Please see our project page for code and videos at this https URL .
zh

[CV-58] Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

【速读】:该论文试图解决生物医学文献中复合图(compound figures)的子图(subfigure)大规模提取问题,以及通过高保真图像-文本对齐提升视觉-语言模型表示学习的效果这一关键科学问题。解决方案的关键在于引入一种基于Transformer的目标检测框架,构建了一个包含50万张复合图的合成语料库进行训练,并在此基础上实现了在ImageCLEF 2016和合成基准上的最先进性能,从而为后续的视觉-语言模型训练提供了高质量的数据支持。

链接: https://arxiv.org/abs/2506.02738
作者: Negin Baghbanzadeh,Sajad Ashkezari,Elham Dolatabadi,Arash Afkanpour
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image-text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline based on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, and achieving state-of-the-art performance on both ImageCLEF 2016 and synthetic benchmarks. Using this pipeline, we release OPEN-PMC-18M, a large-scale high quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure-caption pairs spanning radiology, microscopy, and visible light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.
zh

[CV-59] GeneA-SLAM2: Dynamic SLAM with AutoEncoder-Preprocessed Genetic Keypoints Resampling and Depth Variance-Guided Dynamic Region Removal

【速读】:该论文旨在解决动态环境中语义SLAM系统在识别动态区域时存在的不足,特别是在高度动态场景下,传统的基于目标检测或语义分割的方法无法完全覆盖所有动态区域。其解决方案的关键在于利用深度方差约束来提取动态像素,并生成精确的深度掩码以指导动态物体的移除,同时通过自编码器重构关键点,改进遗传重采样关键点算法,从而获得更均匀分布的关键点并提升位姿估计的准确性。

链接: https://arxiv.org/abs/2506.02736
作者: Shufan Qing,Anzhen Li,Qiandi Wang,Yuefeng Niu,Mingchen Feng,Guoliang Hu,Jinqiao Wu,Fengtao Nan,Yingchun Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Existing semantic SLAM in dynamic environments mainly identify dynamic regions through object detection or semantic segmentation methods. However, in certain highly dynamic scenarios, the detection boxes or segmentation masks cannot fully cover dynamic regions. Therefore, this paper proposes a robust and efficient GeneA-SLAM2 system that leverages depth variance constraints to handle dynamic scenes. Our method extracts dynamic pixels via depth variance and creates precise depth masks to guide the removal of dynamic objects. Simultaneously, an autoencoder is used to reconstruct keypoints, improving the genetic resampling keypoint algorithm to obtain more uniformly distributed keypoints and enhance the accuracy of pose estimation. Our system was evaluated on multiple highly dynamic sequences. The results demonstrate that GeneA-SLAM2 maintains high accuracy in dynamic scenes compared to current methods. Code is available at: this https URL.
zh
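
其中的"深度方差约束"可以理解为:在局部窗口内统计深度方差,方差显著偏高的像素多对应动态区域或运动边界。以下为一个简化示意(窗口大小与阈值为假设):

```python
import torch
import torch.nn.functional as F

def dynamic_mask_from_depth(depth: torch.Tensor, win: int = 7, thresh: float = 0.05):
    """depth: (1, 1, H, W)。局部方差 Var = E[x^2] - (E[x])^2,超阈值处记为动态像素。"""
    mean = F.avg_pool2d(depth, win, stride=1, padding=win // 2)
    mean_sq = F.avg_pool2d(depth ** 2, win, stride=1, padding=win // 2)
    var = (mean_sq - mean ** 2).clamp_min(0)
    return var > thresh           # 动态掩码,用于剔除落在其上的特征点

d = torch.rand(1, 1, 120, 160)
mask = dynamic_mask_from_depth(d)
print(mask.float().mean())        # 动态区域占比
```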

[CV-60] LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering

【速读】:该论文试图解决现有光流数据集主要针对现实世界模拟或合成人类运动,而缺乏针对Celluloid(cel)动画角色运动的专门数据集的问题,这一领域具有独特的视觉和运动特征。解决方案的关键在于引入LinkTo-Anime,这是首个专为cel动画角色运动设计的高质量数据集,通过3D模型渲染生成,并提供包括前向和后向光流、遮挡掩码以及Mixamo骨骼的丰富标注信息,以促进光流估计及相关下游任务的研究。

链接: https://arxiv.org/abs/2506.02733
作者: Xiaoyi Feng,Kaifeng Zou,Caichun Cen,Tao Huang,Hui Guo,Zizhou Huang,Yingli Zhao,Mingqing Zhang,Diwei Wang,Yuntao Zou,Dagang Li
机构: Macau University of Science and Technology (澳门科技大学); The Chinese University of Hong Kong (香港中文大学); Wuzhou University (梧州大学); Université de Strasbourg (斯特拉斯堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing optical flow datasets focus primarily on real-world simulation or synthetic human motion, but few are tailored to Celluloid (cel) anime character motion: a domain with unique visual and motion characteristics. To bridge this gap and facilitate research in optical flow estimation and downstream tasks such as anime video generation and line drawing colorization, we introduce LinkTo-Anime, the first high-quality dataset specifically designed for cel anime character motion generated with 3D model rendering. LinkTo-Anime provides rich annotations including forward and backward optical flow, occlusion masks, and Mixamo skeletons. The dataset comprises 395 video sequences, with 24,230 training frames, 720 validation frames, and 4,320 test frames in total. Furthermore, a comprehensive benchmark is constructed with various optical flow estimation methods to analyze their shortcomings and limitations across multiple datasets.
zh

[CV-61] oothForge: Automatic Dental Shape Generation using Synchronized Spectral Embeddings

【速读】:该论文试图解决牙科形状数据集稀疏性问题,以及传统方法在生成三维牙齿模型时因形状谱分解不稳定性带来的偏差。其解决方案的关键在于提出在同步频域嵌入上建模潜在流形,通过将所有数据样本的谱对齐到一个公共基底,有效消除了分解不稳定性引入的偏差,并摆脱了以往方法中要求所有形状共享固定连接性的限制。

链接: https://arxiv.org/abs/2506.02702
作者: Tibor Kubík,François Guibault,Michal Španěl,Hervé Lombaert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Information Processing in Medical Imaging (IPMI2025)

点击查看摘要

Abstract:We introduce ToothForge, a spectral approach for automatically generating novel 3D teeth, effectively addressing the sparsity of dental shape datasets. By operating in the spectral domain, our method enables compact machine learning modeling, allowing the generation of high-resolution tooth meshes in milliseconds. However, generating shape spectra comes with the instability of the decomposed harmonics. To address this, we propose modeling the latent manifold on synchronized frequential embeddings. Spectra of all data samples are aligned to a common basis prior to the training procedure, effectively eliminating biases introduced by the decomposition instability. Furthermore, synchronized modeling removes the limiting factor imposed by previous methods, which require all shapes to share a common fixed connectivity. Using a private dataset of real dental crowns, we observe a greater reconstruction quality of the synthesized shapes, exceeding those of models trained on unaligned embeddings. We also explore additional applications of spectral analysis in digital dentistry, such as shape compression and interpolation. ToothForge facilitates a range of approaches at the intersection of spectral analysis and machine learning, with fewer restrictions on mesh structure. This makes it applicable for shape analysis not only in dentistry, but also in broader medical applications, where guaranteeing consistent connectivity across shapes from various clinics is unrealistic. The code is available at this https URL.
zh

[CV-62] Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences ICML2025

【速读】:该论文旨在解决文本到图像生成模型在对齐人类偏好时存在的个体偏好差异未被充分建模的问题,即现有方法通常假设偏好是二值分布,而忽略了偏好在不同个体间的多样性和细微差别。其解决方案的关键在于提出SmPO-Diffusion方法,通过引入平滑的偏好分布替代原始二值分布,并利用奖励模型模拟人类偏好,结合偏好似然平均化改进DPO损失函数,同时采用逆向技术模拟扩散模型的轨迹偏好分布,从而更精确地对齐优化目标,提升模型性能。

链接: https://arxiv.org/abs/2506.02698
作者: Yunhong Lu,Qichao Wang,Hengyuan Cao,Xiaoyin Xu,Min Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: preferences vary across individuals and should be represented with more granularity. To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Our SmPO-Diffusion achieves state-of-the-art performance in preference evaluation, outperforming baselines across metrics with lower training costs. The project page is this https URL.
zh
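
SmPO 的关键一步是把 DPO 的二值偏好目标替换为奖励模型估计的平滑偏好概率 p∈[0,1],使偏好相近(p≈0.5)时梯度趋近于零。以下给出该损失的一个示意实现(与扩散轨迹的结合从略,p 的来源与损失的具体形式为本文假设):

```python
import torch
import torch.nn.functional as F

def smoothed_dpo_loss(logp_w, logp_l, ref_w, ref_l, p, beta: float = 0.1):
    """logp_*: 策略模型对 winner/loser 的对数似然;ref_*: 参考模型;p: 平滑偏好概率"""
    h = beta * ((logp_w - ref_w) - (logp_l - ref_l))   # 隐式奖励差
    # p=1 时退化为标准 DPO;p=0.5(偏好相近)时在 h=0 处梯度为 0
    return -(p * F.logsigmoid(h) + (1 - p) * F.logsigmoid(-h)).mean()

loss = smoothed_dpo_loss(torch.tensor([-1.0]), torch.tensor([-1.2]),
                         torch.tensor([-1.1]), torch.tensor([-1.1]),
                         p=torch.tensor([0.65]))       # p 由奖励模型估计(假设)
```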

[CV-63] LayoutRAG : Retrieval-Augmented Model for Content-agnostic Conditional Layout Generation

【速读】:该论文旨在解决可控版面生成(controllable layout generation)中在给定条件下生成最优视觉布局的问题,即根据特定组件类型或位置等约束条件,生成合理且高质量的元素边界框排列。其解决方案的关键在于通过条件检索和参考引导生成相结合的方式,首先根据给定条件检索合适的版面模板作为参考,再利用这些参考指导去噪或基于流的生成过程,从而更有效地捕捉隐含于条件中的潜在信息,相较于以往直接将条件输入模型让其推断未提供布局属性的方法更具优势。同时,设计了条件调制注意力机制,以选择性地吸收检索知识并适应检索模板与给定条件之间的差异。

链接: https://arxiv.org/abs/2506.02697
作者: Yuxuan Wu,Le Wang,Sanping Zhou,Mengnan Liu,Gang Hua,Haoxiang Li
机构: Xi’an Jiaotong University (西安交通大学); Amazon.com, Inc. (亚马逊公司); Pixocial Technology (皮克斯社交科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Controllable layout generation aims to create plausible visual arrangements of element bounding boxes within a graphic design according to certain optional constraints, such as the type or position of a specific component. While recent diffusion or flow-matching models have achieved considerable advances in multifarious conditional generation tasks, there remains considerable room for generating optimal arrangements under given conditions. In this work, we propose to carry out layout generation through condition-based retrieval and reference-guided generation. Specifically, we retrieve appropriate layout templates according to given conditions as references. The references are then utilized to guide the denoising or flow-based transport process. By retrieving layouts compatible with the given conditions, we can uncover the potential information not explicitly provided in the given condition. Such an approach offers more effective guidance to the model during the generation process, in contrast to previous models that feed the condition to the model and let the model infer the unprovided layout attributes directly. Meanwhile, we design a condition-modulated attention that selectively absorbs retrieval knowledge, adapting to the difference between retrieved templates and given conditions. Extensive experiment results show that our method successfully produces high-quality layouts that meet the given conditions and outperforms existing state-of-the-art models. Code will be released upon acceptance.
zh

[CV-64] FaceSleuth: Learning-Driven Single-Orientation Attention Verifies Vertical Dominance in Micro-Expression Recognition

【速读】:该论文旨在解决微表情识别(Micro-expression Recognition, MER)中模型难以放大毫秒级、低幅度面部运动并抑制个体特异性外观的问题。其解决方案的关键在于提出一种双流架构FaceSleuth,该架构通过连续垂直注意力(Continuously Vertical Attention, CVA)模块增强经验上主导的垂直轴运动,利用基于分层跨窗口注意力的面部位置聚焦器定位信号,并通过轻量级动作单元嵌入引导特征学习至生理上有意义的区域。此外,为验证手动选择的垂直轴是否最优,进一步提出了单方向注意力(Single-Orientation Attention, SOA)模块,该模块可端到端学习池化方向,具有微小参数增加且在角度收敛时退化为CVA,实验证明其有效提升了识别性能。

链接: https://arxiv.org/abs/2506.02695
作者: Linquan Wu,Tianxiang Jiang,Wenhao Duan,Yini Fang,Jacky Keung
机构: City University of Hong Kong(香港城市大学); University of Science and Technology of China(中国科学技术大学); Ocean University of China(中国海洋大学); Hong Kong University Science and Technology(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Micro-expression recognition (MER) demands models that can amplify millisecond-level, low-amplitude facial motions while suppressing identity-specific appearance. We introduce FaceSleuth, a dual-stream architecture that (1) enhances motion along the empirically dominant vertical axis through a Continuously Vertical Attention (CVA) block, (2) localises the resulting signals with a Facial Position Focalizer built on hierarchical cross-window attention, and (3) steers feature learning toward physiologically meaningful regions via lightweight Action-Unit embeddings. To examine whether the hand-chosen vertical axis is indeed optimal, we further propose a Single-Orientation Attention (SOA) module that learns its own pooling direction end-to-end. SOA is differentiable, adds only 0.16% parameters, and collapses to CVA when the learned angle converges to π/2. In practice, SOA reliably drifts to 88°, confirming the effectiveness of the vertical prior while delivering consistent gains. On three standard MER benchmarks, FaceSleuth with CVA already surpasses previous state-of-the-art methods; plugging in SOA lifts accuracy and F1 score to 95.1% / 0.918 on CASME II, 87.1% / 0.840 on SAMM, and 92.9% / 0.917 on MMEW without sacrificing model compactness. These results establish a new state of the art and, for the first time, provide empirical evidence that the vertical attention bias is the most discriminative orientation for MER.
zh
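
SOA 模块把"沿哪个方向聚合"作为可学习参数 θ 端到端学习,θ 收敛到 π/2 时退化为垂直注意力(CVA)。下面给出一个可微的方向加权池化示意,仅用于说明"方向可学习"这一点,具体实现与论文可能不同:

```python
import math
import torch
import torch.nn as nn

class SingleOrientationAttention(nn.Module):
    """沿可学习方向 (cosθ, sinθ) 对空间位置做软加权池化;θ→π/2 时偏向垂直轴"""
    def __init__(self, tau: float = 0.1):
        super().__init__()
        self.theta = nn.Parameter(torch.tensor(math.pi / 4))  # 初始 45°,端到端学习
        self.tau = tau

    def forward(self, feat):                       # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        ys = torch.linspace(-1, 1, h, device=feat.device).view(h, 1)
        xs = torch.linspace(-1, 1, w, device=feat.device).view(1, w)
        proj = xs * torch.cos(self.theta) + ys * torch.sin(self.theta)   # (H, W)
        attn = torch.softmax((proj / self.tau).flatten(), dim=0).view(1, 1, h, w)
        return (feat * attn).sum(dim=(2, 3))       # (B, C) 方向加权的全局特征

pooled = SingleOrientationAttention()(torch.randn(2, 32, 28, 28))   # (2, 32)
```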

[CV-65] Large-scale Self-supervised Video Foundation Model for Intelligent Surgery

【速读】:该论文旨在解决现有基于AI的计算机辅助干预(Computer-Assisted Intervention, CAI)方法在术中动态场景理解上的不足,特别是由于预训练过程中缺乏显式的时序建模导致的时空理解不完整问题。其解决方案的关键在于提出SurgVISTA框架,该框架通过视频级别的联合时空表征学习,从大规模手术视频数据中捕捉复杂的空间结构和时间动态,并结合基于手术领域专家引导的图像级知识蒸馏,以增强对细粒度解剖和语义特征的学习能力。

链接: https://arxiv.org/abs/2506.02692
作者: Shu Yang,Fengtao Zhou,Leon Mayer,Fuxiang Huang,Yiliang Chen,Yihui Wang,Sunan He,Yuxiang Nie,Xi Wang,Ömer Sümer,Yueming Jin,Huihui Sun,Shuchang Xu,Alex Qinyang Liu,Zheng Li,Jing Qin,Jeremy YuenChun Teoh,Lena Maier-Hein,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that captures intricate spatial structures and temporal dynamics through joint spatiotemporal modeling. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, demonstrating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.
zh

[CV-66] owards Geometry Problem Solving in the Large Model Era: A Survey

【速读】:该论文试图解决几何问题求解(Geometry Problem Solving, GPS)在人工智能领域的自动化难题,其核心挑战在于同时实现空间理解与严格逻辑推理的双重需求。解决方案的关键在于通过三个核心维度进行系统性综述:基准构建、文本与图示解析以及推理范式,并提出一种统一的分析框架,以推动未来研究向人类水平的几何推理发展,包括自动化基准生成和可解释的神经符号集成。

链接: https://arxiv.org/abs/2506.02690
作者: Yurui Zhao,Xiang Wang,Jiahong Liu,Irwin King,Zhitao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Geometric Topology (math.GT)
备注: 8pages, 4 figures, conference submission

点击查看摘要

Abstract:Geometry problem solving (GPS) represents a critical frontier in artificial intelligence, with profound applications in education, computer-aided design, and computational graphics. Despite its significance, automating GPS remains challenging due to the dual demands of spatial understanding and rigorous logical reasoning. Recent advances in large models have enabled notable breakthroughs, particularly for SAT-level problems, yet the field remains fragmented across methodologies, benchmarks, and evaluation frameworks. This survey systematically synthesizes GPS advancements through three core dimensions: (1) benchmark construction, (2) textual and diagrammatic parsing, and (3) reasoning paradigms. We further propose a unified analytical paradigm, assess current limitations, and identify emerging opportunities to guide future research toward human-level geometric reasoning, including automated benchmark generation and interpretable neuro-symbolic integration.
zh

[CV-67] Solving Inverse Problems with FLAIR

【速读】:该论文试图解决逆成像问题中使用基于流的潜在生成模型(如Stable Diffusion 3)时,由于编码到低维潜在空间导致的非线性前向映射、数据似然项不可处理以及推理过程中难以恢复罕见数据模式等关键障碍。其解决方案的关键在于提出FLAIR框架,这是一个无需训练的变分框架,通过引入与退化类型无关的流匹配变分目标,并结合确定性轨迹调整以恢复异常模式,同时解耦数据保真度和正则化项的优化,并引入时间依赖的校准方案以根据离线精度估计调节正则化强度。

链接: https://arxiv.org/abs/2506.02680
作者: Julius Erbach,Dominik Narnhofer,Andreas Dombos,Bernt Schiele,Jan Eric Lenssen,Konrad Schindler
机构: ETH Zürich (ETH Zurich); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the encoding into a lower-dimensional latent space makes the underlying (forward) mapping non-linear; (ii) the data likelihood term is usually intractable; and (iii) learned generative models struggle to recover rare, atypical data modes during inference. We present FLAIR, a novel training-free variational framework that leverages flow-based generative models as a prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to recover atypical modes. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity.
zh

[CV-68] Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation ICML2025

【速读】:该论文试图解决跨域小样本分割(Cross-Domain Few-Shot Segmentation, CD-FSS)中由于源域模式纠缠导致的特征迁移困难问题。其解决方案的关键在于对视觉Transformer(ViT)结构进行自然分解,并通过学习对所有ViT组件比较进行加权,以实现解缠绕特征的学习与重组,从而提升模型的泛化能力和微调效果。

链接: https://arxiv.org/abs/2506.02677
作者: Jintao Tong,Yixiong Zou,Guangyao Chen,Yuhua Li,Ruixuan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a source-domain dataset to unseen target-domain datasets with limited annotations. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find an entanglement problem exists in this widely adopted method, which tends to bind source-domain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find the decomposed ViT components are cross-compared between images in distance calculation, where the rational comparisons are entangled with meaningless ones by their equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weigh all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.
zh
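【代码示意】:下面用 PyTorch 给出"为分解后的 ViT 组件间两两比较学习权重"这一思路的最小示意。其中 `num_components`、特征形状等均为假设性设定,并非论文官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedComponentDistance(nn.Module):
    """对 ViT 分解组件间的两两相似度比较做可学习加权(概念示意)。"""
    def __init__(self, num_components: int):
        super().__init__()
        # 每一对组件比较对应一个可学习权重;初始化为均匀,对应"同等重要"的纠缠基线
        self.weights = nn.Parameter(torch.ones(num_components, num_components))

    def forward(self, feat_support: torch.Tensor, feat_query: torch.Tensor) -> torch.Tensor:
        # feat_support / feat_query: [C, D],C 为组件数,D 为特征维度
        s = F.normalize(feat_support, dim=-1)
        q = F.normalize(feat_query, dim=-1)
        sim = s @ q.t()                                   # [C, C] 组件间余弦相似度
        w = torch.softmax(self.weights.flatten(), dim=0).view_as(sim)
        return (w * sim).sum()                            # 加权聚合后的匹配分数
```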

[CV-69] Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

【速读】:该论文旨在解决测试时自适应(Test-time adaptation, TTA)在视觉-语言模型(Vision-Language Models, VLMs)中计算成本高且可扩展性差的问题。现有方法通常依赖于逐样本的适应粒度和昂贵的辅助设计(如数据增强),导致效率低下。其解决方案的关键在于提出一种基于适配器的框架SAIL(Small Aid, Big Leap),通过轻量级可学习的AdaptNet实现高效且可扩展的模型适应。SAIL的核心机制是利用预训练的冻结VLM与AdaptNet通过置信度加权插值协作,生成鲁棒预测,并通过批处理方式将这些预测作为自监督目标来对齐AdaptNet输出,从而显著降低计算成本。

链接: https://arxiv.org/abs/2506.02671
作者: Xiao Chen,Jiazhen Huang,Qinting Jiang,Fanding Huang,Xianghua Fu,Jingyan Jiang,Zhi Wang
机构: Tsinghua University (清华大学); Sichuan University (四川大学); Shenzhen Technology University (深圳技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference. However, existing approaches often incur substantial computational costs and exhibit poor scalability, primarily due to sample-wise adaptation granularity and reliance on costly auxiliary designs such as data augmentation. To address these limitations, we introduce SAIL (Small Aid, Big Leap), a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation. As SAIL’s core, a frozen pre-trained VLM collaborates with AdaptNet through a confidence-based interpolation weight, generating robust predictions during inference. These predictions serve as self-supervised targets to align AdaptNet’s outputs through efficient batch-wise processing, dramatically reducing computational costs without modifying the VLM or requiring memory caches. To mitigate catastrophic forgetting during continual adaptation, we propose a gradient-aware reset strategy driven by a gradient drift indicator (GDI), which dynamically detects domain transitions and strategically resets AdaptNet for stable adaptation. Extensive experiments across diverse benchmarks on two scenarios demonstrate that SAIL achieves state-of-the-art performance while maintaining low computational costs. These results highlight SAIL’s effectiveness, efficiency and scalability for real-world deployment. The code will be released upon acceptance.
zh
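【代码示意】:SAIL 中"冻结 VLM 与 AdaptNet 通过置信度加权插值协作"可示意如下;此处以 VLM 最大类别概率作为置信度仅为假设,具体插值形式以原文为准。

```python
import torch

def sail_predict(logits_vlm: torch.Tensor, logits_adapt: torch.Tensor) -> torch.Tensor:
    """置信度加权插值得到鲁棒预测(概念示意,非官方实现)。"""
    p_vlm = logits_vlm.softmax(dim=-1)        # 冻结 VLM 的预测
    p_adapt = logits_adapt.softmax(dim=-1)    # 轻量 AdaptNet 的预测
    # 假设:以 VLM 最大类别概率作为插值权重
    alpha = p_vlm.max(dim=-1, keepdim=True).values
    fused = alpha * p_vlm + (1.0 - alpha) * p_adapt
    return fused                              # 可同时作为 AdaptNet 的自监督对齐目标
```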

[CV-70] MotionRAG -Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation

【速读】:该论文旨在解决生成长期、连贯且逼真的音乐条件舞蹈序列的挑战,现有方法存在关键局限性:运动图方法依赖于固定模板库,限制了创造性生成;扩散模型虽然能够生成新动作,但常缺乏时间连贯性和音乐对齐。其解决方案的关键在于提出一种混合框架——MotionRAG-Diff,该框架将检索增强生成(Retrieval-Augmented Generation, RAG)与基于扩散的优化相结合,通过跨模态对比学习架构实现音乐与舞蹈表示的对齐,优化运动图系统以保证长序列的现实感和时间连贯性,并引入多条件扩散模型以提升动作质量和全局同步性。

链接: https://arxiv.org/abs/2506.02661
作者: Mingyang Huang,Peng Zhang,Bang Zhang
机构: Tongyi Lab (通义实验室); Alibaba Group (阿里巴巴集团)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose MotionRAG-Diff, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without paired data; (2) An optimized motion graph system for efficient retrieval and seamless concatenation of motion segments, ensuring realism and temporal coherence across long sequences; (3) A multi-condition diffusion model that jointly conditions on raw music signals and contrastive features to enhance motion quality and global synchronization. Extensive experiments demonstrate that MotionRAG-Diff achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy. This work establishes a new paradigm for music-driven dance generation by synergizing retrieval-based template fidelity with diffusion-based creative enhancement.
zh
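【代码示意】:跨模态对比学习架构常用对称 InfoNCE 损失把音乐与舞蹈嵌入对齐到共享潜在空间,以下为通用写法示意(温度系数等超参为假设)。

```python
import torch
import torch.nn.functional as F

def music_dance_contrastive_loss(music_emb, dance_emb, temperature: float = 0.07):
    """对称 InfoNCE:同一片段的音乐/舞蹈嵌入互为正样本(概念示意)。"""
    m = F.normalize(music_emb, dim=-1)        # [B, D]
    d = F.normalize(dance_emb, dim=-1)        # [B, D]
    logits = m @ d.t() / temperature          # [B, B] 相似度矩阵
    targets = torch.arange(m.size(0), device=m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```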

[CV-71] ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration

【速读】:该论文旨在解决图像去雨、去模糊和去噪等任务中的感知质量挑战。其解决方案的关键在于将Mamba网络架构与扩散模型相结合,构建条件网络以实现精细化的条件控制,从而提升图像生成过程的控制能力和优化效果。

链接: https://arxiv.org/abs/2506.02633
作者: Cheng Yang,Lijing Liang,Zhixun Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes ControlMambaIR, a novel image restoration method designed to address perceptual challenges in image deraining, deblurring, and denoising tasks. By integrating the Mamba network architecture with the diffusion model, the condition network achieves refined conditional control, thereby enhancing the control and optimization of the image generation process. To evaluate the robustness and generalization capability of our method across various image degradation conditions, extensive experiments were conducted on several benchmark datasets, including Rain100H, Rain100L, GoPro, and SSID. The results demonstrate that our proposed approach consistently surpasses existing methods in perceptual quality metrics, such as LPIPS and FID, while maintaining comparable performance in image distortion metrics, including PSNR and SSIM, highlighting its effectiveness and adaptability. Notably, ablation experiments reveal that direct noise prediction in the diffusion process achieves better performance, effectively balancing noise suppression and detail preservation. Furthermore, the findings indicate that the Mamba architecture is particularly well-suited as a conditional control network for diffusion models, outperforming both CNN- and Attention-based approaches in this context. Overall, these results highlight the flexibility and effectiveness of ControlMambaIR in addressing a range of image restoration perceptual challenges.
zh

[CV-72] Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies

【速读】:该论文旨在解决获取大规模、多样化生物特征数据集(尤其是虹膜图像)所面临的困难,这些问题在生物特征方法开发中被视为关键挑战。解决方案的关键在于利用生成式AI (Generative AI) 技术,包括传统图像处理方法、生成对抗网络 (GANs)、变分自编码器 (VAEs) 以及扩散模型等,以合成高质量的虹膜图像,从而减少对真实生物特征数据的依赖。同时,论文还探讨了生成模型在数据合成过程中可能引发的个体生物特征泄露风险,并提出了相应的防范策略。

链接: https://arxiv.org/abs/2506.02626
作者: Ada Sawilska,Mateusz Trokielewicz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive overview of iris image synthesis methods, which can alleviate the issues associated with gathering large, diverse datasets of biometric data from living individuals, which are considered pivotal for biometric methods development. These methods for synthesizing iris data range from traditional, hand-crafted image processing-based techniques, through various iterations of GAN-based image generators, variational autoencoders (VAEs), as well as diffusion models. The potential and fidelity of each method in iris image generation are discussed and examples of inferred predictions are provided. Furthermore, the risks of individual biometric features leakage from the training sets are considered, together with possible strategies for preventing them, which have to be implemented should these generative methods be considered a valid replacement of real-world biometric datasets.
zh

[CV-73] SiamNAS: Siamese Surrogate Model for Dominance Relation Prediction in Multi-objective Neural Architecture Search GECCO'25

【速读】:该论文旨在解决现代神经网络架构搜索(Neural Architecture Search, NAS)中多目标优化的计算成本过高问题,特别是在平衡模型精度、参数数量和计算成本之间的权衡时。其解决方案的关键在于提出一种基于Siamese网络块的代理建模方法,通过预测候选架构之间的支配关系来替代传统的拥挤距离计算,从而在生存者选择策略中引入基于模型大小的启发式规则。该方法轻量且易于训练,在NAS-Bench-201数据集上的实验表明,它能够显著降低计算成本,同时识别出帕累托最优解。

链接: https://arxiv.org/abs/2506.02623
作者: Yuyang Zhou,Ferrante Neri,Yew-Soon Ong,Ruibin Bai
机构: University of Nottingham Ningbo China(诺丁汉大学宁波分校); University of Surrey(萨里大学); Nanyang Technological University(南洋理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Genetic and Evolutionary Computation Conference (GECCO '25)

点击查看摘要

Abstract:Modern neural architecture search (NAS) is inherently multi-objective, balancing trade-offs such as accuracy, parameter count, and computational cost. This complexity makes NAS computationally expensive and nearly impossible to solve without efficient approximations. To address this, we propose a novel surrogate modelling approach that leverages an ensemble of Siamese network blocks to predict dominance relationships between candidate architectures. Lightweight and easy to train, the surrogate achieves 92% accuracy and replaces the crowding distance calculation in the survivor selection strategy with a heuristic rule based on model size. Integrated into a framework termed SiamNAS, this design eliminates costly evaluations during the search process. Experiments on NAS-Bench-201 demonstrate the framework’s ability to identify Pareto-optimal solutions with significantly reduced computational costs. The proposed SiamNAS identified a final non-dominated set containing the best architecture in NAS-Bench-201 for CIFAR-10 and the second-best for ImageNet, in terms of test error rate, within 0.01 GPU days. This proof-of-concept study highlights the potential of the proposed Siamese network surrogate model to generalise to multi-tasking optimisation, enabling simultaneous optimisation across tasks. Additionally, it offers opportunities to extend the approach for generating Sets of Pareto Sets (SOS), providing diverse Pareto-optimal solutions for heterogeneous task settings.
zh
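【代码示意】:用 Siamese 代理模型预测两个候选架构间的支配关系,可示意为共享编码器加二分类头(网络结构与维度均为假设,非论文原始配置)。

```python
import torch
import torch.nn as nn

class DominancePredictor(nn.Module):
    """Siamese 代理:输入两个架构编码,输出 P(arch_a 支配 arch_b)(概念示意)。"""
    def __init__(self, enc_dim: int = 32, hidden: int = 64):
        super().__init__()
        # 两个架构共享同一编码器,体现 Siamese 结构
        self.encoder = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, arch_a: torch.Tensor, arch_b: torch.Tensor) -> torch.Tensor:
        za, zb = self.encoder(arch_a), self.encoder(arch_b)
        return torch.sigmoid(self.head(torch.cat([za, zb], dim=-1)))
```

在进化搜索的生存者选择阶段,用该概率替代逐个候选的真实训练评估,即可大幅削减搜索开销。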

[CV-74] FlexPainter: Flexible and Multi-View Consistent Texture Generation

【速读】:该论文旨在解决纹理映射生成中的控制灵活性不足、提示模态有限以及多视角生成图像不一致导致的纹理生成质量差的问题。其解决方案的关键在于引入FlexPainter,通过构建共享的条件嵌入空间实现多模态条件引导的灵活聚合,并采用基于图像的CFG方法分解结构与风格信息以实现参考图像风格化;同时利用图像扩散先验中的3D知识,通过网格表示同步生成多视角图像,并结合视角同步与自适应加权模块确保局部一致性,最终通过3D感知的纹理补全与增强模型生成高质量、无缝的高分辨率纹理图。

链接: https://arxiv.org/abs/2506.02620
作者: Dongyu Yan,Leyi Wu,Jiantao Lin,Luozhou Wang,Tianshuo Xu,Zhifei Chen,Zhen Yang,Lie Xu,Shunsi Zhang,Yingcong Chen
机构: HKUST(GZ)(香港科技大学(广州)); Quwan(泉湾); HKUST(香港科技大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 10 figures in main paper, 10 pages, 12 figures in supplementary

点击查看摘要

Abstract:Texture map production is an important part of 3D modeling and determines the rendering quality. Recently, diffusion-based methods have opened a new way for texture generation. However, restricted control flexibility and limited prompt modalities may prevent creators from producing desired results. Furthermore, inconsistencies between generated multi-view images often lead to poor texture generation quality. To address these issues, we introduce FlexPainter, a novel texture generation pipeline that enables flexible multi-modal conditional guidance and achieves highly consistent texture generation. A shared conditional embedding space is constructed to perform flexible aggregation between different input modalities. Utilizing such embedding space, we present an image-based CFG method to decompose structural and style information, achieving reference image-based stylization. Leveraging the 3D knowledge within the image diffusion prior, we first generate multi-view images simultaneously using a grid representation to enhance global understanding. Meanwhile, we propose a view synchronization and adaptive weighting module during diffusion sampling to further ensure local consistency. Finally, a 3D-aware texture completion model combined with a texture enhancement model is used to generate seamless, high-resolution texture maps. Comprehensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods in both flexibility and generation quality.
zh

[CV-75] Rodrigues Network for Learning Robot Actions

【速读】:该论文试图解决在机器人学习中对关节动作进行理解和预测的问题,当前常见的架构如多层感知机(MLP)和Transformer缺乏反映关节系统潜在运动学结构的归纳偏置。解决方案的关键在于提出神经罗德里格斯算子(Neural Rodrigues Operator),这是一种可学习的经典正向运动学操作的泛化形式,旨在将运动学感知的归纳偏置注入神经计算中。基于该算子,作者设计了罗德里格斯网络(RodriNet),一种专门用于处理动作的新颖神经架构。

链接: https://arxiv.org/abs/2506.02618
作者: Jialiang Zhang,Haoran Geng,Yang You,Congyue Deng,Pieter Abbeel,Jitendra Malik,Leonidas Guibas
机构: Peking University (北京大学); UC Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding and predicting articulated actions is important in robot learning. However, common architectures such as MLPs and Transformers lack inductive biases that reflect the underlying kinematic structure of articulated systems. To this end, we propose the Neural Rodrigues Operator, a learnable generalization of the classical forward kinematics operation, designed to inject kinematics-aware inductive bias into neural computation. Building on this operator, we design the Rodrigues Network (RodriNet), a novel neural architecture specialized for processing actions. We evaluate the expressivity of our network on two synthetic tasks on kinematic and motion prediction, showing significant improvements compared to standard backbones. We further demonstrate its effectiveness in two realistic applications: (i) imitation learning on robotic benchmarks with the Diffusion Policy, and (ii) single-image 3D hand reconstruction. Our results suggest that integrating structured kinematic priors into the network architecture improves action learning in various domains.
zh
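【代码示意】:Neural Rodrigues Operator 推广自经典 Rodrigues 旋转公式 R = I + sinθ·K + (1−cosθ)·K²。下面给出该经典公式的 NumPy 实现,供理解其运动学出发点(这是教科书公式本身,而非论文中可学习的神经算子)。

```python
import numpy as np

def rodrigues_rotation(axis: np.ndarray, theta: float) -> np.ndarray:
    """由旋转轴与角度构造 3x3 旋转矩阵(经典 Rodrigues 公式)。"""
    k = axis / np.linalg.norm(axis)           # 单位化旋转轴
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])        # 轴对应的叉乘(反对称)矩阵
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```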

[CV-76] Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models

【速读】:该论文旨在解决自动驾驶中场景理解的高效性与细节准确性之间的平衡问题,特别是在有限计算资源下实现对关键驾驶视觉元素的精准识别与描述。解决方案的关键在于采用一种分层问答(hierarchical question-answering, QA)方法,通过微调一个轻量级的视觉语言模型(vision-language model, VLM)以适应特定地理区域的驾驶场景,并在推理阶段利用结构化的问答树动态分解任务,从而在保证场景描述准确性的同时降低计算开销。

链接: https://arxiv.org/abs/2506.02615
作者: Safaa Abdullahi Moallim Mohamud,Minjin Baek,Dong Seog Han
机构: Kyungpook National University (庆北国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage, the hierarchical QA strategy decomposes the scene understanding task into high-level and detailed sub-questions. Instead of generating lengthy descriptions, the VLM navigates a structured question tree, where answering high-level questions (e.g., “Is it possible for the ego vehicle to turn left at the intersection?”) triggers more detailed sub-questions (e.g., “Is there a vehicle approaching the intersection from the opposite direction?”). To optimize inference time, questions are dynamically skipped based on previous answers, minimizing computational overhead. The extracted answers are then synthesized using handcrafted templates to ensure coherent, contextually accurate scene descriptions. We evaluate the proposed approach on the custom dataset using GPT reference-free scoring, demonstrating its competitiveness with state-of-the-art methods like GPT-4o in capturing key scene details while achieving significantly lower inference time. Moreover, qualitative results from real-time deployment highlight the proposed approach’s capacity to capture key driving elements with minimal latency.
zh
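【代码示意】:分层问答的"按需下钻"可用一棵问题树表达:高层问题为否时整棵子树被跳过,从而降低推理开销。以下树结构与跳过约定均为示意。

```python
from dataclasses import dataclass, field

@dataclass
class QANode:
    question: str
    children: list = field(default_factory=list)  # 假设约定:仅当答案为"是"时下钻

def run_tree(node: QANode, vlm_answer, answers: dict) -> dict:
    """深度优先遍历问题树;vlm_answer(question) -> bool 为外部 VLM 调用。"""
    ans = vlm_answer(node.question)
    answers[node.question] = ans
    if ans:                                       # 高层答案为否时跳过所有子问题
        for child in node.children:
            run_tree(child, vlm_answer, answers)
    return answers

# 示意用法(问题文本取自摘要中的例子)
tree = QANode("Is it possible for the ego vehicle to turn left at the intersection?",
              [QANode("Is there a vehicle approaching the intersection from the opposite direction?")])
```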

[CV-77] High Performance Space Debris Tracking in Complex Skylight Backgrounds with a Large-Scale Dataset

【速读】:该论文旨在解决空间碎片实时准确跟踪的问题,现有方法主要依赖传统信号处理技术,难以有效处理复杂背景和密集的空间碎片。其解决方案的关键在于提出一种基于深度学习的Space Debris Tracking Network (SDT-Net),通过有效表征碎片特征,提升端到端模型学习的效率与稳定性。

链接: https://arxiv.org/abs/2506.02614
作者: Guohang Zhuang,Weixi Song,Jinyang Huang,Chenwei Yang,Yan Lu
机构: Hefei University of Technology (合肥工业大学); Zhejiang University (浙江大学); Astronomy, Polar Research Institute of China (中国极地研究中心天文部分); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development of space exploration, space debris has attracted more attention due to its potential extreme threat, leading to the need for real-time and accurate debris tracking. However, existing methods are mainly based on traditional signal processing, which cannot effectively process the complex background and dense space debris. In this paper, we propose a deep learning-based Space Debris Tracking Network (SDT-Net) to achieve highly accurate debris tracking. SDT-Net effectively represents the features of debris, enhancing the efficiency and stability of end-to-end model learning. To train and evaluate this model effectively, we also produce a large-scale dataset Space Debris Tracking Dataset (SDTD) by a novel observation-based data simulation scheme. SDTD contains 18,040 video sequences with a total of 62,562 frames and covers 250,000 synthetic space debris. Extensive experiments validate the effectiveness of our model and the challenging nature of our dataset. Furthermore, we test our model on real data from the Antarctic Station, achieving a MOTA score of 70.6%, which demonstrates its strong transferability to real-world scenarios. Our dataset and code will be released soon.
zh

[CV-78] One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

【速读】:该论文旨在解决基于扩散模型的图像超分辨率(SR)任务中生成图像在感知质量和语义一致性方面不足的问题,特别是针对现有知识蒸馏方法在减少采样步骤时导致的CLIPIQA分数较低和语义对齐不足的问题。其解决方案的关键在于提出一种名为VPD-SR的新型视觉感知扩散蒸馏框架,该框架包含两个核心组件:显式语义感知监督(Explicit Semantic-aware Supervision, ESS)和高频感知(High-Frequency Perception, HFP)损失。ESS通过利用CLIP模型的视觉感知能力提升语义一致性,而HFP损失则通过恢复退化图像中缺失的高频细节来增强感知质量,从而实现高效的单步SR重建。

链接: https://arxiv.org/abs/2506.02605
作者: Xue Wu,Jingwei Xin,Zhijun Tu,Jie Hu,Jie Li,Nannan Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based models have been widely used in various visual generation tasks, showing promising results in image super-resolution (SR), while typically being limited by dozens or even hundreds of sampling steps. Although existing methods aim to accelerate the inference speed of multi-step diffusion-based SR methods through knowledge distillation, their generated images exhibit insufficient semantic alignment with real images, resulting in suboptimal perceptual quality reconstruction, specifically reflected in the CLIPIQA score. These methods still have many challenges in perceptual quality and semantic fidelity. Based on these challenges, we propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for SR, aiming to construct an effective and efficient one-step SR model. Specifically, VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and High-Frequency Perception (HFP) loss. Firstly, the ESS leverages the powerful visual perceptual understanding capabilities of the CLIP model to extract explicit semantic supervision, thereby enhancing semantic consistency. Then, considering that high-frequency information contributes to the visual perception quality of images, in addition to the vanilla distillation loss, the HFP loss guides the student model to restore the missing high-frequency details in degraded images that are critical for enhancing perceptual quality. Lastly, we extend VPD-SR in an adversarial training manner to further enhance the authenticity of the generated content. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.
zh
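【代码示意】:HFP 损失的思想是在高频分量上对齐学生输出与目标图像。下面以 FFT 高通滤波给出一种常见实现方式的示意(滤波形式与半径均为假设,非论文原始定义)。

```python
import torch

def high_frequency_loss(pred: torch.Tensor, target: torch.Tensor, radius: int = 8) -> torch.Tensor:
    """抹去频谱中心低频后比较残差,引导模型恢复高频细节(概念示意)。"""
    def high_pass(x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        h, w = x.shape[-2:]
        mask = torch.ones(h, w, device=x.device)
        cy, cx = h // 2, w // 2
        mask[cy - radius:cy + radius, cx - radius:cx + radius] = 0  # 抹去中心低频
        return spec * mask
    return (high_pass(pred) - high_pass(target)).abs().mean()
```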

[CV-79] Application of convolutional neural networks in image super-resolution

【速读】:该论文试图解决不同深度学习方法在图像超分辨率中的关系与差异缺乏系统总结的问题,其解决方案的关键在于分析基于卷积神经网络(Convolutional Neural Networks, CNNs)的插值方法和模块,包括双三次插值、最近邻插值、双线性插值、转置卷积、子像素层和元上采样,并通过实验对比这些方法的性能,以明确其差异与联系。

链接: https://arxiv.org/abs/2506.02604
作者: Tian Chunwei,Song Mingjian,Zuo Wangmeng,Du Bo,Zhang Yanning,Zhang Shichao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: It has been accepted by CAAI transactions on intelligent systems, in Chinese language

点击查看摘要

Abstract:Due to their strong learning abilities, convolutional neural networks (CNNs) have become mainstream methods for image super-resolution. However, there are big differences between deep learning methods of different types, and there is little literature summarizing the relations and differences among these methods in image super-resolution. Summarizing this literature is therefore important, with respect to the loading capacity and execution speed of devices. This paper first introduces the principles of CNNs in image super-resolution, then introduces CNN-based bicubic interpolation, nearest neighbor interpolation, bilinear interpolation, transposed convolution, the sub-pixel layer, and meta up-sampling for image super-resolution to analyze the differences and relations between different CNN-based interpolations and modules, and compares the performance of these methods through experiments. Finally, this paper gives potential research points and drawbacks, and summarizes the whole paper, which can facilitate developments of CNNs in image super-resolution.
zh
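【代码示意】:文中比较的转置卷积与子像素层是两类常见的可学习上采样模块,下面给出二者的标准 PyTorch 写法以直观对比(通道数等参数为示例取值)。

```python
import torch.nn as nn

scale = 2
# 转置卷积:通过可学习的"反卷积"核直接放大空间分辨率
up_transposed = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=scale, padding=1)

# 子像素层:先把通道数扩大 scale^2 倍,再用 PixelShuffle 重排成空间像素
up_subpixel = nn.Sequential(
    nn.Conv2d(64, 64 * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),
)
```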

[CV-80] Hyperspectral Image Generation with Unmixing Guided Diffusion Model

【速读】:该论文旨在解决高光谱图像生成中现有生成模型依赖条件生成方案导致图像多样性受限的问题,以及将扩散模型从RGB数据适配到高光谱数据时面临的高维性和物理约束挑战。其解决方案的关键在于提出一种由高光谱解混引导的扩散模型,该模型包含两个核心模块:解混自编码器模块和丰度扩散模块。解混自编码器模块通过解混引导将生成任务从图像空间转移到低维丰度空间,显著降低计算复杂度同时保持高保真度;丰度扩散模块生成满足非负性和归一化约束的样本,确保重建高光谱图像的物理一致性。

链接: https://arxiv.org/abs/2506.02601
作者: Shiyu Shen,Bin Pan,Ziye Zhang,Zhenwei Shi
机构: Nankai University (南开大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Recently, hyperspectral image generation has received increasing attention, but existing generative models rely on conditional generation schemes, which limits the diversity of generated images. Diffusion models are popular for their ability to generate high-quality samples, but adapting these models from RGB to hyperspectral data presents the challenge of high dimensionality and physical constraints. To address these challenges, we propose a novel diffusion model guided by hyperspectral unmixing. Our model comprises two key modules: an unmixing autoencoder module and an abundance diffusion module. The unmixing autoencoder module leverages unmixing guidance to shift the generative task from the image space to the low-dimensional abundance space, significantly reducing computational complexity while preserving high fidelity. The abundance diffusion module generates samples that satisfy the constraints of non-negativity and unity, ensuring the physical consistency of the reconstructed HSIs. Additionally, we introduce two evaluation metrics tailored to hyperspectral data. Empirical results, evaluated using both traditional metrics and our proposed metrics, indicate that our model is capable of generating high-quality and diverse hyperspectral images, offering an advancement in hyperspectral data generation.
zh
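【代码示意】:丰度需满足非负且和为一的物理约束,常见做法是在丰度 logits 上施加 softmax,再按线性混合模型重建 HSI。以下仅演示约束的参数化方式(张量布局为假设,非论文网络本身)。

```python
import torch

def to_abundance(logits: torch.Tensor) -> torch.Tensor:
    """把无约束 logits 映射为合法丰度:非负、沿端元维(dim=1)和为 1。"""
    return torch.softmax(logits, dim=1)       # [B, E, H, W]

def reconstruct_hsi(abundance: torch.Tensor, endmembers: torch.Tensor) -> torch.Tensor:
    """线性混合模型:每个像素的光谱 = 各端元光谱按丰度加权求和(概念示意)。"""
    # abundance: [B, E, H, W];endmembers: [E, C] -> 输出 [B, C, H, W]
    return torch.einsum('behw,ec->bchw', abundance, endmembers)
```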

[CV-81] BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Birds-Eye View Representations

【速读】:该论文旨在解决LiDAR-相机标定问题,这是自动驾驶和机器人系统中多模态感知融合的基础。传统标定方法需要在受控环境中收集大量数据,并且无法补偿车辆/机器人运动过程中的变换变化。解决方案的关键在于提出一种名为BEVCALIB的模型,该模型利用鸟瞰图(BEV)特征从原始数据中进行LiDAR-相机标定。通过分别提取相机和LiDAR的BEV特征并将其融合到共享的BEV特征空间中,同时引入一种新颖的特征选择器以优化变换解码器中的关键特征,从而减少内存消耗并提高训练效率。

链接: https://arxiv.org/abs/2506.02587
作者: Weiduo Yuan,Jerry Li,Justin Yue,Divyank Shah,Konstantinos Karydis,Hang Qiu
机构: University of Southern California (南加州大学); University of California, Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate LiDAR-camera calibration is fundamental to fusing multi-modal perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes during the vehicle/robot movement. In this paper, we propose the first model that uses bird’s-eye view (BEV) features to perform LiDAR camera calibration from raw data, termed BEVCALIB. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometric information from the BEV feature, we introduce a novel feature selector to filter the most important features in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on KITTI, NuScenes, and our own dataset demonstrate that BEVCALIB establishes a new state of the art. Under various noise conditions, BEVCALIB outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on KITTI dataset, and (78.17%, 68.29%) on NuScenes dataset, in terms of (translation, rotation), respectively. In the open-source domain, it improves the best reproducible baseline by one order of magnitude. Our code and demo results are available at this https URL.
zh

[CV-82] Contrast & Compress: Learning Lightweight Embeddings for Short Trajectories

【速读】:该论文旨在解决如何高效且准确地检索语义和方向上相似的短距离轨迹的问题,这是运动预测和自主导航等下游应用的基础。现有方法通常依赖计算密集型启发式策略或缺乏可解释性和可控性的潜在锚点表示。论文提出的解决方案关键在于利用带有对比三元组损失的Transformer编码器学习固定维度的轨迹嵌入,通过强调判别特征空间的重要性,从而提升轨迹数据的表示能力。实验表明,基于余弦相似性目标的嵌入在聚类效果上优于基于FFT的基线方法,并且紧凑的Transformer架构在保持较高检索性能的同时显著降低了计算开销。

链接: https://arxiv.org/abs/2506.02571
作者: Abhishek Vivekanandan,Christian Hubschneider,J. Marius Zöllner
机构: FZI Research Center for Information Technology (FZI信息技术研究中心); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for peer review

点击查看摘要

Abstract:The ability to retrieve semantically and directionally similar short-range trajectories with both accuracy and efficiency is foundational for downstream applications such as motion forecasting and autonomous navigation. However, prevailing approaches often depend on computationally intensive heuristics or latent anchor representations that lack interpretability and controllability. In this work, we propose a novel framework for learning fixed-dimensional embeddings for short trajectories by leveraging a Transformer encoder trained with a contrastive triplet loss that emphasizes the importance of discriminative feature spaces for trajectory data. We analyze the influence of Cosine and FFT-based similarity metrics within the contrastive learning paradigm, with a focus on capturing the nuanced directional intent that characterizes short-term maneuvers. Our empirical evaluation on the Argoverse 2 dataset demonstrates that embeddings shaped by Cosine similarity objectives yield superior clustering of trajectories by both semantic and directional attributes, outperforming FFT-based baselines in retrieval tasks. Notably, we show that compact Transformer architectures, even with low-dimensional embeddings (e.g., 16 dimensions, but qualitatively down to 4), achieve a compelling balance between retrieval performance (minADE, minFDE) and computational overhead, aligning with the growing demand for scalable and interpretable motion priors in real-time systems. The resulting embeddings provide a compact, semantically meaningful, and efficient representation of trajectory data, offering a robust alternative to heuristic similarity measures and paving the way for more transparent and controllable motion forecasting pipelines.
zh
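【代码示意】:以余弦相似度驱动的对比三元组损失可写成如下通用形式,拉近 anchor 与正样本、推远负样本(margin 取值为假设,非论文原始超参)。

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin: float = 0.2) -> torch.Tensor:
    """基于余弦相似度的 triplet loss(概念示意)。"""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    sim_ap = (a * p).sum(dim=-1)              # anchor 与正样本的相似度
    sim_an = (a * n).sum(dim=-1)              # anchor 与负样本的相似度
    return F.relu(sim_an - sim_ap + margin).mean()
```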

[CV-83] DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

【速读】:该论文旨在解决扩散模型中逆过程(inversion)存在的重建精度与编辑灵活性之间的固有权衡问题(reconstruction accuracy and editing flexibility trade-off)。这一限制源于在逆过程中难以同时保持语义对齐和结构一致性。论文提出的解决方案是Dual-Conditional Inversion (DCI),其关键在于通过联合条件约束源提示(source prompt)和参考图像(reference image)来引导逆过程,将逆过程建模为一个双条件固定点优化问题,从而在语义和视觉空间中锚定逆过程轨迹,实现更准确且可编辑的潜在表示。

链接: https://arxiv.org/abs/2506.02560
作者: Zixiang Li,Haoyu Wang,Wei Wang,Chuangchuang Tan,Yunchao Wei,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); MOE (教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process.
zh

[CV-84] Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models ICML2025

【速读】:该论文试图解决现有视觉-语言模型(如CLIP)在细粒度感知方面的局限性,这一问题导致了下游多模态大语言模型(MLLMs)出现显著的性能缺陷。解决方案的关键在于提出一种基于核的方法,将CLIP的视觉表示与以视觉为中心的基础模型(如DINOv2)的表示进行对齐,从而在保持与文本嵌入兼容性的前提下提升视觉感知能力。该方法设计用于高效的随机优化,并通过仅对图像编码器进行微调,使其在零样本目标识别、细粒度空间推理和定位任务中表现出显著改进。

链接: https://arxiv.org/abs/2506.02557
作者: Shizhan Gong,Yankai Jiang,Qi Dou,Farzan Farnia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025

点击查看摘要

Abstract:Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP’s limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP’s visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance.
zh
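【代码示意】:"基于核的方法对齐两组视觉表示"的一种常见形式是对齐二者(中心化后)的核矩阵,风格类似 CKA。以下给出线性核版本的示意;论文实际采用的核与目标函数以原文为准。

```python
import torch

def kernel_alignment_loss(feat_clip: torch.Tensor, feat_dino: torch.Tensor) -> torch.Tensor:
    """对齐 CLIP 与 DINOv2 特征的中心化线性核矩阵(CKA 风格概念示意)。"""
    def centered_gram(x: torch.Tensor) -> torch.Tensor:
        x = x - x.mean(dim=0, keepdim=True)   # 按样本维中心化,x: [N, D]
        g = x @ x.t()                         # [N, N] 线性核矩阵
        return g / (g.norm() + 1e-8)          # Frobenius 归一化
    g1, g2 = centered_gram(feat_clip), centered_gram(feat_dino)
    return 1.0 - (g1 * g2).sum()              # 核矩阵越相似,损失越小
```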

[CV-85] SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

【速读】:该论文旨在解决手术领域中生成式 AI(Generative AI)应用不足的问题,特别是针对手术视觉感知、时间序列分析和推理能力的特殊需求。现有通用视觉-语言模型由于缺乏领域特定监督和高质量大规模手术数据库而无法满足这些需求。解决方案的关键在于构建一个大规模多模态手术数据库——SurgVLM-DB,包含超过1.81百万帧和7.79百万次对话,覆盖16种手术类型和18个解剖结构,并通过统一和重新组织23个公开数据集、标准化标签以及分层视觉-语言对齐,实现对逐步细化的手术任务的全面覆盖。基于此数据集,研究者提出了SurgVLM,一个针对手术智能的大型视觉-语言基础模型,并构建了SurgVLM-Bench以评估性能。

链接: https://arxiv.org/abs/2506.02555
作者: Zhitao Zeng,Zhu Zhuo,Xiaojun Jia,Erli Zhang,Junde Wu,Jiaan Zhang,Yuxuan Wang,Chang Han Low,Jian Jiang,Zilong Zheng,Xiaochun Cao,Yutong Ban,Qi Dou,Yang Liu,Yueming Jin
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); University of Oxford (牛津大学); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室,BIGAI); Shanghai Jiao Tong University (上海交通大学); Sun Yat-sen University (中山大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 5 figures

点击查看摘要

Abstract:Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges - requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, where this single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, followed by standardizing labels and doing hierarchical vision-language alignment to facilitate comprehensive coverage of gradually finer-grained surgical tasks, from visual perception, temporal analysis, to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL, and undergoes instruction tuning to 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely-used datasets in surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate the performance of our SurgVLM (3 SurgVLM variants: SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B), and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).
zh

[CV-86] HiLO: High-Level Object Fusion for Autonomous Driving using Transformers

【速读】:该论文旨在解决自动驾驶中环境感知的传感器数据融合问题,特别是在近量产车辆中,传统基于学习的特征级融合方法因复杂度和硬件需求高而难以应用,而传统高阶融合方法如卡尔曼滤波虽计算需求低但性能有限。论文的关键解决方案是改进自适应卡尔曼滤波(Adapted Kalman Filter)并提出一种基于Transformer的高阶目标融合方法——HiLO,通过实验验证其在F₁分数和平均IoU上的显著提升,并在跨域场景下展示了良好的泛化能力。

链接: https://arxiv.org/abs/2506.02554
作者: Timo Osterburg,Franz Albers,Christopher Diehl,Rajesh Pushparaj,Torsten Bertram
机构: TU Dortmund University (多特蒙德工业大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2025

点击查看摘要

Abstract:The fusion of sensor data is essential for a robust perception of the environment in autonomous driving. Learning-based fusion approaches mainly use feature-level fusion to achieve high performance, but their complexity and hardware requirements limit their applicability in near-production vehicles. High-level fusion methods offer robustness with lower computational requirements. Traditional methods, such as the Kalman filter, dominate this area. This paper modifies the Adapted Kalman Filter (AKF) and proposes a novel transformer-based high-level object fusion method called HiLO. Experimental results demonstrate improvements of 25.9 percentage points in F1 score and 6.1 percentage points in mean IoU. Evaluation on a new large-scale real-world dataset demonstrates the effectiveness of the proposed approaches. Their generalizability is further validated by cross-domain evaluation between urban and highway scenarios. Code, data, and models are available at this https URL.
zh
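【代码示意】:作为高阶融合的传统基线,卡尔曼滤波的单步"预测-更新"可写成如下标准形式(通用教科书公式,并非 HiLO 或文中 AKF 的具体参数化)。

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """标准卡尔曼滤波一步:x 为状态均值,P 为协方差,z 为观测。"""
    # 预测
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # 更新
    S = H @ P_pred @ H.T + R                  # 新息协方差
    K = P_pred @ H.T @ np.linalg.inv(S)       # 卡尔曼增益
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return x_new, P_new
```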

[CV-87] Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025 CVPR ICIP

【速读】:该论文旨在解决Ego4D长期动作预测(Long-Term Action Anticipation, LTA)任务中的挑战,即在第一人称视角视频中准确预测未来的动作序列。解决方案的关键在于提出了一种三阶段框架:首先使用高性能视觉编码器提取视觉特征,随后通过Transformer模型结合动词-名词共现矩阵进行动作识别,最后将识别出的动词-名词对作为文本提示输入微调的大规模语言模型(Large Language Model, LLM),以生成未来的动作序列。该方法在CVPR 2025挑战赛中取得了第一名,并建立了新的最先进性能基准。

链接: https://arxiv.org/abs/2506.02550
作者: Qiaohui Chu,Haoyu Zhang,Yisen Feng,Meng Liu,Weili Guan,Yaowei Wang,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Pengcheng Laboratory (鹏城实验室); Shandong Jianzhu University (山东建筑大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The champion solution for the Ego4D Long-Term Action Anticipation Challenge at the CVPR EgoVis Workshop 2025

点击查看摘要

Abstract:In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at this https URL.
zh

[CV-88] Probabilistic Online Event Downsampling CVPR2025

【速读】:该论文旨在解决事件相机(event camera)由于其高时间分辨率带来的高带宽、内存和计算需求问题。传统方法通过事件下采样来缓解这一问题,但多数方法依赖于固定的启发式或阈值策略,限制了其适应性。论文提出的解决方案是基于概率框架POLED,其关键在于通过事件重要性概率密度函数(ePDF)建模事件的重要性,该函数可以任意定义并适应不同应用场景。该方法在纯在线设置中运行,能够从原始事件流中实时估计事件重要性,实现场景特定的自适应,并引入零样本事件下采样,确保下采样后的事件可用于未经过任务特定微调的模型。

链接: https://arxiv.org/abs/2506.02547
作者: Andreu Girbau-Xalabarder,Jun Nagata,Shinichi Sumiyoshi
机构: Denso IT Laboratory(电装IT实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: Accepted at CVPR 2025 Event-Vision workshop

点击查看摘要

Abstract:Event cameras capture scene changes asynchronously on a per-pixel basis, enabling extremely high temporal resolution. However, this advantage comes at the cost of high bandwidth, memory, and computational demands. To address this, prior work has explored event downsampling, but most approaches rely on fixed heuristics or threshold-based strategies, limiting their adaptability. Instead, we propose a probabilistic framework, POLED, that models event importance through an event-importance probability density function (ePDF), which can be arbitrarily defined and adapted to different applications. Our approach operates in a purely online setting, estimating event importance on-the-fly from raw event streams, enabling scene-specific adaptation. Additionally, we introduce zero-shot event downsampling, where downsampled events must remain usable for models trained on the original event stream, without task-specific adaptation. We design a contour-preserving ePDF that prioritizes structurally important events and evaluate our method across four datasets and tasks–object classification, image interpolation, surface normal estimation, and object detection–demonstrating that intelligent sampling is crucial for maintaining performance under event-budget constraints.
zh
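【代码示意】:POLED 按事件重要性概率在线决定每个事件的去留,可示意为对任意给定 ePDF 的伯努利采样(ePDF 接口与预算归一化方式均为假设)。

```python
import numpy as np

def downsample_events(events: np.ndarray, epdf, budget: float, rng=None) -> np.ndarray:
    """按 ePDF 给出的重要性做概率保留(概念示意)。
    events: [N, 4] 的 (x, y, t, p);epdf(event) -> [0, 1];budget 控制整体保留率。"""
    rng = rng or np.random.default_rng()
    scores = np.array([epdf(e) for e in events])
    keep_prob = np.clip(budget * scores / (scores.mean() + 1e-8), 0.0, 1.0)
    mask = rng.random(len(events)) < keep_prob  # 逐事件独立的伯努利采样
    return events[mask]
```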

[CV-89] HIEGNet: A Heterogenous Graph Neural Network Including the Immune Environment in Glomeruli Classification

【速读】:该论文旨在解决肾病病理学中对肾小球健康状态分类的问题,这一任务在图神经网络(GNNs)中的应用尚未得到充分探索。其关键在于构建有效的图结构,即准确识别节点、边及其相关特征。为此,作者提出了一种结合传统与机器学习计算机视觉技术的流程,用于构建异构图,并进一步设计了一种新型的异构GNN架构HIEGNet,该架构整合了肾小球及其周围免疫细胞的信息,从而在分类过程中考虑每个肾小球的免疫环境。

链接: https://arxiv.org/abs/2506.02542
作者: Niklas Kormann,Masoud Ramuz,Zeeshan Nisar,Nadine S. Schaadt,Hendrik Annuth,Benjamin Doerr,Friedrich Feuerhake,Thomas Lampert,Johannes F. Lutzeyer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Accepted for poster presentation at MIDL 2025

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have recently been found to excel in histopathology. However, an important histopathological task, where GNNs have not been extensively explored, is the classification of glomeruli health as an important indicator in nephropathology. This task presents unique difficulties, particularly for the graph construction, i.e., the identification of nodes, edges, and informative features. In this work, we propose a pipeline composed of different traditional and machine learning-based computer vision techniques to identify nodes, edges, and their corresponding features to form a heterogeneous graph. We then proceed to propose a novel heterogeneous GNN architecture for glomeruli classification, called HIEGNet, that integrates both glomeruli and their surrounding immune cells. Hence, HIEGNet is able to consider the immune environment of each glomerulus in its classification. Our HIEGNet was trained and tested on a dataset of Whole Slide Images from kidney transplant patients. Experimental results demonstrate that HIEGNet outperforms several baseline models and generalises best between patients among all baseline models. Our implementation is publicly available at this https URL.
zh

[CV-90] Rethinking Post-Unlearning Behavior of Large Vision-Language Models

【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在使用机器遗忘(Machine Unlearning)技术去除特定训练数据时所引发的“遗忘后遗症”(Unlearning Aftermaths)问题,这些问题包括生成质量下降、幻觉或过度拒绝响应等。解决方案的关键在于提出一种新的遗忘任务,要求模型在保护隐私的同时提供具有视觉依据且信息丰富的响应,并引入PUBG方法,通过显式引导模型在遗忘后的输出分布上达到理想状态,从而有效缓解遗忘后遗症,避免隐私泄露。

链接: https://arxiv.org/abs/2506.02541
作者: Minsung Kim,Nakyeong Yang,Kyomin Jung
机构: Seoul National University (首尔国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Machine unlearning is used to mitigate the privacy risks of Large Vision-Language Models (LVLMs) arising from training on large-scale web data. However, existing unlearning methods often fail to carefully select substitute outputs for forget targets, resulting in Unlearning Aftermaths-undesirable behaviors such as degenerate, hallucinated, or excessively refused responses. We highlight that, especially for generative LVLMs, it is crucial to consider the quality and informativeness of post-unlearning responses rather than relying solely on naive suppression. To address this, we introduce a new unlearning task for LVLMs that requires models to provide privacy-preserving yet informative and visually grounded responses. We also propose PUBG, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution. Experiments show that, while existing methods suffer from Unlearning Aftermaths despite successfully preventing privacy violations, PUBG effectively mitigates these issues, generating visually grounded and informative responses without privacy leakage for forgotten targets.
zh

[CV-91] VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在抽象视觉推理(Abstract Visual Reasoning, AVR)任务中的性能瓶颈问题,主要由于其对抽象图形的感知能力有限。解决方案的关键在于提出了一种名为Perceptual Riddle Synthesizer (PRS) 的自动化框架,该框架能够生成具有细粒度感知描述的谜题,从而提升模型在抽象视觉感知方面的训练效果,并通过监督中间推理阶段来增强模型的可解释性与性能。

链接: https://arxiv.org/abs/2506.02537
作者: Hao Yan,Handong Zheng,Hao Wang,Liang Yin,Xingchen Liu,Zhenbiao Cao,Xinxing Su,Zihao Chen,Jihao Wu,Minghui Liao,Chao Weng,Wei Chen,Yuliang Liu,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Huawei Inc. (华为公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models’ reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at this https URL
zh

[CV-92] MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection

【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)方法在重建或预测基础上面临的两个关键问题:一是强泛化能力导致异常事件被准确重建或预测,难以区分正常与异常模式;二是仅依赖低级外观和运动线索限制了从复杂场景中识别异常事件的高级语义能力。其解决方案的关键在于提出一种包含两个核心创新的框架:首先,引入稀疏特征过滤模块(Sparse Feature Filtering Module, SFFM),通过瓶颈滤波器动态自适应地移除异常信息,避免记忆正常原型;其次,设计专家混合(Mixture of Experts, MoE)架构以增强主特征的多样性。其次,为克服现有方法对语义的忽视,集成视觉-语言模型(Vision-Language Model, VLM)生成视频片段的文本描述,实现语义、外观和运动线索的联合建模,并通过语义相似性约束和运动帧差对比损失强化模态一致性。

链接: https://arxiv.org/abs/2506.02535
作者: Juntong Li,Lingwei Dang,Yukun Su,Yun Hao,Qingxin Xiao,Yongwei Nie,Qingyao Wu
机构: School of Software Engineering, South China University of Technology (软件学院,华南理工大学); Wechat AI, Tencent (微信人工智能,腾讯); Institute for Super Robotics (Huangpu) (超机器人研究所(黄埔)); School of Computer Science and Engineering, South China University of Technology (计算机科学与工程学院,华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video Anomaly Detection (VAD) methods based on reconstruction or prediction face two critical challenges: (1) strong generalization capability often results in accurate reconstruction or prediction of abnormal events, making it difficult to distinguish normal from abnormal patterns; (2) reliance only on low-level appearance and motion cues limits their ability to identify high-level semantic in abnormal events from complex scenes. To address these limitations, we propose a novel VAD framework with two key innovations. First, to suppress excessive generalization, we introduce the Sparse Feature Filtering Module (SFFM) that employs bottleneck filters to dynamically and adaptively remove abnormal information from features. Unlike traditional memory modules, it does not need to memorize the normal prototypes across the training dataset. Further, we design the Mixture of Experts (MoE) architecture for SFFM. Each expert is responsible for extracting specialized principal features during running time, and different experts are selectively activated to ensure the diversity of the learned principal features. Second, to overcome the neglect of semantics in existing methods, we integrate a Vision-Language Model (VLM) to generate textual descriptions for video clips, enabling comprehensive joint modeling of semantic, appearance, and motion cues. Additionally, we enforce modality consistency through semantic similarity constraints and motion frame-difference contrastive loss. Extensive experiments on multiple public datasets validate the effectiveness of our multimodal joint modeling framework and sparse feature filtering paradigm. Project page at this https URL.
zh

[CV-93] Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels

【速读】:该论文旨在解决单目高度估计在遥感中因缺乏高质量标注数据而导致模型泛化能力不足的问题,从而限制了现有方法的大规模应用。其解决方案的关键在于首次引入带有不完美标签的数据(包括不完整、不精确和不准确的标签)来训练像素级高度估计网络,并设计了一个与任何单目高度估计网络兼容的集成框架。该框架通过考虑噪声标签、领域偏移和高度值的长尾分布,精心设计了架构和损失函数,利用平衡软损失和序数约束从不完美标签中提取隐含信息,从而提升了模型在不同领域的性能。

链接: https://arxiv.org/abs/2506.02534
作者: Sining Chen,Yilei Shi,Xiao Xiang Zhu
机构: Technical University of Munich (TUM); Munich Center for Machine Learning; TUM Georg Nemetschek Institute for Artificial Intelligence for the Built World
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular height estimation is considered the most efficient and cost-effective means of 3D perception in remote sensing, and it has attracted much attention since the emergence of deep learning. While training neural networks requires a large amount of data, data with perfect labels are scarce and only available within developed regions. The trained models therefore lack generalizability, which limits the potential for large-scale application of existing methods. We tackle this problem for the first time, by introducing data with imperfect labels into training pixel-wise height estimation networks, including labels that are incomplete, inexact, and inaccurate compared to high-quality labels. We propose an ensemble-based pipeline compatible with any monocular height estimation network. Taking the challenges of noisy labels, domain shift, and long-tailed distribution of height values into consideration, we carefully design the architecture and loss functions to leverage the information concealed in imperfect labels using weak supervision through balanced soft losses and ordinal constraints. We conduct extensive experiments on two datasets with different resolutions, DFC23 (0.5 to 1 m) and GBH (3 m). The results indicate that the proposed pipeline outperforms baselines by achieving more balanced performance across various domains, leading to improvements in average root mean square error of up to 22.94% and 18.62% on DFC23 and GBH, respectively. The efficacy of each design component is validated through ablation studies. Code is available at this https URL.
zh

[CV-94] RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

【速读】:该论文旨在解决现有单参考图像编辑方法在非刚性变换任务中表现不佳的问题,尤其是在处理内容感知的编辑意图传递时存在局限。其解决方案的关键在于引入RelationAdapter模块,该模块通过利用源-目标图像对,使基于Diffusion Transformer(DiT)的模型能够从少量示例中有效捕捉并应用视觉变换,从而提升模型在视觉提示驱动场景下的泛化能力和编辑性能。

链接: https://arxiv.org/abs/2506.02528
作者: Yan Gong,Yiren Song,Yicheng Li,Chenglin Li,Yin Zhang
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model’s ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.
zh

[CV-95] LumosFlow: Motion-Guided Long Video Generation

【速读】:该论文旨在解决长视频生成中时间连贯性和视觉吸引力不足的问题,特别是在传统方法中由于逐段生成和拼接或关键帧插值导致的时间重复或不自然过渡等问题。其解决方案的关键在于引入LumosFlow框架,该框架通过显式引入运动指导,首先利用大型运动文本到视频扩散模型(LMTV-DM)生成具有较大运动间隔的关键帧以保证内容多样性,随后通过分解中间帧插值为运动生成与后处理优化两个阶段,其中使用潜在光流扩散模型(LOF-DM)合成复杂且大运动的光流,并通过MotionControlNet对变形结果进行优化,从而提升质量并引导中间帧生成,实现了15倍的插值效果,确保相邻帧间运动的合理与连续。

链接: https://arxiv.org/abs/2506.02497
作者: Jiahao Chen,Hangjie Yuan,Yichen Qian,Jingyun Liang,Jiazheng Xing,Pengwei Liu,Weihua Chen,Fan Wang,Bing Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or generating key frames and then interpolate the intermediate frames in a hierarchical manner. However, both of them still remain significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework introduce motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: this https URL
zh

[CV-96] owards In-the-wild 3D Plane Reconstruction from a Single Image CVPR2025

【速读】:该论文旨在解决从单张图像中进行跨域3D平面重建的挑战,现有方法通常仅在室内或室外单一数据集上训练,限制了其在多样化测试数据中的泛化能力。解决方案的关键在于提出一种名为ZeroPlane的基于Transformer的框架,通过解耦平面法向量与偏移量的表示,并采用示例引导的分类-回归范式来分别学习平面和偏移量,同时引入先进的图像编码器和像素-几何增强的平面嵌入模块,以提升重建精度和跨域泛化能力。

链接: https://arxiv.org/abs/2506.02493
作者: Jiachen Liu,Rui Yu,Sili Chen,Sharon X. Huang,Hengkai Guo
机构: The Pennsylvania State University (宾夕法尼亚州立大学); University of Louisville (路易斯维尔大学); Bytedance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025 Highlighted Paper

点击查看摘要

Abstract:3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their system on a single dataset from either indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale planar benchmark, comprising over 14 datasets and 560,000 high-resolution, dense planar annotations for diverse indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn plane and offset respectively. Additionally, we employ advanced backbones as image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. Our code and data are available at: this https URL.
zh
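
摘要中“示例引导、先分类后回归”的范式可以用如下草图说明:先将法向量分类到一组示例锚点,再回归残差并归一化,偏移量单独回归。锚点数量、特征维度与各预测头的结构均为本文假设,仅示意解耦思路。

```python
import torch
import torch.nn as nn

class ClassifyThenRegressNormal(nn.Module):
    """示意:先对示例法向量(锚点)做分类,再回归残差,得到最终平面法向量。"""
    def __init__(self, feat_dim: int, anchors: torch.Tensor):
        super().__init__()
        self.register_buffer("anchors", anchors)   # [K, 3] 单位法向量锚点
        self.cls_head = nn.Linear(feat_dim, anchors.shape[0])  # 分类:属于哪个锚点
        self.res_head = nn.Linear(feat_dim, 3)     # 回归:法向量残差修正
        self.offset_head = nn.Linear(feat_dim, 1)  # 平面偏移量单独回归

    def forward(self, feat: torch.Tensor):
        logits = self.cls_head(feat)               # [B, K]
        base = self.anchors[logits.argmax(dim=-1)] # 选中的锚点法向量
        normal = base + self.res_head(feat)        # 锚点 + 残差
        normal = normal / normal.norm(dim=-1, keepdim=True)
        return logits, normal, self.offset_head(feat)

anchors = torch.nn.functional.normalize(torch.randn(7, 3), dim=-1)
model = ClassifyThenRegressNormal(feat_dim=256, anchors=anchors)
logits, normal, offset = model(torch.randn(2, 256))
print(normal.shape, offset.shape)  # torch.Size([2, 3]) torch.Size([2, 1])
```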

[CV-97] Co-Evidential Fusion with Information Volume for Medical Image Segmentation

【速读】:该论文旨在解决现有半监督图像分割方法在利用体素级不确定性多源信息进行针对性学习方面的不足。其解决方案的关键在于引入一种基于广义证据深度学习的皮格诺斯(pignistic)协同证据融合策略,以获得更精确的体素不确定性度量,并结合质量函数信息量(Information Volume of Mass function, IVUM)评估所构建的证据,从而优化证据深度学习并设计新的优化目标。

链接: https://arxiv.org/abs/2506.02492
作者: Yuanpeng He,Lijian Li,Tianxiang Zhan,Chi-Man Pun,Wenpin Jiao,Zhi Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although existing semi-supervised image segmentation methods have achieved good performance, they cannot effectively utilize multiple sources of voxel-level uncertainty for targeted learning. Therefore, we propose two main improvements. First, we introduce a novel pignistic co-evidential fusion strategy using generalized evidential deep learning, extended by traditional D-S evidence theory, to obtain a more precise uncertainty measure for each voxel in medical samples. This assists the model in learning mixed labeled information and establishing semantic associations between labeled and unlabeled data. Second, we introduce the concept of information volume of mass function (IVUM) to evaluate the constructed evidence, implementing two evidential learning schemes. One optimizes evidential deep learning by combining the information volume of the mass function with original uncertainty measures. The other integrates the learning pattern based on the co-evidential fusion strategy, using IVUM to design a new optimization objective. Experiments on four datasets demonstrate the competitive performance of our method.
zh
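
该方法以 D-S 证据理论为基础。下面给出经典 Dempster 组合规则的最小实现,说明两条证据(mass 函数)如何融合并对冲突归一化;示例中“几何/强度”两条体素级证据与数值均为本文假设,论文的皮格诺斯协同融合是在此类组合之上的推广。

```python
def dempster_combine(m1: dict, m2: dict) -> dict:
    """Dempster 组合规则:m(A) = sum_{B∩C=A} m1(B)m2(C) / (1-K)。
    m1、m2 为以 frozenset 为键的基本概率分配(mass function),K 为冲突质量。"""
    combined, conflict = {}, 0.0
    for b, p1 in m1.items():
        for c, p2 in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + p1 * p2
            else:
                conflict += p1 * p2          # 交集为空 -> 计入冲突 K
    assert conflict < 1.0, "完全冲突,无法组合"
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# 两个体素级证据源对 {前景 f, 背景 b} 的 mass 分配示例(数值为假设)
F, B = frozenset("f"), frozenset("b")
U = frozenset("fb")                          # 不确定(全集)
m_geo = {F: 0.6, B: 0.1, U: 0.3}
m_int = {F: 0.5, B: 0.3, U: 0.2}
print(dempster_combine(m_geo, m_int))
# {f}: 0.740, {b}: 0.182, {f,b}: 0.078 —— 两证据一致时前景置信被放大,剩余不确定性可作体素不确定度
```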

[CV-98] Grasp2Grasp: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges

【速读】:该论文试图解决跨不同形态机械手的视觉引导灵巧抓取迁移问题,即在不依赖成对示教或手部专用仿真的情况下,将源手的抓取意图转化为目标手的功能等效抓取。解决方案的关键在于将该问题建模为基于Schrödinger Bridge形式的抓取分布间的随机传输,并通过条件视觉观测下的得分匹配和流匹配学习源手与目标手潜在抓取空间之间的映射。此外,引入了物理信息成本函数以指导抓取姿态、接触图、力矩空间和操作性的对齐。

链接: https://arxiv.org/abs/2506.02489
作者: Tao Zhong,Jonah Buchanan,Christine Allen-Blanchette
机构: Princeton University (普林斯顿大学); San Jose State University (圣何塞州立大学); Lockheed Martin Corporation (洛克希德·马丁公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 4 figures

点击查看摘要

Abstract:We propose a new approach to vision-based dexterous grasp translation, which aims to transfer grasp intent across robotic hands with differing morphologies. Given a visual observation of a source hand grasping an object, our goal is to synthesize a functionally equivalent grasp for a target hand without requiring paired demonstrations or hand-specific simulations. We frame this problem as a stochastic transport between grasp distributions using the Schrödinger Bridge formalism. Our method learns to map between source and target latent grasp spaces via score and flow matching, conditioned on visual observations. To guide this translation, we introduce physics-informed cost functions that encode alignment in base pose, contact maps, wrench space, and manipulability. Experiments across diverse hand-object pairs demonstrate our approach generates stable, physically grounded grasps with strong generalization. This work enables semantic grasp transfer for heterogeneous manipulators and bridges vision-based grasping with probabilistic generative modeling.
zh

[CV-99] Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在生成高质量图像时面临的高计算成本问题,以及现有神经网络架构搜索(Neural Architecture Search, NAS)方法在优化DMs时存在的再训练需求、指数级搜索复杂度和依赖大量图像生成的评估速度慢等挑战。其解决方案的关键在于提出Flexiffusion,一个无需训练的NAS框架,通过联合优化生成调度和模型架构,在不修改预训练参数的前提下实现高效搜索。核心创新是将生成过程分解为等长的灵活片段,每个片段动态组合全计算、部分计算和空计算三种步骤类型,从而显著降低候选池规模并保持架构多样性,同时引入轻量级评估指标相对FID(rFID)以大幅减少评估时间。

链接: https://arxiv.org/abs/2506.02488
作者: Hongtao Huang,Xiaojun Chang,Lina Yao
机构: University of New South Wales (新南威尔士大学); University of Technology Sydney (悉尼科技大学); CSIRO’s Data61 (澳大利亚科学工业研究组织数据61); University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but are constrained by high computational costs due to iterative multi-step inference. While Neural Architecture Search (NAS) can optimize DMs, existing methods are hindered by retraining requirements, exponential search complexity from step-wise optimization, and slow evaluation relying on massive image generation. To address these challenges, we propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our key insight is to decompose the generation process into flexible segments of equal length, where each segment dynamically combines three step types: full (complete computation), partial (cache-reused computation), and null (skipped computation). This segment-wise search space reduces the candidate pool exponentially compared to step-wise NAS while preserving architectural diversity. Further, we introduce relative FID (rFID), a lightweight evaluation metric for NAS that measures divergence from a teacher model's outputs instead of ground truth, slashing evaluation time by over 90%. In practice, Flexiffusion achieves at least 2× acceleration across LDMs, Stable Diffusion, and DDPMs on ImageNet and MS-COCO, with FID degradation under 5%, outperforming prior NAS and caching methods. Notably, it attains 5.1× speedup on Stable Diffusion with near-identical CLIP scores. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
zh
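
摘要中“full / partial / null”三种步骤类型的分段调度可以用一个极简执行器来说明:full 步完整计算并刷新缓存,partial 步复用缓存结果,null 步直接跳过。真实方法中缓存作用于网络深层特征,此处整体复用噪声预测、以及示意性的更新公式,均为本文假设的简化,并非论文实现。

```python
import torch

def run_schedule(x: torch.Tensor, steps, denoise_fn):
    """按分段调度执行去噪:'full' 完整计算并刷新缓存;
    'partial' 复用缓存的计算结果;'null' 跳过计算。"""
    cache = None
    for t, kind in enumerate(steps):
        if kind == "full":
            eps = denoise_fn(x, t)
            cache = eps                      # 刷新缓存
        elif kind == "partial":
            assert cache is not None, "partial 步之前必须有 full 步"
            eps = cache                      # 复用缓存
        else:                                # "null":直接跳过本步
            continue
        x = x - 0.1 * eps                    # 示意性更新(非真实采样器)
    return x

toy_model = lambda x, t: torch.tanh(x)       # 占位去噪网络
# 一个等长分段的调度候选:每段 = [full, partial, partial, null]
schedule = ["full", "partial", "partial", "null"] * 3
out = run_schedule(torch.randn(1, 3, 8, 8), schedule, toy_model)
print(out.shape)   # 12 步中仅 3 步做了完整前向
```

搜索空间因此从“逐步组合”缩减为“逐段组合”:候选数随段数而非步数呈指数,这正是摘要所述候选池指数级缩减的来源。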

[CV-100] Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay

【速读】:该论文旨在解决当前图像去雨方法依赖有限数据集导致在复杂真实雨天条件下性能不足的问题。其解决方案的关键在于引入一种新的框架,使网络能够通过接入不断增长的数据集逐步扩展去雨知识库,从而显著提升适应性。该框架借鉴了人类大脑持续吸收和泛化经验的互补学习机制,首先利用生成对抗网络(GANs)捕捉并保留新数据的独特特征,随后将现有数据与GAN合成数据结合进行去雨网络训练,模拟海马体的重放与交错学习过程,并通过知识蒸馏复现海马体重放触发的新皮层活动模式与已有新皮层知识之间的协同效应。

链接: https://arxiv.org/abs/2506.02477
作者: Kunyu Wang,Xueyang Fu,Chengzhi Cao,Chengjie Ge,Wei Zhai,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current image de-raining methods primarily learn from a limited dataset, leading to inadequate performance in varied real-world rainy conditions. To tackle this, we introduce a new framework that enables networks to progressively expand their de-raining knowledge base by tapping into a growing pool of datasets, significantly boosting their adaptability. Drawing inspiration from the human brain's ability to continuously absorb and generalize from ongoing experiences, our approach borrows the mechanism of the complementary learning system. Specifically, we first deploy Generative Adversarial Networks (GANs) to capture and retain the unique features of new data, mirroring the hippocampus's role in learning and memory. Then, the de-raining network is trained with both existing and GAN-synthesized data, mimicking the process of hippocampal replay and interleaved learning. Furthermore, we employ knowledge distillation with the replayed data to replicate the synergy between the neocortex's activity patterns triggered by hippocampal replays and the pre-existing neocortical knowledge. This comprehensive framework empowers the de-raining network to amass knowledge from various datasets, continually enhancing its performance on previously unseen rainy scenes. Our testing on three benchmark de-raining networks confirms the framework's effectiveness. It not only facilitates continuous knowledge accumulation across six datasets but also surpasses state-of-the-art methods in generalizing to new real-world scenarios.
zh
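
下面用一段示意性训练代码说明“GAN 回放 + 知识蒸馏”的单步更新:新数据走有监督重建损失,GAN 合成的回放样本向冻结的旧模型蒸馏。网络结构、损失形式与超参数均为本文假设的最简版本,仅示意互补学习系统的组织方式。

```python
import torch
import torch.nn.functional as F

def continual_derain_step(model, old_model, gan, new_rainy, new_clean,
                          optimizer, alpha: float = 0.5):
    """单步持续学习更新:GAN 回放 + 知识蒸馏(接口为本文假设)。"""
    replay_rainy = gan(torch.randn(new_rainy.size(0), 64))   # GAN 合成旧域回放样本
    pred = model(torch.cat([new_rainy, replay_rainy]))

    # 新数据:有监督重建损失(类比海马体吸收新经验)
    task_loss = F.l1_loss(pred[: new_rainy.size(0)], new_clean)
    # 回放数据:向旧模型蒸馏(类比重放与既有新皮层知识的协同)
    with torch.no_grad():
        teacher = old_model(replay_rainy)
    distill_loss = F.l1_loss(pred[new_rainy.size(0):], teacher)

    loss = task_loss + alpha * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# 以极简网络演示接口
derain = torch.nn.Conv2d(3, 3, 3, padding=1)
frozen = torch.nn.Conv2d(3, 3, 3, padding=1).eval()          # 冻结的旧模型
toy_gan = lambda z: torch.rand(z.size(0), 3, 16, 16)         # 占位生成器
opt = torch.optim.Adam(derain.parameters(), lr=1e-4)
loss = continual_derain_step(derain, frozen, toy_gan,
                             torch.rand(4, 3, 16, 16), torch.rand(4, 3, 16, 16), opt)
print(f"loss = {loss:.4f}")
```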

[CV-101] Generative Perception of Shape and Material from Differential Motion

【速读】:该论文试图解决从单张图像中感知物体形状和材质的固有歧义问题,尤其是在光照未知且无约束的情况下。其解决方案的关键在于引入一种新型的条件去噪扩散模型,该模型能够从物体经历差异运动的短视频中生成形状与材质图。该模型通过参数高效的架构直接在像素空间中进行训练,并能同时生成物体的多个解耦属性,从而有效捕捉并解析视觉歧义。

链接: https://arxiv.org/abs/2506.02473
作者: Xinran Nicole Han,Ko Nishino,Todd Zickler
机构: Harvard University (哈佛大学); Kyoto University (京都大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically-embodied systems.
zh

[CV-102] HRTR: A Single-stage Transformer for Fine-grained Sub-second Action Segmentation in Stroke Rehabilitation

【速读】:该论文旨在解决中风康复过程中对患者运动进行精确追踪的问题,特别是在复杂康复训练中实现细粒度和亚秒级(小于一秒)动作检测的两大挑战。其解决方案的关键在于提出一种单阶段的高分辨率时间变换器(High Resolution Temporal Transformer, HRTR),该模型能够直接对高分辨率、细粒度的亚秒级动作进行时间定位与分类,无需多阶段处理和后处理步骤。在无需任何优化的情况下,HRTR在中风相关数据集和通用数据集上均优于现有最先进系统。

链接: https://arxiv.org/abs/2506.02472
作者: Halil Ismail Helvaci,Justin Philip Huber,Jihye Bae,Sen-ching Samson Cheung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stroke rehabilitation often demands precise tracking of patient movements to monitor progress, with complexities of rehabilitation exercises presenting two critical challenges: fine-grained and sub-second (under one-second) action detection. In this work, we propose the High Resolution Temporal Transformer (HRTR), to time-localize and classify high-resolution (fine-grained), sub-second actions in a single-stage transformer, eliminating the need for multi-stage methods and post-processing. Without any refinements, HRTR outperforms state-of-the-art systems on both stroke related and general datasets, achieving Edit Score (ES) of 70.1 on StrokeRehab Video, 69.4 on StrokeRehab IMU, and 88.4 on 50Salads.
zh
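
摘要以 Edit Score (ES) 报告结果。动作分割文献中 ES 的通用定义是:先把逐帧预测压缩成段序列,再用 Levenshtein 距离做归一化。下面的实现取该通用定义,未必与论文评测脚本逐字一致:

```python
def to_segments(frame_labels):
    """把逐帧标签压缩成段序列,例如 [a,a,b,b,b,c] -> [a,b,c]。"""
    segs = []
    for lab in frame_labels:
        if not segs or segs[-1] != lab:
            segs.append(lab)
    return segs

def edit_score(pred_frames, gt_frames) -> float:
    """Edit Score = (1 - 段序列 Levenshtein 距离 / max(len)) * 100。"""
    p, g = to_segments(pred_frames), to_segments(gt_frames)
    m, n = len(p), len(g)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return (1.0 - d[m][n] / max(m, n, 1)) * 100.0

pred = ["reach"] * 10 + ["grasp"] * 3 + ["move"] * 7
gt   = ["reach"] * 8 + ["grasp"] * 5 + ["move"] * 7
print(edit_score(pred, gt))  # 段序列一致 -> 100.0,对帧级边界偏移不敏感
```

可见 ES 只惩罚段的顺序与碎片化错误,这使它适合衡量亚秒级动作是否被漏检或过度切分。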

[CV-103] Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning CVPR2025

【速读】:该论文旨在解决持续测试时自适应目标检测(Continual Test-Time Adaptive Object Detection, CTTA-OD)中计算效率不足的问题,特别是在资源受限场景下,现有方法过于关注检测效果而忽视了计算开销。其解决方案的关键在于通过剪枝(pruning)策略优化模型,在保持检测性能的同时降低计算复杂度。具体而言,提出了一种基于敏感度引导的通道剪枝方法,通过量化图像和实例级别特征通道对领域差异的敏感度,结合加权稀疏正则化选择性抑制并剪除敏感通道,同时引入随机通道重激活机制以恢复可能有用的特征,从而在提升适应性能的同时减少12%的FLOPs计算量。

链接: https://arxiv.org/abs/2506.02462
作者: Kunyu Wang,Xueyang Fu,Xin Lu,Chengjie Ge,Chengzhi Cao,Wei Zhai,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as CVPR 2025 oral paper

点击查看摘要

Abstract:Continual test-time adaptive object detection (CTTA-OD) aims to online adapt a source pre-trained detector to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial for resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method via pruning. Our motivation stems from the observation that not all learned source features are beneficial; certain domain-sensitive feature channels can adversely affect target domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that quantifies each channel based on its sensitivity to domain discrepancies at both image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation efforts on invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling recovery of potentially useful features and mitigating the risks of early pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared to the recent SOTA method.
zh
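
敏感度引导剪枝与随机重激活的核心逻辑可以浓缩为下面的草图:以源/目标域特征的通道统计差异作为敏感度的示意度量。论文在图像与实例两个层级上量化敏感度,并配合加权稀疏正则训练;此处仅保留“一次性剪枝 + 随机重激活”两步,属本文假设的简化。

```python
import torch

def channel_sensitivity(feat_src: torch.Tensor, feat_tgt: torch.Tensor) -> torch.Tensor:
    """以源/目标域特征的通道均值差异作为域敏感度的示意度量(非论文原式)。
    feat_*: [N, C, H, W]"""
    mu_s = feat_src.mean(dim=(0, 2, 3))
    mu_t = feat_tgt.mean(dim=(0, 2, 3))
    return (mu_s - mu_t).abs()

def prune_and_reactivate(sens: torch.Tensor, prune_ratio=0.2, react_p=0.05):
    """生成通道掩码:剪掉最敏感的 prune_ratio 通道,
    并以 react_p 的概率随机重激活,为早期误剪留出恢复机会。"""
    c = sens.numel()
    mask = torch.ones(c)
    mask[sens.topk(int(c * prune_ratio)).indices] = 0.0   # 剪除域敏感通道
    revive = (torch.rand(c) < react_p) & (mask == 0)
    mask[revive] = 1.0                                    # 随机重激活
    return mask

feat_src = torch.randn(8, 32, 14, 14)
feat_tgt = torch.randn(8, 32, 14, 14) + 0.3               # 模拟域偏移
mask = prune_and_reactivate(channel_sensitivity(feat_src, feat_tgt))
print(f"保留通道数: {int(mask.sum())}/32")
# 使用方式:feat = feat * mask.view(1, -1, 1, 1),把适应算力集中到域不变通道
```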

[CV-104] ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment

【速读】:该论文旨在解决3D室内场景合成与编辑中存在的语义表达不足、空间推理能力有限以及对复杂布局建模不准确等问题。现有方法要么简化物体语义,要么依赖掩码扩散或地板平面图,难以捕捉复杂的场景结构。其解决方案的关键在于引入ReSpace框架,该框架基于自回归语言模型,采用紧凑的结构化场景表示,明确界定房间边界,并将场景编辑任务转化为下一个标记预测问题,从而实现更丰富的语义理解和更精确的空间布局生成。

链接: https://arxiv.org/abs/2506.02459
作者: Martin JJ. Bucher,Iro Armeni
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 17 figures (incl. appendix)

点击查看摘要

Abstract:Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., ‘chair’ or ‘table’), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. In contrast, LLM-based methods enable richer semantics via natural language (e.g., ‘modern studio with light wood furniture’) but do not support editing, remain limited to rectangular layouts or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a novel voxelization-based evaluation that captures fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on object addition while maintaining competitive results on full scene synthesis.
zh

[CV-105] PAID: Pairwise Angular-Invariant Decomposition for Continual Test-Time Adaptation

【速读】:该论文试图解决持续测试时适应(Continual Test-Time Adaptation, CTTA)中模型在推理过程中适应变化环境的问题,现有方法主要关注目标数据的利用,而忽视了预训练权重中蕴含的未充分利用的领域不变先验信息。解决方案的关键在于利用预训练权重的几何属性,特别是成对角度结构(pairwise angular structure),该结构在多种退化领域中保持稳定,并编码了领域不变语义信息,因此在适应过程中应被保留。基于此,作者提出了PAID(Pairwise Angular-Invariant Decomposition)方法,通过分解权重为幅度和方向,并引入可学习的正交矩阵来全局旋转方向,同时保持成对角度结构不变,从而实现有效的CTTA。

链接: https://arxiv.org/abs/2506.02453
作者: Kunyu Wang,Xueyang Fu,Yunfei Bao,Chengjie Ge,Chengzhi Cao,Wei Zhai,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual Test-Time Adaptation (CTTA) aims to online adapt a pre-trained model to changing environments during inference. Most existing methods focus on exploiting target data, while overlooking another crucial source of information, the pre-trained weights, which encode underutilized domain-invariant priors. This paper takes the geometric attributes of pre-trained weights as a starting point, systematically analyzing three key components: magnitude, absolute angle, and pairwise angular structure. We find that the pairwise angular structure remains stable across diverse corrupted domains and encodes domain-invariant semantic information, suggesting it should be preserved during adaptation. Based on this insight, we propose PAID (Pairwise Angular-Invariant Decomposition), a prior-driven CTTA method that decomposes weight into magnitude and direction, and introduces a learnable orthogonal matrix via Householder reflections to globally rotate direction while preserving the pairwise angular structure. During adaptation, only the magnitudes and the orthogonal matrices are updated. PAID achieves consistent improvements over recent SOTA methods on four widely used CTTA benchmarks, demonstrating that preserving pairwise angular structure offers a simple yet effective principle for CTTA.
zh
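
PAID 的关键操作——幅度/方向分解加上由 Householder 反射构造的可学习正交旋转——可用如下草图验证其性质:正交变换保持内积,因此方向间的成对夹角(成对角度结构)在旋转前后严格不变。反射个数与按行分解的粒度为本文假设。

```python
import torch
import torch.nn as nn

class HouseholderRotation(nn.Module):
    """由 k 个 Householder 反射 H = I - 2 vv^T/||v||^2 连乘得到的正交矩阵。"""
    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(k, dim))

    def matrix(self) -> torch.Tensor:
        q = torch.eye(self.vs.shape[1])
        for v in self.vs:
            v = v / v.norm()
            q = q - 2.0 * torch.outer(v, v @ q)   # 等价于 (I - 2vv^T) @ q
        return q

def paid_weight(w: torch.Tensor, rot: HouseholderRotation,
                log_mag: torch.Tensor) -> torch.Tensor:
    """示意:权重分解为幅度与方向,仅更新幅度与全局正交旋转。"""
    direction = w / w.norm(dim=1, keepdim=True)   # 按行取方向
    return log_mag.exp().unsqueeze(1) * (direction @ rot.matrix().T)

w = torch.randn(8, 16)
rot = HouseholderRotation(dim=16)
log_mag = nn.Parameter(w.norm(dim=1).log())       # 适应期间可学习的幅度
w_adapt = paid_weight(w, rot, log_mag)

# 验证:旋转前后任意两行方向的余弦(成对角度结构)不变
d0 = torch.nn.functional.normalize(w, dim=1)
d1 = torch.nn.functional.normalize(w_adapt, dim=1)
print(torch.allclose(d0 @ d0.T, d1 @ d1.T, atol=1e-5))  # True
```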

[CV-106] ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

【速读】:该论文旨在解决扩散模型在文本到动作生成任务中因静态语义条件忽略时频需求而导致的语义对齐不足问题,具体表现为早期去噪阶段需要结构语义构建运动基础,而后期阶段则需要局部细节以实现文本对齐。其解决方案的关键在于提出一种自适应神经时序感知架构(ANT),通过三个核心模块实现:语义时序自适应模块(STA),通过频谱分析自动划分去噪过程为低频结构规划与高频细化;动态无分类器引导调度(DCFG),自适应调整条件与非条件比例以提高效率并保持精度;以及时序-语义重加权,量化对齐文本影响与阶段需求。

链接: https://arxiv.org/abs/2506.02452
作者: Wenshuo Chen,Kuimou Yu,Haozhe Jia,Kaishen Yuan,Bowen Tian,Songning Lai,Hongru Xiao,Erhang Zhang,Lei Wang,Yutao Yue
机构: HKUST(GZ)(香港科技大学(广州)); Tongji University(同济大学); Shandong University(山东大学); Australian National University & Data61/CSIRO(澳大利亚国立大学 & Data61/CSIRO); Thrust of Artificial Intelligence and Thrust of Intelligent Transportation(人工智能研究组和智能交通研究组)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose ANT, an Adaptive Neural Temporal-Aware architecture. ANT orchestrates semantic granularity through: (i) Semantic Temporally Adaptive (STA) Module: Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. (ii) Dynamic Classifier-Free Guidance scheduling (DCFG): Adaptively adjusts the conditional-to-unconditional ratio, enhancing efficiency while maintaining fidelity. (iii) Temporal-semantic reweighting: Quantitatively aligns text influence with phase requirements. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.
zh
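
摘要未给出 DCFG 的闭式公式,这里仅以一条余弦调度示意“随去噪阶段动态调整条件/无条件引导比例”的做法:早期低频结构规划阶段用较小引导系数,后期文本对齐阶段逐步增大。函数形状与取值范围均为本文假设,并非论文原式。

```python
import math

def dcfg_scale(t: float, w_max: float = 7.5, w_min: float = 2.0) -> float:
    """示意性的动态 CFG 调度:t 从 1(纯噪声)单调走到 0(干净样本)。"""
    return w_min + (w_max - w_min) * 0.5 * (1 + math.cos(math.pi * t))

for t in (1.0, 0.5, 0.0):
    print(f"t={t:.1f} -> w={dcfg_scale(t):.2f}")   # 2.00 / 4.75 / 7.50

# 在采样循环中的用法(标准 CFG 组合式):
#   eps = eps_uncond + dcfg_scale(t) * (eps_cond - eps_uncond)
```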

[CV-107] VidEvent: A Large Dataset for Understanding Dynamic Evolution of Events in Videos

【速读】:该论文试图解决视频中事件理解的挑战性问题,即由于视频结构复杂、语义层次丰富以及动态演变特性,导致AI难以准确理解和预测视觉事件。解决方案的关键在于提出视频事件理解任务,通过提取事件脚本并利用这些脚本进行预测,同时引入了VidEvent数据集,该数据集包含超过23,000个标注良好的事件,具备详细的事件结构、广泛层次和逻辑关系,为相关研究提供了高质量的数据支持和基准模型。

链接: https://arxiv.org/abs/2506.02448
作者: Baoyu Liang,Qile Su,Shoutai Zhu,Yuchen Liang,Chao Tong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the significant impact of visual events on human cognition, understanding events in videos remains a challenging task for AI due to their complex structures, semantic hierarchies, and dynamic evolution. To address this, we propose the task of video event understanding that extracts event scripts and makes predictions with these scripts from videos. To support this task, we introduce VidEvent, a large-scale dataset containing over 23,000 well-labeled events, featuring detailed event structures, broad hierarchies, and logical relations extracted from movie recap videos. The dataset was created through a meticulous annotation process, ensuring high-quality and reliable event data. We also provide comprehensive baseline models offering detailed descriptions of their architecture and performance metrics. These models serve as benchmarks for future research, facilitating comparisons and improvements. Our analysis of VidEvent and the baseline models highlights the dataset’s potential to advance video event understanding and encourages the exploration of innovative algorithms and models. The dataset and related resources are publicly available at this http URL.
zh

[CV-108] SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

【速读】:该论文旨在解决当前3D手-物体交互(Hand-Object Interaction, HOI)运动生成方法依赖预定义3D物体模型和实验室采集的运动数据,从而限制了泛化能力的问题,以及HOI视频生成方法过于关注像素级视觉保真度而牺牲物理合理性的缺陷。其解决方案的关键在于提出一种结合视觉先验与动态约束的同步扩散过程框架,通过三模态自适应调制实现异构语义、外观和运动特征的对齐,并利用3D全注意力机制建模模态间与模态内依赖关系,同时引入视觉感知的3D交互扩散模型,直接从同步扩散输出生成显式3D交互序列并形成闭环反馈,从而消除对预定义物体模型或显式姿态引导的依赖,显著提升视频与运动的一致性。

链接: https://arxiv.org/abs/2506.02444
作者: Lingwei Dang,Ruizhi Shao,Hongwen Zhang,Wei Min,Yebin Liu,Qingyao Wu
机构: South China University of Technology (华南理工大学); Tsinghua University (清华大学); Beijing Normal University (北京师范大学); Shadow AI (影子人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at this https URL.
zh

[CV-109] Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

【速读】:该论文旨在解决视频级可见-红外行人重识别(Video-based Visible-Infrared Person Re-Identification, VVI-ReID)中由于模态差异导致的跨模态匹配问题,具体表现为如何生成并利用共享的视频级语言提示来弥合可见光与红外模态之间的差距。解决方案的关键在于提出一种名为视频级语言驱动的VVI-ReID(VLD)框架,其核心模块包括不变模态语言提示(Invariant-Modality Language Prompting, IMLP)和时空提示(Spatial-Temporal Prompting, STP)。IMLP通过联合微调视觉编码器和提示学习器生成跨模态文本提示,并在CLIP的多模态空间中对齐不同模态的视觉特征,从而缓解模态差异;STP则通过时空中心(STH)和时空聚合(STA)子模块建模时空信息,并将其融入文本提示中以增强特征表达。

链接: https://arxiv.org/abs/2506.02439
作者: Shuang Li,Jiaxu Leng,Changjiang Kuang,Mingpi Tan,Xinbo Gao
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学); Chongqing Institute for Brain and Intelligence (重庆脑与智能科学中心); Guangyang Bay Laboratory (广阳湾实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TIFS

点击查看摘要

Abstract:Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP’s multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at this https URL.
zh

[CV-110] Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals

【速读】:该论文试图解决多模态功能神经成像在获取成本高、可行性受限以及特定群体数据不足导致的脑机接口(BCI)解码模型不公平问题。解决方案的关键在于通过生成式人工智能(Generative AI)构建一个统一的表示框架,将多模态功能神经成像映射到统一表示空间,从而生成获取受限模态和欠代表群体的数据,提升模型性能与公平性。

链接: https://arxiv.org/abs/2506.02433
作者: Weiheng Yao,Xuhang Chen,Shuqiang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal functional neuroimaging enables systematic analysis of brain mechanisms and provides discriminative representations for brain-computer interface (BCI) decoding. However, its acquisition is constrained by high costs and feasibility limitations. Moreover, underrepresentation of specific groups undermines fairness of BCI decoding model. To address these challenges, we propose a unified representation framework for multimodal functional neuroimaging via generative artificial intelligence (AI). By mapping multimodal functional neuroimaging into a unified representation space, the proposed framework is capable of generating data for acquisition-constrained modalities and underrepresented groups. Experiments show that the framework can generate data consistent with real brain activity patterns, provide insights into brain mechanisms, and improve performance on downstream tasks. More importantly, it can enhance model fairness by augmenting data for underrepresented groups. Overall, the framework offers a new paradigm for decreasing the cost of acquiring multimodal functional neuroimages and enhancing the fairness of BCI decoding models.
zh

[CV-111] Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models MICCAI2025

【速读】:该论文试图解决医学图像配准中因解剖结构在不同模态或不同图像中缺失而导致的对齐不准确问题,特别是在强度基础上的相似性损失无法有效捕捉语义对应关系的情况下。解决方案的关键在于利用预训练的扩散模型(diffusion models)提取的特征作为语义相似性度量,以指导可变形图像配准网络,从而实现对跨图像中存在的有意义结构的准确对齐,而忽略那些不存在的结构。

链接: https://arxiv.org/abs/2506.02419
作者: Nurislam Tursynbek,Hastings Greer,Basar Demir,Marc Niethammer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025

点击查看摘要

Abstract:Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: this https URL
zh
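
“用扩散模型特征做配准相似性度量”的训练信号可写成如下草图:对固定图与按位移场重采样后的浮动图分别提取特征,用余弦相似度构造损失。其中 feat_fn 是假设的特征提取接口(真实实现取自预训练扩散模型 U-Net 的中间层激活),此处以一个随机卷积占位。

```python
import torch
import torch.nn.functional as F

def warp_bilinear(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """按位移场 flow([B,2,H,W],单位:像素)对特征图做双线性重采样。"""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()[None]     # [1,H,W,2] 像素网格
    g = base + flow.permute(0, 2, 3, 1)
    gx = g[..., 0] / (w - 1) * 2 - 1                       # 归一化到 [-1,1]
    gy = g[..., 1] / (h - 1) * 2 - 1
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def diffusion_feature_loss(feat_fn, fixed, moving, flow) -> torch.Tensor:
    """以(扩散模型)特征的余弦相似度作为配准相似性损失:越相似损失越低。"""
    f_fixed = feat_fn(fixed)
    f_warped = warp_bilinear(feat_fn(moving), flow)
    return 1.0 - F.cosine_similarity(f_fixed, f_warped, dim=1).mean()

toy_feat = torch.nn.Conv2d(1, 8, 3, padding=1)   # 占位:真实实现取扩散 U-Net 中间层
fixed, moving = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
flow = torch.zeros(1, 2, 32, 32, requires_grad=True)   # 配准网络输出的形变场
loss = diffusion_feature_loss(toy_feat, fixed, moving, flow)
loss.backward()                                  # 梯度可回传给配准网络
print(f"loss = {loss.item():.4f}")
```

相比强度损失,这种语义特征相似度可以对“一幅图中有、另一幅图中缺失”的结构自然降权,这正是摘要所述优势的来源。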

[CV-112] Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

【速读】:该论文旨在解决计算病理学(CPath)中预训练编码器与多实例学习(MIL)聚合器结合时存在的性能限制问题,特别是由于编码器未针对下游任务进行微调以及与MIL的优化不一致所导致的缺陷。其解决方案的关键在于重新审视端到端(E2E)学习,并提出一种新的MIL方法ABMILX,通过基于全局相关性的注意力细化和多头机制缓解稀疏注意力MIL带来的优化挑战,同时结合高效的多尺度随机切片采样策略,实现了在多个基准测试中超越现有最先进模型的性能,同时保持计算效率。

链接: https://arxiv.org/abs/2506.02408
作者: Wenhao Tang,Rong Qin,Heng Fang,Fengtao Zhou,Hao Chen,Xiang Li,Ming-Ming Cheng
机构: Nankai University (南开大学); Chongqing University (重庆大学); The Hong Kong University of Science and Technology (香港科技大学); Nankai International Advanced Research Institute (Shenzhen Futian) (南开国际先进研究院(深圳福田))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (10 RTX3090 hours). We show the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at this https URL.
zh
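
作为背景,下面给出论文所改进的经典注意力 MIL 聚合(ABMIL,Ilse et al., 2018)的标准实现;ABMILX 在其上增加的全局相关性注意力细化与多头机制属论文贡献,此处不做复现,特征维度等取假设值。

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """经典 ABMIL 注意力聚合:对一张全片(slide)内的 patch 特征加权求和。"""
    def __init__(self, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, patches: torch.Tensor):
        # patches: [N, D],一张 slide 的 N 个 patch 特征
        a = torch.softmax(self.attn(patches), dim=0)   # [N, 1] 注意力权重
        return (a * patches).sum(dim=0), a             # slide 级表示 + 权重

pool = AttentionMILPooling()
slide_feat, attn = pool(torch.randn(1000, 512))
print(slide_feat.shape, attn.shape)   # torch.Size([512]) torch.Size([1000, 1])
```

论文指出:当这类注意力高度稀疏(极少数 patch 占据几乎全部权重)时,端到端训练中编码器只能从极少样本获得有效梯度,这正是 ABMILX 试图通过全局相关性细化缓解的优化难点。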

[CV-113] Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models

【速读】:该论文旨在解决真实世界中复杂迭代篡改图像的模型溯源问题,即在多阶段、多工具对同一图像进行多次修改后,如何准确识别参与修改的生成式AI模型并重建其编辑序列。现有方法在单阶段篡改场景中表现良好,但在面对复杂的多阶段篡改时效果显著下降。论文提出的关键解决方案是引入一种名为“Modelship Attribution”的任务,并设计了针对该任务的模型溯源Transformer(MAT),通过捕捉各模型独特的编辑模式而非依赖混合的指纹特征,从而有效实现对多阶段篡改流程中各模型贡献的识别与归属。

链接: https://arxiv.org/abs/2506.02405
作者: Zhiya Tan,Xin Zhang,Joey Tianyi Zhou
机构: Nanyang Technological University (南洋理工大学); Agency for Science, Technology and Research (科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As generative techniques become increasingly accessible, authentic visuals are frequently subjected to iterative alterations by various individuals employing a variety of tools. Currently, to avoid misinformation and ensure accountability, a lot of research on detection and attribution is emerging. Although these methods demonstrate promise in single-stage manipulation scenarios, they fall short when addressing complex real-world iterative manipulation. In this paper, we are the first, to the best of our knowledge, to systematically model this real-world challenge and introduce a novel method to solve it. We define a task called “Modelship Attribution”, which aims to trace the evolution of manipulated images by identifying the generative models involved and reconstructing the sequence of edits they performed. To realistically simulate this scenario, we utilize three generative models, StyleMapGAN, DiffSwap, and FacePartsSwap, that sequentially modify distinct regions of the same image. This process leads to the creation of the first modelship dataset, comprising 83,700 images (16,740 images × 5). Given that later edits often overwrite the fingerprints of earlier models, the focus shifts from extracting blended fingerprints to characterizing each model's distinctive editing patterns. To tackle this challenge, we introduce the modelship attribution transformer (MAT), a purpose-built framework designed to effectively recognize and attribute the contributions of various models within complex, multi-stage manipulation workflows. Through extensive experiments and comparative analysis with other related methods, our results, including comprehensive ablation studies, demonstrate that the proposed approach is a highly effective solution for modelship attribution.
zh

[CV-114] Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather

【速读】:该论文旨在解决LiDAR语义分割模型在恶劣天气条件下精度下降的问题,特别是针对点云几何结构和反射强度的异构域偏移带来的负面影响。其解决方案的关键在于提出一种新颖的几何-反射协作(Geometry-Reflectance Collaboration, GRC)框架,该框架通过双分支结构独立处理几何与反射特征,并引入鲁棒的多层级特征协作模块以抑制冗余和不可靠信息,从而有效提取场景内在信息并抑制干扰,提升模型在恶劣天气条件下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2506.02396
作者: Longyu Yang,Ping Hu,Shangbo Yuan,Lu Zhang,Jun Liu,Hengtao Shen,Xiaofeng Zhu
机构: UESTC(电子科技大学); DLUT(大连理工大学); Lancaster University(兰卡斯特大学); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture designed to independently process geometric and reflectance features initially, thereby capitalizing on their distinct characteristics. Then, GRC adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.
zh

[CV-115] he Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception

【速读】:该论文试图解决夜间雾霾图像转换为日间等效亮度的问题,现有方法存在两个关键局限:一是数据集忽略了昼夜亮度关系,导致图像合成中的亮度映射与现实不一致;二是模型未显式引入日间亮度知识,限制了真实光照重建能力。解决方案的关键在于提出基于扩散的夜间去雾(Diffusion-Based Nighttime Dehazing, DiffND)框架,其核心在于数据合成管道确保合成与真实场景的亮度一致性,并通过结合预训练扩散模型与亮度感知网络的恢复模型,实现亮度感知的优化,从而提升联合去雾与亮度映射性能。

链接: https://arxiv.org/abs/2506.02395
作者: Xiaofeng Cong,Yu-Xin Zhang,Haoran Wei,Yeying Jin,Junming Hou,Jie Gui,Jing Zhang,Dacheng Tao
机构: Southeast University (东南大学); UESTC (电子科技大学); Tencent Company (腾讯公司); Wuhan University (武汉大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While nighttime image dehazing has been extensively studied, converting nighttime hazy images to daytime-equivalent brightness remains largely unaddressed. Existing methods face two critical limitations: (1) datasets overlook the brightness relationship between day and night, resulting in the brightness mapping being inconsistent with the real world during image synthesis; and (2) models do not explicitly incorporate daytime brightness knowledge, limiting their ability to reconstruct realistic lighting. To address these challenges, we introduce the Diffusion-Based Nighttime Dehazing (DiffND) framework, which excels in both data synthesis and lighting reconstruction. Our approach starts with a data synthesis pipeline that simulates severe distortions while enforcing brightness consistency between synthetic and real-world scenes, providing a strong foundation for learning night-to-day brightness mapping. Next, we propose a restoration model that integrates a pre-trained diffusion model guided by a brightness perception network. This design harnesses the diffusion model’s generative ability while adapting it to nighttime dehazing through brightness-aware optimization. Experiments validate our dataset’s utility and the model’s superior performance in joint haze removal and brightness mapping.
zh

[CV-116] RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection

【速读】:该论文旨在解决红外小目标检测中的挑战性问题,即在目标具有尺寸小、亮度低、无固定形状和动态变化等特性的情况下,实现高效且准确的检测。其解决方案的关键在于提出一种循环可重用卷积注意力网络(RRCA-Net),该网络通过引入循环可重用卷积块(RuCB)在不增加额外参数的前提下,有效保持并细化深层特征中的目标高层信息;同时,结合双交互注意力聚合模块(DIAAM)促进精炼信息的相互增强与融合,从而提升相邻层间上下文信息的相关性。此外,设计了一种受目标特性启发的损失函数(DpT-k loss),以确保模型稳定收敛。

链接: https://arxiv.org/abs/2506.02393
作者: Yongxian Liu,Boyang Li,Ting Liu,Zaiping Lin,Wei An
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (e.g. NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and act as a plug and play module to introduce consistent performance improvement for several popular IRSTD methods. Our code will be available at this https URL soon.
zh
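
“可重用卷积块以循环方式调用、不引入额外参数”的思想可以用下面的最小模块说明:同一组卷积权重被迭代 n_iter 次、以残差方式反复细化特征。具体层数与是否使用残差连接为本文假设的简化。

```python
import torch
import torch.nn as nn

class ReusableConvBlock(nn.Module):
    """RuCB 思路的示意:同一 body 被循环调用 n_iter 次,参数量不随迭代增加。"""
    def __init__(self, channels: int, n_iter: int = 3):
        super().__init__()
        self.n_iter = n_iter
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        for _ in range(self.n_iter):       # 权重共享的重复迭代
            x = x + self.body(x)           # 残差式细化,保持小目标的高层信息
        return x

block = ReusableConvBlock(32)
y = block(torch.randn(1, 32, 64, 64))
n_params = sum(p.numel() for p in block.parameters())
print(y.shape, f"参数量与单次前向相同: {n_params}")
```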

[CV-117] Multi-level and Multi-modal Action Anticipation ICIP

【速读】:该论文旨在解决动作预测(action anticipation)问题,即从部分观测的视频中预测未来的动作,这一任务需要处理不完整信息并具备时间推理能力和不确定性处理能力。传统方法通常仅关注视觉模态,忽略了多源信息融合的潜力。该论文提出的解决方案关键在于引入一种多层级、多模态的动作预测方法(m\m-Ant),通过结合视觉和文本线索,并显式建模层次化语义信息,以提高预测准确性。此外,为解决粗粒度动作标签不准确的问题,提出了细粒度标签生成器与专用的时间一致性损失函数,从而优化模型性能。

链接: https://arxiv.org/abs/2506.02382
作者: Seulgi Kim,Ghazal Kaviani,Mohit Prabhushankar,Ghassan AlRegib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in 2025 IEEE International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning and inherent uncertainty handling. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce Multi-level and Multi-modal Action Anticipation (m&m-Ant), a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: this https URL.
zh

[CV-118] EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 生成的高保真三维场景在虚拟现实(VR)中进行六自由度(6-DoF)导航时缺乏真实用户导航数据的问题。现有数据无法支持对3DGS启用的应用程序进行开发、评估及渲染性能优化。其解决方案的关键在于构建EyeNavGS数据集,这是首个公开的6-DoF导航数据集,包含46名参与者在12个多样化真实世界3DGS场景中的探索轨迹,通过Meta Quest Pro头显记录了每个渲染帧的头部姿态和眼动数据,并对场景进行了精确初始化以确保沉浸式体验的舒适性。

链接: https://arxiv.org/abs/2506.02380
作者: Zihao Ding,Cheng-Tse Lee,Mufeng Zhu,Tao Guan,Yuan-Chun Sun,Cheng-Hsin Hsu,Yao Liu
机构: Rutgers University (罗格斯大学); National Tsing Hua University (国立清华大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites, using the Meta Quest Pro headsets, recording the head pose and eye gaze data for each rendered frame during free world standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually-comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: this https URL.
zh

[CV-119] ViTNF: Leveraging Neural Fields to Boost Vision Transformers in Generalized Category Discovery

【速读】:该论文旨在解决通用类别发现(Generalized Category Discovery, GCD)任务中,如何利用已知类别数据识别未知类别样本的问题。传统方法在训练过程中将多层感知机(MLP)头与整个网络同步训练,导致训练成本高且难以充分发挥特征提取器的潜力。论文提出的解决方案关键在于用基于神经场(Neural Field, NF)的分类器替代MLP头,该分类器由两个耦合的静态神经场组成,分别用于存储支持样本的特征信息、已知类别的表示以及支持样本类别信息的跨场连接,从而提升了模型在少样本条件下的性能,并显著降低了对训练样本的需求和模型训练难度。

链接: https://arxiv.org/abs/2506.02367
作者: Jiayi Su,Dequan Jin
机构: Guangxi University (广西大学); Shanghai Institute of Applied Mathematics and Mechanics (上海应用数学和力学研究所); Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 3 figures

点击查看摘要

Abstract:Generalized category discovery (GCD) is a highly popular task in open-world recognition, aiming to identify unknown class samples using known class data. By leveraging pre-training, meta-training, and fine-tuning, ViT achieves excellent few-shot learning capabilities. Its MLP head is a feedforward network, trained synchronously with the entire network in the same process, increasing the training cost and difficulty without fully leveraging the power of the feature extractor. This paper proposes a new architecture by replacing the MLP head with a neural field-based one. We first present a new static neural field function to describe the activity distribution of the neural field and then use two static neural field functions to build an efficient few-shot classifier. This neural field-based (NF) classifier consists of two coupled static neural fields. It stores the feature information of support samples by its elementary field, the known categories by its high-level field, and the category information of support samples by its cross-field connections. We replace the MLP head with the proposed NF classifier, resulting in a novel architecture ViTNF, and simplify the three-stage training mode by pre-training the feature extractor on source tasks and training the NF classifier with support samples in meta-testing separately, significantly reducing ViT’s demand for training samples and the difficulty of model training. To enhance the model’s capability in identifying new categories, we provide an effective algorithm to determine the lateral interaction scale of the elementary field. Experimental results demonstrate that our model surpasses existing state-of-the-art methods on CIFAR-100, ImageNet-100, CUB-200, and Standard Cars, achieving dramatic accuracy improvements of 19% and 16% in new and all classes, respectively, indicating a notable advantage in GCD.
zh

[CV-120] Approximate Borderline Sampling using Granular-Ball for Classification Tasks

【速读】:该论文旨在解决基于粒球(Granular Ball, GB)的采样方法在处理类别边界模糊或收缩问题以及缺乏边界采样策略的局限性。其解决方案的关键在于提出一种基于GB的近似边界采样方法(GBABS),该方法通过受限扩散的GB生成(RD-GBG)技术防止GB重叠,同时利用异质最近邻概念实现边界样本的近似采样,从而在无需最优纯度阈值的情况下有效提升噪声数据集的分类性能。

链接: https://arxiv.org/abs/2506.02366
作者: Qin Xie,Qinghua Zhang,Shuyin Xia
机构: Chongqing Key Laboratory of Computational Intelligence (重庆智能计算重点实验室); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, the sampling method based on granular-ball (GB) has shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents GB overlaps by constrained expansion, preserving precise geometric representation of GBs via redefined ones. Second, based on the concept of heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs outstandingly on class noise datasets without the need for an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at this https URL.
zh

[CV-121] A TRPCA-Inspired Deep Unfolding Network for Hyperspectral Image Denoising via Thresholded t-SVD and Top-K Sparse Transformer

【速读】:该论文旨在解决高光谱图像(Hyperspectral Images, HSIs)在采集和传输过程中受到复杂混合噪声干扰导致的降质问题,从而提升后续分析的有效性。其解决方案的关键在于提出一种新型的深度展开网络(DU-TRPCA),通过在低秩模块与稀疏模块之间进行分阶段交替优化,实现对HSI中全局空间-光谱结构和局部异常值的高效去除。该方法结合了阈值张量奇异值分解(t-SVD)与Top-K稀疏变换器模块,不仅保留了张量鲁棒主成分分析(TRPCA)中固有的低秩逼近与稀疏精炼的交替机制,还通过注意力机制增强了表征能力,从而在严重混合噪声下实现了优于现有方法的去噪性能。

链接: https://arxiv.org/abs/2506.02364
作者: Liang Li,Jianli Zhao,Sheng Fang,Siyu Chen,Hui Sun
机构: Shandong University of Science and Technology (山东科技大学); Gosci Information Technology Co.,Ltd. (戈斯信息科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages,6 figures

点击查看摘要

Abstract:Hyperspectral images (HSIs) are often degraded by complex mixed noise during acquisition and transmission, making effective denoising essential for subsequent analysis. Recent hybrid approaches that bridge model-driven and data-driven paradigms have shown great promise. However, most of these approaches lack effective alternation between different priors or modules, resulting in loosely coupled regularization and insufficient exploitation of their complementary strengths. Inspired by tensor robust principal component analysis (TRPCA), we propose a novel deep unfolding network (DU-TRPCA) that enforces stage-wise alternation between two tightly integrated modules: low-rank and sparse. The low-rank module employs thresholded tensor singular value decomposition (t-SVD), providing a widely adopted convex surrogate for tensor low-rankness and has been demonstrated to effectively capture the global spatial-spectral structure of HSIs. The Top-K sparse transformer module adaptively imposes sparse constraints, directly matching the sparse regularization in TRPCA and enabling effective removal of localized outliers and complex noise. This tightly coupled architecture preserves the stage-wise alternation between low-rank approximation and sparse refinement inherent in TRPCA, while enhancing representational capacity through attention mechanisms. Extensive experiments on synthetic and real-world HSIs demonstrate that DU-TRPCA surpasses state-of-the-art methods under severe mixed noise, while offering interpretability benefits and stable denoising dynamics inspired by iterative optimization. Code is available at this https URL.
zh
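
DU-TRPCA 低秩模块所用的阈值化 t-SVD 是 TRPCA 中的标准算子:沿第三维做 FFT,对每个频率切片的奇异值做软阈值,再逆变换回空间域。下面是一个自包含实现与小型合成验证,阈值与数据规模为本文假设:

```python
import torch

def tsvt(x: torch.Tensor, tau: float) -> torch.Tensor:
    """阈值化 t-SVD(张量奇异值阈值算子):张量低秩性的凸代理的近端算子。
    x: [n1, n2, n3],对第三维做 FFT,在每个频率切片上做 SVD 软阈值。"""
    xf = torch.fft.fft(x, dim=2)
    out = torch.empty_like(xf)
    for k in range(x.shape[2]):
        u, s, vh = torch.linalg.svd(xf[:, :, k], full_matrices=False)
        s = torch.clamp(s - tau, min=0.0)            # 软阈值收缩奇异值
        out[:, :, k] = (u * s.to(u.dtype)) @ vh
    return torch.fft.ifft(out, dim=2).real

# 合成:低管秩(tubal rank)张量 + 稀疏大噪声,验证低秩步的去噪效果
torch.manual_seed(0)
a, b = torch.randn(30, 2), torch.randn(2, 30)
low_rank = (a @ b).unsqueeze(2).repeat(1, 1, 5)      # 沿管方向恒定 -> 管秩为 2
sparse = (torch.rand(30, 30, 5) < 0.05).float() * 5.0
noisy = low_rank + sparse
rec = tsvt(noisy, tau=10.0)
print(f"误差: 去噪前 {torch.norm(noisy - low_rank).item():.1f}"
      f" -> t-SVT 后 {torch.norm(rec - low_rank).item():.1f}")
```

在完整的 TRPCA / DU-TRPCA 迭代中,这一步与稀疏步(论文中为 Top-K 稀疏 Transformer)交替执行,分别吸收全局低秩结构与局部异常值。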

[CV-122] Auto-Labeling Data for Object Detection

【速读】:该论文试图解决在目标检测任务中依赖传统人工标注标签所带来的高成本问题,以及现有半监督或弱监督方法在功能或计算效率上的不足。其解决方案的关键在于利用预训练的视觉-语言基础模型生成特定应用场景的伪“真实”标签,从而替代传统标注过程,同时将这些自动生成的标签直接集成到现有的模型训练框架中,进而训练出轻量级且计算高效的检测模型。

链接: https://arxiv.org/abs/2506.02359
作者: Brent A. Griffin,Manushree Gangwar,Jacob Sela,Jason J. Corso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Great labels make great models. However, traditional labeling approaches for tasks like object detection have substantial costs at scale. Furthermore, alternatives to fully-supervised object detection either lose functionality or require larger models with prohibitive computational costs for inference at scale. To that end, this paper addresses the problem of training standard object detection models without any ground truth labels. Instead, we configure previously-trained vision-language foundation models to generate application-specific pseudo “ground truth” labels. These auto-generated labels directly integrate with existing model training frameworks, and we subsequently train lightweight detection models that are computationally efficient. In this way, we avoid the costs of traditional labeling, leverage the knowledge of vision-language models, and keep the efficiency of lightweight models for practical application. We perform exhaustive experiments across multiple labeling configurations, downstream inference models, and datasets to establish best practices and set an extensive auto-labeling benchmark. From our results, we find that our approach is a viable alternative to standard labeling in that it maintains competitive performance on multiple datasets and substantially reduces labeling time and costs.
zh
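
自动标注的整体流程很直接:用视觉-语言基础模型按文本类别生成伪“真值”框,过滤低置信度结果后落盘为常规检测训练格式。下面的 vlm_detect 是本文假设的统一接口(实际可由任意开放词汇检测模型充当),文件名、阈值与返回结构均为示例:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PseudoBox:
    label: str
    score: float
    xyxy: Tuple[float, float, float, float]

def auto_label(images: List[str],
               vlm_detect: Callable[[str, List[str]], List[PseudoBox]],
               classes: List[str],
               score_thresh: float = 0.3) -> dict:
    """示意性自动标注:VLM 生成伪真值框 -> 置信度过滤 -> 常规训练标注格式。"""
    dataset = {}
    for path in images:
        boxes = [b for b in vlm_detect(path, classes) if b.score >= score_thresh]
        dataset[path] = [{"category": b.label, "bbox": list(b.xyxy)} for b in boxes]
    return dataset

# 占位检测器:真实场景替换为开放词汇检测模型的推理调用
def fake_detect(path: str, classes: List[str]) -> List[PseudoBox]:
    return [PseudoBox(classes[0], 0.9, (10, 10, 50, 60))]

labels = auto_label(["img_0001.jpg"], fake_detect, ["car", "person"])
print(labels)
# 随后用 labels 训练轻量检测器:标注成本来自一次性离线推理,线上推理保持轻量模型的效率
```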

[CV-123] RoadFormer: Local-Global Feature Fusion for Road Surface Classification in Autonomous Driving

【速读】:该论文旨在解决道路表面分类(Road Surface Classification, RSC)中细粒度分类不足的问题,特别是在相似路面纹理识别方面的局限性。传统视觉分类方法未能充分挖掘细粒度的路面类型特征,导致分类性能受限。论文提出的解决方案关键在于采用纯视觉的细粒度RSC方法,通过卷积和Transformer模块的堆叠融合局部与全局特征信息,并引入前景-背景模块(Foreground-Background Module, FBM)以有效提取路面的细粒度上下文特征,从而提升复杂路面的分类能力。

链接: https://arxiv.org/abs/2506.02358
作者: Tianze Wang,Zhang Zhang,Chao Sun
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Road surface classification (RSC) aims to utilize pavement features to identify the roughness, wet and dry conditions, and material information of the road surface. Due to its ability to effectively enhance road safety and traffic management, it has received widespread attention in recent years. In autonomous driving, accurate RSC allows vehicles to better understand the road environment, adjust driving strategies, and ensure a safer and more efficient driving experience. For a long time, vision-based RSC has been favored. However, existing visual classification methods have overlooked the exploration of fine-grained classification of pavement types (such as similar pavement textures). In this work, we propose a pure vision-based fine-grained RSC method for autonomous driving scenarios, which fuses local and global feature information through the stacking of convolutional and transformer modules. We further explore the stacking strategies of local and global feature extraction modules to find the optimal feature extraction strategy. In addition, since fine-grained tasks also face the challenge of relatively large intra-class differences and relatively small inter-class differences, we propose a Foreground-Background Module (FBM) that effectively extracts fine-grained context features of the pavement, enhancing the classification ability for complex pavements. Experiments conducted on a large-scale pavement dataset containing one million samples and a simplified dataset reorganized from this dataset achieved Top-1 classification accuracies of 92.52% and 96.50%, respectively, improving by 5.69% to 12.84% compared to SOTA methods. These results demonstrate that RoadFormer outperforms existing methods in RSC tasks, providing significant progress in improving the reliability of pavement perception in autonomous driving systems.
zh

[CV-124] InterRVOS: Interaction-aware Referring Video Object Segmentation

【速读】:该论文试图解决视频中对象交互关系建模不足的问题,即现有方法主要关注单个目标对象的定位,而忽略了对象在交互中的角色及其与其他实体的关系。解决方案的关键在于引入一种新的任务——Interaction-aware referring video object segmentation (InterRVOS),该任务要求分割涉及交互的主体(actor)和目标(target)实体,并通过互补的语义表达描述交互,从而实现对对象间关系的细粒度建模。此外,研究提出了一个基准架构ReVIOSa以及针对交互理解的评估设置,以提升对复杂对象交互的建模能力。

链接: https://arxiv.org/abs/2506.02356
作者: Woojeong Jin,Seongchan Kim,Seungryong Kim
机构: KAIST AI (KAIST人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring video object segmentation aims to segment the object in a video corresponding to a given natural language expression. While prior works have explored various referring scenarios, including motion-centric or multi-instance expressions, most approaches still focus on localizing a single target object in isolation. However, in comprehensive video understanding, an object's role is often defined by its interactions with other entities, which are largely overlooked in existing datasets and models. In this work, we introduce Interaction-aware referring video object segmentation (InterRVOS), a new task that requires segmenting both actor and target entities involved in an interaction. Each interaction is described through a pair of complementary expressions from different semantic perspectives, enabling fine-grained modeling of inter-object relationships. To tackle this task, we propose InterRVOS-8K, a large-scale and automatically constructed dataset containing diverse interaction-aware expressions with corresponding masks, including challenging cases such as motion-only multi-instance expressions. We also present a baseline architecture, ReVIOSa, designed to handle actor-target segmentation from a single expression, achieving strong performance in both standard and interaction-focused settings. Furthermore, we introduce an actor-target-aware evaluation setting that enables a more targeted assessment of interaction understanding. Experimental results demonstrate that our approach outperforms prior methods in modeling complex object interactions for the referring video object segmentation task, establishing a strong foundation for future research in interaction-centric video understanding. Our project page is available at this https URL.
zh

[CV-125] RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models ACL2025

【速读】:该论文旨在解决物体导航(ObjectNav)任务中冗余探索和探索失败的问题,特别是通过及时终止探索来提高导航效率。其解决方案的关键在于提出RATE-Nav方法,该方法通过几何预测区域分割算法和基于区域的探索估计算法,实现对探索率的准确计算,并结合视觉语言模型(VLM)的视觉问答能力,从而提升导航的成功率和SPL(按路径长度加权的成功率)。

链接: https://arxiv.org/abs/2506.02354
作者: Junjie Li,Nan Zhang,Xiaoyang Qu,Kai Lu,Guokuan Li,Jiguang Wan,Jianzong Wang
机构: Huazhong University of Science and Technology (华中科技大学); Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

点击查看摘要

Abstract:Object Navigation (ObjectNav) is a fundamental task in embodied artificial intelligence. Although significant progress has been made in semantic map construction and target direction prediction in current research, redundant exploration and exploration failures remain inevitable. A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We observe a diminishing marginal effect between exploration steps and exploration rates and analyze the cost-benefit relationship of exploration. Inspired by this, we propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm for exploration rate calculation. Leveraging the visual question answering capabilities of vision-language models (VLMs) together with exploration rates enables efficient termination. RATE-Nav achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset. On the more challenging MP3D dataset, RATE-Nav shows approximately 10% improvement over previous zero-shot methods.
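As an illustration of the diminishing-marginal-effect termination rule described above, the sketch below stops exploration once the recent gain in exploration rate falls below a threshold. The region segmentation and VLM-based scoring are abstracted away; the thresholds and names here are our own assumptions, not the paper's implementation.

```python
# A minimal sketch of the diminishing-returns termination rule, assuming the
# exploration rate is the fraction of a region's free space observed so far.

def should_terminate(rate_history: list, min_gain: float = 0.005,
                     window: int = 10) -> bool:
    """Stop when the average per-step gain in exploration rate over the last
    `window` steps drops below `min_gain` (diminishing marginal effect)."""
    if len(rate_history) <= window:
        return False
    recent_gain = (rate_history[-1] - rate_history[-1 - window]) / window
    return recent_gain < min_gain

# Usage on a saturating exploration curve: termination fires once new steps
# barely increase coverage.
history = []
for step in range(40):
    history.append(1 - 0.9 ** step)   # exploration rate at this step
    if should_terminate(history):
        print(f"terminate exploration at step {step}")
        break
```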
zh

[CV-126] Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization ICML2025

【速读】:该论文旨在解决广义类别发现(Generalized Category Discovery, GCD)中由于自监督信号不可靠导致的基类区分能力不足问题。其解决方案的关键在于提出一种互惠学习框架(Reciprocal Learning Framework, RLF),通过引入一个辅助分支专门负责基类分类,在训练过程中主分支将伪基类样本传递给辅助分支,而辅助分支则为前者提供更可靠的软标签,形成良性循环;同时结合类级分布正则化(Class-wise Distribution Regularization, CDR)以缓解对基类的学习偏差,从而提升整体性能。

链接: https://arxiv.org/abs/2506.02334
作者: Duo Liu,Zhiquan Tan,Linglan Zhao,Zhongqiang Zhang,Xiangzhong Fang,Weiran Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML2025 Poster

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes are available at this https URL.
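To make the reciprocal loop concrete, here is a minimal sketch, under our own assumptions about the confidence threshold and sharpening temperature, of how the main head might filter pseudo-base samples and be supervised by the auxiliary base classifier's soft labels; the CDR term is omitted and this is not the authors' code.

```python
# Reciprocal loop sketch: the main head routes confident pseudo-base samples
# to the auxiliary base-only classifier, whose soft labels then supervise the
# base portion of the main head.
import torch
import torch.nn.functional as F

def reciprocal_loss(main_logits, aux_logits, num_base: int,
                    tau: float = 0.5, temp: float = 0.5):
    """main_logits: (B, num_base + num_novel); aux_logits: (B, num_base)."""
    probs = main_logits.softmax(dim=-1)
    base_conf, _ = probs[:, :num_base].max(dim=-1)
    mask = base_conf > tau                           # main branch filters samples
    if mask.sum() == 0:
        return main_logits.new_zeros(())
    soft = (aux_logits.detach() / temp).softmax(-1)  # sharpened auxiliary labels
    log_p = F.log_softmax(main_logits[mask, :num_base], dim=-1)
    return -(soft[mask] * log_p).sum(-1).mean()      # aux supervises the main head
```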
zh

[CV-127] Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning

【速读】:该论文旨在解决临床决策中疾病动态模拟的问题,以提升治疗效果和决策质量。其解决方案的关键在于引入医学世界模型(Medical World Model, MeWM),该模型通过结合视觉-语言模型作为策略模型、肿瘤生成模型作为动力学模型,并进一步提出逆向动力学模型,利用生存分析评估治疗效果,从而实现个体化治疗方案的优化。

链接: https://arxiv.org/abs/2506.02327
作者: Yijun Yang,Zhao-Yang Wang,Qiuping Liu,Shuwen Sun,Kang Wang,Rama Chellappa,Zongwei Zhou,Alan Yuille,Lei Zhu,Yu-Dong Zhang,Jieneng Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); Johns Hopkins University (约翰霍普金斯大学); The First Affiliated Hospital of Nanjing Medical University (南京医科大学第一附属医院); University of California, San Francisco (加利福尼亚大学旧金山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that visually predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting the F1-score in selecting the optimal TACE protocol by 13%, paving the way for the future integration of medical world models as second readers.
zh

[CV-128] QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

【速读】:该论文旨在解决阿拉伯语文字OCR(光学字符识别)中的固有复杂性问题,包括其连写特性、元音符号(tashkeel)以及多样的字体排版。解决方案的关键在于基于Qwen2-VL-2B-Instruct的视觉-语言模型系列,通过在专门的合成数据集上进行迭代微调,逐步优化以适应阿拉伯语的特殊需求。其中,QARI v0.2模型在带丰富元音符号的文本上实现了当前开源的最佳性能,表现出对tashkeel、多种字体和文档布局的优越处理能力,以及在低分辨率图像上的出色表现。

链接: https://arxiv.org/abs/2506.02295
作者: Ahmed Wasfy,Omer Nacar,Abdelakreem Elkhateb,Mahmoud Reda,Omar Elshehy,Adel Ammar,Wadii Boulila
机构: NAMAA; KAND CA Corp.; Prince Sultan University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The inherent complexities of Arabic script; its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
zh

[CV-129] Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation

【速读】:该论文试图解决知识蒸馏(knowledge distillation)中由于协变量偏移(covariate shift)导致的虚假特征(spurious features)问题,即在训练过程中出现但测试时不存在的特征会严重影响模型的泛化能力。解决方案的关键在于引入一种基于扩散(diffusion-based)的数据增强策略,通过最大化教师模型与学生模型之间的不一致来生成具有挑战性的样本,从而提升学生模型对这些未知虚假特征的鲁棒性。

链接: https://arxiv.org/abs/2506.02294
作者: Niclas Popp,Kevin Alexander Laube,Matthias Hein,Lukas Schott
机构: Bosch Center for Artificial Intelligence (博世人工智能中心); University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst-group and mean-group accuracy on CelebA and SpuCo Birds, as well as the spurious mAUC on spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines.
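A hedged sketch of the selection criterion implied by the abstract: rank candidate augmentations by teacher-student disagreement (here a per-sample KL divergence, our choice) and keep the most disagreed-upon images. The diffusion-guided generation step that maximizes this score is omitted.

```python
# Score candidate augmentations by how much the teacher and student
# distributions differ, then keep the hardest samples for distillation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def disagreement(teacher_logits, student_logits):
    """Per-sample KL(teacher || student)."""
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    return (t.exp() * (t - s)).sum(dim=-1)

@torch.no_grad()
def select_hard_augmentations(images, teacher, student, keep: int):
    scores = disagreement(teacher(images), student(images))
    return images[scores.topk(keep).indices]
```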
zh

[CV-130] Entity Image and Mixed-Modal Image Retrieval Datasets

【速读】:该论文旨在解决多模态学习中缺乏针对结合视觉和文本信息的混合模态图像检索(Mixed-Modal Image Retrieval, MMIR)的挑战性基准问题。其解决方案的关键在于提出一个新的基准测试框架,并构建两个新的数据集:实体图像数据集(Entity Image Dataset, EI)和混合模态图像检索数据集(MMIR)。MMIR数据集包含两种具有挑战性的查询类型,要求模型在给定视觉实体的上下文中对文本描述进行语义定位,从而评估模型对跨模态上下文的深度理解能力。此外,通过众包的人工标注进一步验证了数据集的质量,确保其作为训练和评估混合模态检索任务的有效性。

链接: https://arxiv.org/abs/2506.02291
作者: Cristian-Ioan Blaga,Paul Suganthan,Sahil Dua,Krishna Srinivasan,Enrique Alfonseca,Peter Dornbach,Tom Duerig,Imed Zitouni,Zhe Dong
机构: Google Switzerland(谷歌瑞士); Microsoft AI(微软人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark’s utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: this https URL.
zh

[CV-131] Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction

【速读】:该论文旨在解决从多相机系统中估计代理姿态和三维场景结构的问题,这是具身人工智能应用(如自动驾驶)中的核心任务。现有方法如DUSt3R在多视角设置中表现出色,但它们将图像视为无结构的集合,限制了在同步相机系统具有已知或可推断结构的场景中的效果。解决方案的关键在于引入Rig3R,这是一种对先前多视角重建模型的扩展,能够在有可用相机系统结构时加以利用,并在没有时学习推断该结构。Rig3R通过条件化可选的相机系统元数据(如相机ID、时间戳和系统位姿),构建一个对缺失信息具有鲁棒性的系统感知潜在空间,并联合预测点图和两种类型的射线图,从而实现高效的三维重建与系统结构发现。

链接: https://arxiv.org/abs/2506.02265
作者: Samuel Li,Pujith Kachana,Prajwal Chidananda,Saurabh Nair,Yasutaka Furukawa,Matthew Brown
机构: Wayve Technologies(威沃科技); Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Estimating agent pose and 3D scene structure from multi-camera rigs is a central task in embodied AI applications such as autonomous driving. Recent learned approaches such as DUSt3R have shown impressive performance in multiview settings. However, these models treat images as unstructured collections, limiting effectiveness in scenarios where frames are captured from synchronized rigs with known or inferable structure. To this end, we introduce Rig3R, a generalization of prior multiview reconstruction models that incorporates rig structure when available, and learns to infer it when not. Rig3R conditions on optional rig metadata including camera ID, time, and rig poses to develop a rig-aware latent space that remains robust to missing information. It jointly predicts pointmaps and two types of raymaps: a pose raymap relative to a global frame, and a rig raymap relative to a rig-centric frame consistent across time. Rig raymaps allow the model to infer rig structure directly from input images when metadata is missing. Rig3R achieves state-of-the-art performance in 3D reconstruction, camera pose estimation, and rig discovery, outperforming both traditional and learned methods by 17-45% mAA across diverse real-world rig datasets, all in a single forward pass without post-processing or iterative refinement.
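For intuition about the rig raymap idea, the following sketch computes per-pixel ray origins and directions in a rig-centric frame from camera intrinsics and a camera-to-rig extrinsic; Rig3R's exact parameterization may differ, so treat this as an assumption-level illustration.

```python
# Per-pixel ray origins and directions expressed in a rig-centric frame,
# derived from intrinsics K (3x3) and a camera-to-rig extrinsic (4x4).
import numpy as np

def rig_raymap(K, T_cam_to_rig, h, w):
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], -1)
    dirs_cam = pix.reshape(-1, 3) @ np.linalg.inv(K).T   # unproject pixels
    dirs_cam /= np.linalg.norm(dirs_cam, axis=-1, keepdims=True)
    R, t = T_cam_to_rig[:3, :3], T_cam_to_rig[:3, 3]
    dirs_rig = dirs_cam @ R.T                            # rotate into rig frame
    origins = np.broadcast_to(t, dirs_rig.shape).copy()  # camera center per ray
    return origins.reshape(h, w, 3), dirs_rig.reshape(h, w, 3)
```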
zh

[CV-132] PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss

【速读】:该论文旨在解决第一视角视频中的主动说话人检测(Active Speaker Detection, ASD)问题,该问题在不稳定视角、运动模糊和非屏幕内语音源等条件下传统视觉主导方法性能显著下降。解决方案的关键在于提出PAIR-Net模型,该模型通过集成部分冻结的Whisper音频编码器与微调的AV-HuBERT视觉主干网络,实现跨模态线索的鲁棒融合,并引入模态间对齐损失以平衡模态差异,从而提升多模态一致性与检测性能。

链接: https://arxiv.org/abs/2506.02247
作者: Yu Wang,Juhyung Ha,David J. Crandall
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 1 figure, and 1 table

点击查看摘要

Abstract:Active speaker detection (ASD) in egocentric videos presents unique challenges due to unstable viewpoints, motion blur, and off-screen speech sources - conditions under which traditional visual-centric methods degrade significantly. We introduce PAIR-Net (Pretrained Audio-Visual Integration with Regularization Network), an effective model that integrates a partially frozen Whisper audio encoder with a fine-tuned AV-HuBERT visual backbone to robustly fuse cross-modal cues. To counteract modality imbalance, we introduce an inter-modal alignment loss that synchronizes audio and visual representations, enabling more consistent convergence across modalities. Without relying on multi-speaker context or ideal frontal views, PAIR-Net achieves state-of-the-art performance on the Ego4D ASD benchmark with 76.6% mAP, surpassing LoCoNet and STHG by 8.2% and 12.9% mAP, respectively. Our results highlight the value of pretrained audio priors and alignment-based fusion for robust ASD under real-world egocentric conditions.
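A minimal sketch, assuming the alignment loss penalizes dissimilarity between time-pooled audio and visual embeddings projected into a shared space; PAIR-Net's exact formulation is not spelled out in the abstract, so the pooling and similarity choices here are ours.

```python
# Inter-modal alignment loss sketch: project both streams to a shared
# dimension and penalize 1 - cosine similarity of the pooled embeddings.
import torch.nn as nn
import torch.nn.functional as F

class InterModalAlignment(nn.Module):
    def __init__(self, d_audio: int, d_visual: int, d_shared: int = 256):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_shared)
        self.proj_v = nn.Linear(d_visual, d_shared)

    def forward(self, audio_feats, visual_feats):
        """audio_feats: (B, Ta, Da); visual_feats: (B, Tv, Dv)."""
        a = F.normalize(self.proj_a(audio_feats.mean(dim=1)), dim=-1)
        v = F.normalize(self.proj_v(visual_feats.mean(dim=1)), dim=-1)
        return (1 - (a * v).sum(dim=-1)).mean()   # 1 - cosine similarity
```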
zh

[CV-133] Motion aware video generative model

【速读】:该论文试图解决当前基于扩散的视频生成方法在物理合理性上的不足,即现有方法主要依赖大规模数据集的统计学习,而未显式建模运动的底层物理特性,导致生成视频中出现可感知的非物理伪影,从而降低真实感。解决方案的关键在于提出一种融合物理信息的频域方法,通过分析不同物理运动(平移、旋转、缩放)的频域特征,构建物理运动损失函数和频域增强模块,以优化生成视频对理想频域运动模式的符合度,并在保持原有网络功能的前提下调整视频特征以满足物理约束。

链接: https://arxiv.org/abs/2506.02244
作者: Bowen Xue,Giuseppe Claudio Guarnera,Shuang Zhao,Zahra Montazeri
机构: University of Manchester(曼彻斯特大学); University of York(约克大学); University of California, Irvine(加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in diffusion-based video generation have yielded unprecedented quality in visual content and semantic coherence. However, current approaches predominantly rely on statistical learning from vast datasets without explicitly modeling the underlying physics of motion, resulting in subtle yet perceptible non-physical artifacts that diminish the realism of generated videos. This paper introduces a physics-informed frequency domain approach to enhance the physical plausibility of generated videos. We first conduct a systematic analysis of the frequency-domain characteristics of diverse physical motions (translation, rotation, scaling), revealing that each motion type exhibits distinctive and identifiable spectral signatures. Building on this theoretical foundation, we propose two complementary components: (1) a physical motion loss function that quantifies and optimizes the conformity of generated videos to ideal frequency-domain motion patterns, and (2) a frequency domain enhancement module that progressively learns to adjust video features to conform to physical motion constraints while preserving original network functionality through a zero-initialization strategy. Experiments across multiple video diffusion architectures demonstrate that our approach significantly enhances motion quality and physical plausibility without compromising visual quality or semantic alignment. Our frequency-domain physical motion framework generalizes effectively across different video generation architectures, offering a principled approach to incorporating physical constraints into deep learning-based video synthesis pipelines. This work seeks to establish connections between data-driven models and physics-based motion models.
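The following sketch illustrates one way a frequency-domain motion loss could be formed: compare the normalized temporal spectrum of a motion signal extracted from generated frames against a target spectral signature. Using the mean-intensity trajectory as the signal is our simplification, not the paper's extraction procedure.

```python
# Frequency-domain motion loss sketch over a per-video temporal signal.
import torch

def physical_motion_loss(frames: torch.Tensor, target_spectrum: torch.Tensor):
    """frames: (B, T, C, H, W); target_spectrum: (T // 2 + 1,)."""
    signal = frames.mean(dim=(2, 3, 4))            # (B, T) temporal signal
    spec = torch.fft.rfft(signal, dim=1).abs()     # magnitude spectrum
    spec = spec / (spec.sum(dim=1, keepdim=True) + 1e-8)
    return torch.mean((spec - target_spectrum) ** 2)
```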
zh

[CV-134] Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment CVPR2025

【速读】:该论文旨在解决如何高效地将预训练的扩散模型知识迁移至流匹配(Flow Matching, FM)框架中的关键问题。当前的FM基础模型在微调时计算成本较高,而扩散模型如Stable Diffusion则受益于高效的架构和生态系统支持,但缺乏与FM的有效结合。论文提出的解决方案是Diff2Flow,其关键在于通过重缩放时间步长、对齐插值项以及从扩散预测中推导出与FM兼容的速度场,系统地弥合扩散模型与FM范式的差异,从而实现无需额外计算开销的扩散先验的直接高效FM微调。

链接: https://arxiv.org/abs/2506.02221
作者: Johannes Schusterbauer,Ming Gui,Frank Fundel,Björn Ommer
机构: CompVis @ LMU Munich, MCML
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at this https URL.
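To make the diffusion-to-FM bridge concrete, here is our reading of the core conversion under standard conventions (a diffusion forward process x_t = α_t·x_0 + σ_t·ε, and a linear FM interpolant between data x_0 and noise x_1); the paper's exact rescaling may differ.

```python
# Derive a flow-matching velocity from a diffusion noise prediction, and map
# diffusion timesteps to FM interpolant times.
import torch

def fm_velocity_from_diffusion(x_t, eps_hat, alpha_t, sigma_t):
    x0_hat = (x_t - sigma_t * eps_hat) / alpha_t  # invert the forward process
    x1_hat = eps_hat                              # noise endpoint of the path
    return x1_hat - x0_hat                        # velocity of the linear path

def rescaled_tau(alpha_t, sigma_t):
    """Map a diffusion timestep to FM interpolant time tau in [0, 1]."""
    return sigma_t / (alpha_t + sigma_t)
```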
zh

[CV-135] Is PMBOK Guide the Right Fit for AI? Re-evaluating Project Management in the Face of Artificial Intelligence Projects

【速读】:该论文试图解决传统项目管理知识体系(PMBOK)在人工智能(Artificial Intelligence, AI)软件项目中的适用性问题,指出其在数据管理、迭代开发支持及伦理与跨学科挑战应对方面的不足。解决方案的关键在于整合数据生命周期管理,采用适用于AI的迭代项目管理框架,并将伦理考量嵌入项目规划与执行过程中,以提升AI软件项目的管理效能。

链接: https://arxiv.org/abs/2506.02214
作者: Alexey Burdakov,Max Jaihyun Ahn
机构: 未知
类目: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 1 figure

点击查看摘要

Abstract:This paper critically evaluates the applicability of the Project Management Body of Knowledge (PMBOK) Guide framework to Artificial Intelligence (AI) software projects, highlighting key limitations and proposing tailored adaptations. Unlike traditional projects, AI initiatives rely heavily on complex data, iterative experimentation, and specialized expertise while navigating significant ethical considerations. Our analysis identifies gaps in the PMBOK Guide, including its limited focus on data management, insufficient support for iterative development, and lack of guidance on ethical and multidisciplinary challenges. To address these deficiencies, we recommend integrating data lifecycle management, adopting iterative and AI project management frameworks, and embedding ethical considerations within project planning and execution. Additionally, we explore alternative approaches that better align with AI’s dynamic and exploratory nature. We aim to enhance project management practices for AI software projects by bridging these gaps.
zh

[CV-136] Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos

【速读】:该论文旨在解决在高可靠性要求的消防场景中,现代人工智能系统在环境复杂性(如烟雾、能见度低和结构变形)下感知与推理能力不足的问题。其关键解决方案是提出Fire360数据集,该数据集包含228个360度视频,覆盖多种极端条件,并提供动作片段、物体位置及退化元数据的标注,支持包括Transformed Object Retrieval(TOR)在内的五项任务,以评估模型在退化环境下的变换不变识别能力。

链接: https://arxiv.org/abs/2506.02167
作者: Aditi Tiwari,Farzaneh Masoud,Dac Trong Nguyen,Jill Kraft,Heng Ji,Klara Nahrstedt
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Illinois Fire Service Institute (伊利诺伊消防服务研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Modern AI systems struggle most in environments where reliability is critical - scenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360-degree videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating transformation-invariant recognition. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty. The dataset is available at: this https URL.
zh

[CV-137] Quantifying task-relevant representational similarity using decision variable correlation

【速读】:该论文试图解决深度神经网络与灵长类动物视觉皮层(V4/IT)在决策策略上的相似性问题,特别是针对图像分类任务中两者表征是否具有高度一致性这一争议。其解决方案的关键在于提出了一种新的度量方法——决策变量相关性(Decision Variable Correlation, DVC),该方法通过量化分类任务中单个样本的解码决策之间的相关性,从而捕捉与任务相关的信息,而非仅关注一般性的表征对齐。

链接: https://arxiv.org/abs/2506.02164
作者: Yu (Eric) Qian,Wilson S. Geisler,Xue-Xin Wei
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Previous studies have compared the brain and deep neural networks trained on image classification. Intriguingly, while some suggest that their representations are highly similar, others argued the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the correlation between decoded decisions on individual samples in a classification task and thus can capture task-relevant information rather than general representational alignment. We evaluate this method using monkey V4/IT recordings and models trained on image classification tasks. We find that model–model similarity is comparable to monkey–monkey similarity, whereas model–monkey similarity is consistently lower and, surprisingly, decreases with increasing ImageNet-1k performance. While adversarial training enhances robustness, it does not improve model–monkey similarity in task-relevant dimensions; however, it markedly increases model–model similarity. Similarly, pre-training on larger datasets does not improve model–monkey similarity. These results suggest a fundamental divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.
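A minimal sketch of how DVC could be computed for a binary task, assuming linear decoders and Pearson correlation of held-out decision values; the decoder family and the train/test split are our assumptions, not the paper's protocol.

```python
# Fit one decoder per observer, then correlate their per-sample decision
# variables on held-out stimuli.
import numpy as np
from sklearn.linear_model import LogisticRegression

def dvc(feats_a, feats_b, labels, feats_a_test, feats_b_test):
    """feats_*: (N, D) responses of two observers to the same stimuli."""
    dec_a = LogisticRegression(max_iter=1000).fit(feats_a, labels)
    dec_b = LogisticRegression(max_iter=1000).fit(feats_b, labels)
    dv_a = dec_a.decision_function(feats_a_test)  # signed decision variables
    dv_b = dec_b.decision_function(feats_b_test)
    return np.corrcoef(dv_a, dv_b)[0, 1]
```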
zh

[CV-138] TIIF-Bench: How Does Your T2I Model Follow Your Instructions?

【速读】:该论文试图解决现有Text-to-Image (T2I)模型评估基准在提示多样性与复杂性不足以及评估指标粗略的问题,从而难以准确评估文本指令与生成图像之间的细粒度对齐性能。其解决方案的关键在于提出TIIF-Bench,一个包含5000个按多个维度组织的提示集,涵盖不同难度和复杂度,并为每个提示提供短版和长版以评估模型对不同提示长度的鲁棒性;同时引入文本渲染和风格控制两个关键属性,以评估文本合成精度和图像美学一致性,并利用大规模视觉语言模型中的世界知识构建一种新型可计算框架,以捕捉T2I模型输出中的细微差异。

链接: https://arxiv.org/abs/2506.02161
作者: Xinyu Wei,Jinrui Zhang,Zeqing Wang,Hongyang Wei,Zhen Guo,Lei Zhang
机构: Hong Kong Polytechnic University (香港理工大学); Sun Yat-sen University (中山大学); Tsinghua University (清华大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 12 figures, 11 tables

点击查看摘要

Abstract:The rapid advancements of Text-to-Image (T2I) models have ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I model evaluation benchmarks fall short in limited prompt diversity and complexity, as well as coarse evaluation metrics, making it difficult to evaluate the fine-grained alignment performance between textual instructions and generated images. In this paper, we present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models’ ability to interpret and follow intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, which are categorized into three levels of difficulty and complexity. To rigorously evaluate model robustness to varying prompt lengths, we provide a short and a long version of each prompt with identical core semantics. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models. In addition, we collect 100 high-quality designer-level prompts that encompass various scenarios to comprehensively assess model performance. Leveraging the world knowledge encoded in large vision language models, we propose a novel computable framework to discern subtle variations in T2I model outputs. Through meticulous benchmarking of mainstream T2I models on TIIF-Bench, we analyze the pros and cons of current T2I models and reveal the limitations of current T2I benchmarks. Project Page: this https URL.
zh

[CV-139] Implicit Deformable Medical Image Registration with Learnable Kernels MICCAI2025

【速读】:该论文旨在解决可变形医学图像配准中因AI方法产生的不可靠形变而限制其临床应用的问题。其解决方案的关键在于将图像配准重新表述为信号重建问题,通过学习一个核函数从稀疏关键点对应关系中恢复密集位移场,并在一种新颖的分层架构中以粗到细的方式估计位移场,从而实现准确且可靠的形变预测。

链接: https://arxiv.org/abs/2506.02150
作者: Stefano Fogarollo,Gregor Laimer,Reto Bale,Matthias Harders
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: MICCAI 2025 Provisional Accept

点击查看摘要

Abstract:Deformable medical image registration is an essential task in computer-assisted interventions. This problem is particularly relevant to oncological treatments, where precise image alignment is necessary for tracking tumor growth, assessing treatment response, and ensuring accurate delivery of therapies. Recent AI methods can outperform traditional techniques in accuracy and speed, yet they often produce unreliable deformations that limit their clinical adoption. In this work, we address this challenge and introduce a novel implicit registration framework that can predict accurate and reliable deformations. Our insight is to reformulate image registration as a signal reconstruction problem: we learn a kernel function that can recover the dense displacement field from sparse keypoint correspondences. We integrate our method in a novel hierarchical architecture, and estimate the displacement field in a coarse-to-fine manner. Our formulation also allows for efficient refinement at test time, permitting clinicians to easily adjust registrations when needed. We validate our method on challenging intra-patient thoracic and abdominal zero-shot registration tasks, using public and internal datasets from the local University Hospital. Our method not only shows competitive accuracy to state-of-the-art approaches, but also bridges the generalization gap between implicit and explicit registration techniques. In particular, our method generates deformations that better preserve anatomical relationships and matches the performance of specialized commercial systems, underscoring its potential for clinical adoption.
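For intuition about the signal-reconstruction view, the sketch below interpolates a dense displacement field from sparse keypoint correspondences with a fixed Gaussian kernel; in the paper the kernel is learned, so this is only an assumption-level illustration of the idea.

```python
# Nadaraya-Watson style interpolation of a dense displacement field from
# sparse keypoint displacements.
import torch

def dense_displacement(query_pts, key_pts, key_disp, bandwidth: float = 0.1):
    """query_pts: (Q, 3); key_pts: (K, 3); key_disp: (K, 3)."""
    d2 = torch.cdist(query_pts, key_pts) ** 2              # squared distances
    w = torch.softmax(-d2 / (2 * bandwidth ** 2), dim=-1)  # kernel weights
    return w @ key_disp                                    # (Q, 3) dense field
```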
zh

[CV-140] SAB3R: Semantic-Augmented Backbone in 3D Reconstruction

【速读】:该论文试图解决将开放词汇分割(open-vocabulary segmentation)与三维重建(3D reconstruction)统一到一个任务中的问题,即“Map and Locate”任务,该任务旨在从未校准的视频中生成点云并基于开放词汇查询分割物体实例。解决方案的关键在于提出一种简单而有效的基线模型SAB3R,该模型基于MASt3R,并引入轻量级知识蒸馏策略,将2D视觉主干(如CLIP和DINOv2)的密集像素语义特征迁移至MASt3R,从而在单次前向传播中生成像素级语义特征并构建连贯的点云地图,实现了比独立部署MASt3R和CLIP更优的性能。

链接: https://arxiv.org/abs/2506.02112
作者: Xuweiyi Chen,Tian Xia,Sihan Xu,Jianing Yang,Joyce Chai,Zezhou Cheng
机构: University of Virginia (弗吉尼亚大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open-vocabulary segmentation - detecting and segmenting object instances based on natural language queries - and 3D reconstruction, the process of estimating a scene’s 3D structure from visual inputs. Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries. This task serves as a critical step toward real-world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization. To tackle this task, we introduce a simple yet effective baseline, which we denote as SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy. This method transfers dense, per-pixel semantic features from 2D vision backbones (eg, CLIP and DINOv2) to enhance MASt3R’s capabilities. Without introducing any auxiliary frozen networks, our model generates per-pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to separately deploying MASt3R and CLIP, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. Furthermore, we evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.
zh

[CV-141] Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

【速读】:该论文试图解决语言与视觉模态之间的对齐(alignment)问题,特别是在多模态数据日益详细和复杂的情况下,传统方法依赖于收集人类或AI的偏好,这通常成本高且耗时。其解决方案的关键在于利用循环一致性(cycle consistency)作为监督信号,通过将生成的文本映射回图像空间并计算原始图像与重构图像的相似性,以及将输入的文本描述重构并与原始描述进行文本相似性比较,从而构建一个包含866K对比对的偏好数据集。该方法在详细描述任务中优于现有对齐指标,并在作为最佳N采样验证器时展现出更优的推理时间可扩展性。

链接: https://arxiv.org/abs/2506.02095
作者: Hyojin Bahng,Caroline Chan,Fredo Durand,Phillip Isola
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are at this https URL
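Schematically, the image-side cycle score can be written as below; `captioner`, `t2i_model`, and `image_encoder` are hypothetical callables standing in for the actual models, and the text-side cycle is symmetric (caption similarity instead of image similarity).

```python
# Cycle-consistency reward: caption an image, regenerate an image from the
# caption, and score the caption by image-image similarity.
import torch.nn.functional as F

def image_cycle_score(image, captioner, t2i_model, image_encoder):
    caption = captioner(image)            # image -> text
    recon = t2i_model(caption)            # text -> image
    z = F.normalize(image_encoder(image), dim=-1)
    z_rec = F.normalize(image_encoder(recon), dim=-1)
    return (z * z_rec).sum(-1)            # cosine similarity as the reward
```

Candidates with higher cycle scores are the preferred side of a comparison pair, which is our paraphrase of the ranking step used to build the preference dataset.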
zh

[CV-142] Robust Federated Learning against Noisy Clients via Masked Optimization

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中由于客户端提供的标注数据存在复杂标签噪声而导致的模型性能下降问题。其关键解决方案是提出一种两阶段优化框架——MaskedOptim,第一阶段通过检测具有较高标签噪声率的客户端来识别噪声源,第二阶段则通过端到端的标签校正机制修正这些客户端的数据标签,从而减轻数据中的错误信息对模型训练的负面影响。此外,该框架采用基于几何中位数的模型聚合方法以提升训练鲁棒性。

链接: https://arxiv.org/abs/2506.02079
作者: Xuefeng Jiang,Tian Wen,Zhiqin Yang,Lvhua Wu,Yufeng Chen,Sheng Sun,Yuwei Wang,Min Liu
机构: Institute of Computing Technology, Chinese Academy of Sciences & University of Chinese Academy of Sciences; Department of Electronic Engineering, The Chinese University of Hong Kong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Under review

点击查看摘要

Abstract:In recent years, federated learning (FL) has made significant advances in privacy-sensitive applications. However, it can be hard to ensure that FL participants provide well-annotated data for training. The corresponding annotations from different clients often contain complex label noise at varying levels. This label noise issue has a substantial impact on the performance of the trained models, and clients with greater noise levels are largely responsible for this degradation. To this end, it is necessary to develop an effective optimization strategy to alleviate the adverse effects of these noisy clients. In this study, we present a two-stage optimization framework, MaskedOptim, to address this intricate label noise problem. The first stage is designed to facilitate the detection of noisy clients with higher label noise rates. The second stage focuses on rectifying the labels of the noisy clients’ data through an end-to-end label correction mechanism, aiming to mitigate the negative impacts caused by misinformation within datasets. This is achieved by learning the potential ground-truth labels of the noisy clients’ datasets via backpropagation. To further enhance the training robustness, we apply geometric-median-based model aggregation instead of the commonly used vanilla averaged model aggregation. We implement sixteen related methods and conduct evaluations on three image datasets and one text dataset with diverse label noise patterns for a comprehensive comparison. Extensive experimental results indicate that our proposed framework shows its robustness in different scenarios. Additionally, our label correction framework effectively enhances the data quality of the detected noisy clients’ local datasets. Our codes are available via this https URL.
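The geometric-median aggregation mentioned above can be approximated with standard Weiszfeld iterations, sketched below over flattened client updates; the paper's exact variant may differ.

```python
# Weiszfeld iterations for a geometric-median aggregate of client models.
import torch

def geometric_median(updates: torch.Tensor, iters: int = 50, eps: float = 1e-8):
    """updates: (num_clients, num_params); returns the aggregated parameters."""
    median = updates.mean(dim=0)                       # initialize at the mean
    for _ in range(iters):
        dists = torch.norm(updates - median, dim=1).clamp_min(eps)
        weights = 1.0 / dists                          # down-weight outliers
        median = (weights[:, None] * updates).sum(0) / weights.sum()
    return median
```

Because each client's influence is inversely proportional to its distance from the current estimate, a few heavily corrupted updates move the aggregate far less than they would under plain averaging.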
zh

[CV-143] EWGN: Elastic Weight Generation and Context Switching in Deep Learning

【速读】:该论文试图解决神经网络在持续学习过程中面临的灾难性遗忘问题(catastrophic forgetting),即在学习新任务时对先前学习任务的性能下降。解决方案的关键在于提出弹性权重生成网络(Elastic Weight Generative Networks, EWGN),该架构通过一个额外的网络动态生成主网络的权重,并在巩固已学习权重的同时实现输入依赖的权重生成,从而支持上下文切换,缓解任务间的干扰。

链接: https://arxiv.org/abs/2506.02065
作者: Shriraj P. Sawant,Krishna P. Miyapuram
机构: Indian Institute of Technology Gandhinagar (印度理工学院甘地纳格尔分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ability to learn and retain a wide variety of tasks is a hallmark of human intelligence that has inspired research in artificial general intelligence. Continual learning approaches provide a significant step towards achieving this goal. It has been known that task variability and context switching are challenging for learning in neural networks. Catastrophic forgetting refers to poor performance on the retention of a previously learned task when a new task is being learned. Switching between different task contexts can be a useful approach to mitigate this, by preventing interference between the varying task weights of the network. This paper introduces Elastic Weight Generative Networks (EWGN) as an idea for context switching between two different tasks. The proposed EWGN architecture uses an additional network that generates the weights of the primary network dynamically while consolidating the weights learned. The weight generation is input-dependent and thus enables context switching. Using standard computer vision datasets, namely MNIST and fashion-MNIST, we analyse the retention of previously learned task representations in Fully Connected Networks, Convolutional Neural Networks, and EWGN architectures with Stochastic Gradient Descent and Elastic Weight Consolidation learning algorithms. Understanding dynamic weight generation and context-switching ability can be useful in enabling continual learning for improved performance.
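A toy sketch of input-dependent weight generation, our reading of the EWGN idea: a small generator network emits the weights of a one-layer primary classifier per input, which is what enables context switching. Dimensions and architecture here are illustrative only.

```python
# Hypernetwork-style toy: a context encoder drives a generator that produces
# the primary layer's weight matrix for each input.
import torch
import torch.nn as nn

class EWGNToy(nn.Module):
    def __init__(self, d_in: int = 784, d_out: int = 10, d_ctx: int = 64):
        super().__init__()
        self.context = nn.Sequential(nn.Linear(d_in, d_ctx), nn.ReLU())
        self.weight_gen = nn.Linear(d_ctx, d_in * d_out)  # generates W per input
        self.d_in, self.d_out = d_in, d_out

    def forward(self, x):                          # x: (B, d_in)
        W = self.weight_gen(self.context(x)).view(-1, self.d_out, self.d_in)
        return torch.einsum("boi,bi->bo", W, x)    # per-sample generated layer
```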
zh

[CV-144] Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉感知方面的瓶颈问题,特别是其在处理涉及视觉元素的推理任务时,尽管能够给出正确答案,却可能对关键视觉信息产生误解,从而掩盖了模型的潜在缺陷。为系统性地评估和改进MLLMs的视觉能力,研究者提出了一个名为“Do You See Me”的可扩展基准测试集,包含1,758张图像和2,612个问题,涵盖七个受人类心理学启发的二维与三维子任务,并具备可控复杂度以严格评估MLLMs的视觉技能。该解决方案的关键在于通过设计具有挑战性的视觉推理任务,揭示MLLMs在视觉注意力分配、细粒度细节内部表征稳定性等方面存在的根本性问题,从而推动开发更具鲁棒性的视觉感知能力的MLLMs。

链接: https://arxiv.org/abs/2506.02022
作者: Aditya Kanade,Tanuja Ganu
机构: Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce “Do You See Me”, a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at this https URL.
zh

[CV-145] Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics

【速读】:该论文试图解决视频数据集中的冗余问题,尤其是在视频数据集中由于时间信息和不同类别间冗余水平差异所带来的挑战。现有数据集蒸馏(Dataset Distillation, DD)方法假设所有视频语义具有统一的时间冗余水平,这限制了其在视频数据集上的效果。解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的动态感知视频蒸馏方法(Dynamic-Aware Video Distillation, DAViD),通过预测合成视频的最佳时间分辨率,并引入教师在环奖励函数来更新RL代理策略,从而实现对视频语义的自适应时间分辨率调整。

链接: https://arxiv.org/abs/2506.02021
作者: Yinjie Zhao,Heng Zhao,Bihan Wen,Yew-Soon Ong,Joey Tianyi Zhou
机构: CFAR, Agency for Science, Technology and Research (ASTAR), Singapore; IHPC, Agency for Science, Technology and Research (ASTAR), Singapore; School of EEE, Nanyang Technological University, Singapore; CCDS, Nanyang Technological University, Singapore
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development of vision tasks and the scaling of datasets and models, redundancy reduction in vision datasets has become a key area of research. To address this issue, dataset distillation (DD) has emerged as a promising approach to generating highly compact synthetic datasets with significantly less redundancy while preserving essential information. However, while DD has been extensively studied for image datasets, DD on video datasets remains underexplored. Video datasets present unique challenges due to the presence of temporal information and varying levels of redundancy across different classes. Existing DD approaches assume a uniform level of temporal redundancy across all different video semantics, which limits their effectiveness on video datasets. In this work, we propose Dynamic-Aware Video Distillation (DAViD), a Reinforcement Learning (RL) approach to predict the optimal Temporal Resolution of the synthetic videos. A teacher-in-the-loop reward function is proposed to update the RL agent policy. To the best of our knowledge, this is the first study to introduce adaptive temporal resolution based on video semantics in video dataset distillation. Our approach significantly outperforms existing DD methods, demonstrating substantial improvements in performance. This work paves the way for future research on more efficient and semantic-adaptive video dataset distillation.
zh

[CV-146] Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying

【速读】:该论文旨在解决多模态大语言模型(MLLM)中对比学习框架下难负样本(hard negative samples)的有效挖掘与利用问题,以提升模型的多模态嵌入性能。其解决方案的关键在于对info-NCE损失函数相对于查询、正样本和负样本的梯度进行详细分析,揭示难负样本在模型参数更新中的作用,并通过显式放大与难负样本相关的梯度,促使模型学习更具判别性的嵌入表示。基于此方法的多模态嵌入模型在MMEB基准测试中取得了最先进的性能。

链接: https://arxiv.org/abs/2506.02020
作者: Youze Xue,Dian Li,Gang Liu
机构: Tencent QQ (腾讯QQ)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on this https URL.
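One simple way to realize explicit hard-negative gradient amplification, sketched under our own assumptions (top-k hardness, a fixed scale factor), is to scale the logits of the hardest negatives inside the info-NCE loss, which proportionally enlarges their gradient contribution; the paper's weighting scheme may be more refined.

```python
# Info-NCE with amplified hard-negative logits; the positive sits at index 0.
import torch
import torch.nn.functional as F

def amplified_infonce(q, pos, negs, tau=0.05, gamma=2.0, k=8):
    """q, pos: (B, D); negs: (B, N, D); embeddings assumed L2-normalized."""
    pos_logit = (q * pos).sum(-1, keepdim=True) / tau        # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", q, negs) / tau   # (B, N)
    scale = torch.ones_like(neg_logits)
    scale.scatter_(1, neg_logits.topk(k, dim=-1).indices, gamma)
    logits = torch.cat([pos_logit, neg_logits * scale], dim=-1)
    target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, target)
```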
zh

[CV-147] Fairness through Feedback: Addressing Algorithmic Misgendering in Automatic Gender Recognition

【速读】:该论文试图解决自动性别识别(Automatic Gender Recognition, AGR)系统中存在的概念性与实践性问题,特别是其基于二元性别假设的分类方式与现实性别多样性的矛盾。现有AGR系统通常依据可观测特征进行分类,而这些特征与生理性别(sex)或社会性别(gender)之间的关联性有限,导致系统在非二元及性别表达不符个体中的可靠性不足。解决方案的关键在于对AGR系统的理论与实践进行重新思考,区分性别(gender)、生理性别(sex)与性别表达(gender expression),并引入用户反馈机制以允许对系统输出进行修正,从而提升系统的公平性与对个体自我表达权利的尊重。

链接: https://arxiv.org/abs/2506.02017
作者: Camilla Quaresmini,Giacomo Zanotti
机构: Politecnico di Milano (米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic Gender Recognition (AGR) systems are an increasingly widespread application in the Machine Learning (ML) landscape. While these systems are typically understood as detecting gender, they often classify datapoints based on observable features correlated at best with either male or female sex. In addition to questionable binary assumptions, from an epistemological point of view, this is problematic for two reasons. First, there exists a gap between the categories the system is meant to predict (woman versus man) and those onto which their output reasonably maps (female versus male). What is more, gender cannot be inferred on the basis of such observable features. This makes AGR tools often unreliable, especially in the case of non-binary and gender non-conforming people. We suggest a theoretical and practical rethinking of AGR systems. To begin, distinctions are made between sex, gender, and gender expression. Then, we build upon the observation that, unlike algorithmic misgendering, human-human misgendering is open to the possibility of re-evaluation and correction. We suggest that analogous dynamics should be recreated in AGR, giving users the possibility to correct the system’s output. While implementing such a feedback mechanism could be regarded as diminishing the system’s autonomy, it represents a way to significantly increase fairness levels in AGR. This is consistent with the conceptual change of paradigm that we advocate for AGR systems, which should be understood as tools respecting individuals’ rights and capabilities of self-expression and determination.
zh

[CV-148] Are classical deep neural networks weakly adversarially robust?

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNNs)在面对对抗样本时的鲁棒性不足问题,以及传统对抗训练方法计算开销大的缺陷。其解决方案的关键在于利用逐层特征的聚类特性及渐进前向崩溃(Progressive Feedforward Collapse, PFC)现象,通过构建特征路径并计算示例特征路径与类别中心特征路径之间的相关性,实现对抗样本检测与图像识别的联合优化。该方法在不依赖高计算成本防御策略的情况下,展现出优于传统对抗训练的性能平衡。

链接: https://arxiv.org/abs/2506.02016
作者: Nuolin Sun,Linyuan Wang,Dongyang Li,Bin Yan,Lei Li
机构: Information Engineering University (信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adversarial attacks have received increasing attention and it has been widely recognized that classical DNNs have weak adversarial robustness. The most commonly used adversarial defense method, adversarial training, improves the adversarial accuracy of DNNs by generating adversarial examples and retraining the model. However, adversarial training requires a significant computational overhead. In this paper, inspired by existing studies focusing on the clustering properties of DNN output features at each layer and the Progressive Feedforward Collapse phenomenon, we propose a method for adversarial example detection and image recognition that uses layer-wise features to construct feature paths and computes the correlation between an example's feature path and the class-centered feature paths. Experimental results show that the recognition method achieves 82.77% clean accuracy and 44.17% adversarial accuracy on ResNet-20 with PFC. Compared to the adversarial training method, with 77.64% clean accuracy and 52.94% adversarial accuracy, our method exhibits a trade-off without relying on computationally expensive defense strategies. Furthermore, on the standard ResNet-18, our method maintains this advantage with respective metrics of 80.01% and 46.1%. This result reveals inherent adversarial robustness in DNNs, challenging the conventional understanding of the weak adversarial robustness of DNNs.
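A minimal sketch of the feature-path idea, under our assumptions about pooling and the similarity measure: concatenate pooled per-layer activations into a path vector and compare it against class-centered paths with cosine similarity; a low maximum similarity can then flag a potential adversarial example.

```python
# Build per-sample feature paths and compare them to class-mean paths.
import torch
import torch.nn.functional as F

def feature_path(feats_per_layer):
    """feats_per_layer: list of (B, C_l, H_l, W_l) activations."""
    pooled = [f.mean(dim=(2, 3)) for f in feats_per_layer]  # (B, C_l) each
    return torch.cat(pooled, dim=1)                         # (B, sum C_l)

def path_similarity(paths, class_center_paths):
    """paths: (B, D); class_center_paths: (K, D) mean path per class."""
    p = F.normalize(paths, dim=-1)
    c = F.normalize(class_center_paths, dim=-1)
    return p @ c.T   # (B, K); argmax for recognition, max for detection
```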
zh

[CV-149] Object-centric Self-improving Preference Optimization for Text-to-Image Generation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉理解,尤其是文本到图像生成任务中的不足。其解决方案的关键在于提出一种基于对象中心的自我改进偏好优化(Object-centric Self-improving Preference Optimization, OSPO)框架,该框架通过利用MLLM自身的推理能力,无需依赖外部数据集或模型,重点构建高质量的偏好对数据,从而提升偏好优化的效果。OSPO引入了自我改进机制,通过对象中心的提示扰动、密集化和视觉问答(VQA)评分,自动生成对象级别的对比偏好对,有效消除了传统方法中常见的模糊或不均衡变化。

链接: https://arxiv.org/abs/2506.02015
作者: Yoonjin Oh,Yongjin Kim,Hyomin Kim,Donghwan Chi,Sungwoong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have significantly improved both image understanding and generation capabilities. Despite these improvements, MLLMs still struggle with fine-grained visual comprehension, particularly in text-to-image generation tasks. While preference optimization methods have been explored to address these limitations in image understanding tasks, their application to image generation remains largely underexplored. To address this gap, we propose an Object-centric Self-improving Preference Optimization (OSPO) framework designed for text-to-image generation by MLLMs. OSPO leverages the intrinsic reasoning abilities of MLLMs without requiring any external datasets or models. OSPO emphasizes the importance of high-quality preference pair data, which is critical for effective preference optimization. To achieve this, it introduces a self-improving mechanism that autonomously constructs object-level contrastive preference pairs through object-centric prompt perturbation, densification and VQA scoring. This process eliminates ambiguous or disproportionate variations commonly found in naively generated preference pairs, thereby enhancing the effectiveness of preference optimization. We validate OSPO on three representative compositional text-to-image benchmarks, demonstrating substantial performance gains over baseline models.
zh

[CV-150] Research on Driving Scenario Technology Based on Multimodal Large Lauguage Model Optimization

【速读】:该论文旨在解决在自动驾驶和辅助驾驶技术中,如何提升模型对复杂驾驶场景的理解能力问题。其解决方案的关键在于提出了一种针对多模态模型的系统性优化方法,包括动态提示优化、数据集构建、模型训练与部署。其中,动态提示优化通过根据输入图像内容调整提示,增强模型对影响本车对象的关注度,从而提升任务特定的感知与判断能力;同时,结合真实与合成数据构建高质量多模态数据集,提升模型在复杂环境中的泛化能力,并通过知识蒸馏、动态微调和量化等技术降低计算与存储成本,实现性能与效率的平衡。

链接: https://arxiv.org/abs/2506.02014
作者: Wang Mengjie,Zhu Huiping,Li Jian,Shi Wenxiu,Zhang Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the advancement of autonomous and assisted driving technologies, higher demands are placed on the ability to understand complex driving scenarios. Multimodal general large models have emerged as a solution for this challenge. However, applying these models in vertical domains involves difficulties such as data collection, model training, and deployment optimization. This paper proposes a comprehensive method for optimizing multimodal models in driving scenarios, including cone detection, traffic light recognition, speed limit recommendation, and intersection alerts. The method covers key aspects such as dynamic prompt optimization, dataset construction, model training, and deployment. Specifically, the dynamic prompt optimization adjusts the prompts based on the input image content to focus on objects affecting the ego vehicle, enhancing the model’s task-specific focus and judgment capabilities. The dataset is constructed by combining real and synthetic data to create a high-quality and diverse multimodal training dataset, improving the model’s generalization in complex driving environments. In model training, advanced techniques like knowledge distillation, dynamic fine-tuning, and quantization are integrated to reduce storage and computational costs while boosting performance. Experimental results show that this systematic optimization method not only significantly improves the model’s accuracy in key tasks but also achieves efficient resource utilization, providing strong support for the practical application of driving scenario perception technologies.
zh

[CV-151] Leverag ing Large Language Models in Visual Speech Recognition: Model Scaling Context-Aware Decoding and Iterative Polishing

【速读】:该论文旨在解决如何有效利用大型语言模型(Large Language Models, LLMs)以提升视觉语音识别(Visual Speech Recognition, VSR)性能的问题。其关键解决方案包括:通过规模测试验证LLMs尺寸对VSR性能的影响,确认了VSR任务中的缩放定律;引入上下文感知解码机制,通过添加上下文文本引导LLMs解码以提高识别准确率;提出迭代精炼方法,通过多次优化LLMs输出逐步减少识别错误,从而充分挖掘LLMs在VSR任务中的潜力。

链接: https://arxiv.org/abs/2506.02012
作者: Zehua Liu,Xiaolou Li,Li Guo,Lantian Li,Dong Wang
机构: School of Artificial Intelligence (人工智能学院); Center for Speech and Language Technologies (语音与语言技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and provides three key contributions: (1) Scaling Test: We study how the LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide the LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that by these designs, the great potential of LLMs can be largely harnessed, leading to significant VSR performance improvement.
zh

[CV-152] OASIS: Online Sample Selection for Continual Visual Instruction Tuning

【速读】:该论文旨在解决持续视觉指令微调(Continual Visual Instruction Tuning, CVIT)场景中因大规模数据带来的训练延迟问题,以及现有数据选择策略依赖预训练参考模型导致的不适用性问题。其解决方案的关键在于提出OASIS(Online Adaptive Sample Selection),该方法通过动态调整每批次选择的样本数量以适应批次间的相对信息量,并通过迭代更新选择分数来减少所选样本的冗余性,从而在保证性能的同时显著降低数据使用量。

链接: https://arxiv.org/abs/2506.02011
作者: Minjae Lee,Minhyuk Seo,Tingyu Qu,Tinne Tuytelaars,Jonghyun Choi
机构: Seoul National University (首尔国立大学); KU Leuven (鲁汶大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In continual visual instruction tuning (CVIT) scenarios, where multi-modal data continuously arrive in an online streaming manner, training delays from large-scale data significantly hinder real-time adaptation. While existing data selection strategies reduce training overheads, they rely on pre-trained reference models, which are impractical in CVIT setups due to unknown future data. Recent reference model-free online sample selection methods address this issue but typically select a fixed number of samples per batch (e.g., top-k), causing them to suffer from distribution shifts where informativeness varies across batches. To address these limitations, we propose OASIS, an adaptive online sample selection approach for CVIT that: (1) dynamically adjusts selected samples per batch based on relative inter-batch informativeness, and (2) minimizes redundancy of selected samples through iterative selection score updates. Empirical results across various MLLMs, such as LLaVA-1.5 and Qwen-VL-2.5, show that OASIS achieves comparable performance to full-data training using only 25% of the data and outperforms the state-of-the-art.
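The following sketch abstracts the adaptive selection rule: instead of a fixed top-k, the number of samples kept per batch scales with the batch's informativeness relative to recent history. The scoring function and the iterative redundancy-reducing score updates are omitted; all names and constants here are our own.

```python
# Adaptive per-batch sample budget based on relative inter-batch informativeness.
import torch

def adaptive_select(scores: torch.Tensor, history_mean: float,
                    base_frac: float = 0.25):
    """scores: (B,) per-sample informativeness for the current batch."""
    rel = scores.mean().item() / max(history_mean, 1e-8)  # inter-batch ratio
    frac = min(max(base_frac * rel, 0.05), 1.0)           # clamp the budget
    k = max(int(scores.numel() * frac), 1)
    keep = scores.topk(k).indices                         # samples to train on
    return keep, scores.mean().item()                     # statistic for history
```

In practice, `history_mean` would be maintained as, e.g., an exponential moving average over recent batches, so that unusually informative batches receive a larger training budget.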
zh

[CV-153] CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge INTERSPEECH2025

【速读】:该论文旨在解决中国大词汇量连续视觉语音识别(Chinese Large Vocabulary Continuous Visual Speech Recognition, LVC-VSR)领域的技术挑战,通过提升识别准确性和鲁棒性来推动该领域的发展。其解决方案的关键在于引入更强的基线系统以及新增数据集CN-CVS2-P1以增加数据量和多样性,同时在数据预处理、特征提取、模型设计和训练策略等方面实现了多项重要创新。

链接: https://arxiv.org/abs/2506.02010
作者: Zehua Liu,Xiaolou Li,Chen Chen,Lantian Li,Dong Wang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: to be published in INTERSPEECH 2025

点击查看摘要

Abstract:This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor CNVSRC 2023, which involves CN-CVS for training and CNVSRC-Single/Multi for development and evaluation. However, CNVSRC 2024 introduced two key improvements: (1) a stronger baseline system, and (2) an additional dataset, CN-CVS2-P1, for open tracks to improve data volume and diversity. The new challenge has demonstrated several important innovations in data preprocessing, feature extraction, model design, and training strategies, further pushing the state-of-the-art in Chinese LVC-VSR. More details and resources are available at the official website.

[CV-154] Johnny: Structuring Representation Space to Enhance Machine Abstract Reasoning Ability

【Quick Read】: This paper aims to enhance the abstract reasoning ability of artificial intelligence, particularly on Raven's Progressive Matrices (RPM) tasks involving complex human concepts. Traditional end-to-end RPM-solving models rely heavily on the configuration of the option pool, and this dependency constrains their reasoning ability. The key to the proposed solution is the Johnny architecture, a representation space-based RPM-solving framework in which a Representation Extraction Module and a Reasoning Module operate in synergy, supplementing the primitive negative option configurations with a learned representation space and thereby substantially improving reasoning performance.

Link: https://arxiv.org/abs/2506.01970
Authors: Ruizhuo Song, Beiming Yuan
Affiliations: University of Science and Technology Beijing
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 15 figures, 5 tables

Abstract:This paper thoroughly investigates the challenges of enhancing AI’s abstract reasoning capabilities, with a particular focus on Raven’s Progressive Matrices (RPM) tasks involving complex human-like concepts. Firstly, it dissects the empirical reality that traditional end-to-end RPM-solving models heavily rely on option pool configurations, highlighting that this dependency constrains the model’s reasoning capabilities. To address this limitation, the paper proposes the Johnny architecture - a novel representation space-based framework for RPM-solving. Through the synergistic operation of its Representation Extraction Module and Reasoning Module, Johnny significantly enhances reasoning performance by supplementing primitive negative option configurations with a learned representation space. Furthermore, to strengthen the model’s capacity for capturing positional relationships among local features, the paper introduces the Spin-Transformer network architecture, accompanied by a lightweight Straw Spin-Transformer variant that reduces computational overhead through parameter sharing and attention mechanism optimization. Experimental evaluations demonstrate that both Johnny and Spin-Transformer achieve superior performance on RPM tasks, offering innovative methodologies for advancing AI’s abstract reasoning capabilities.

[CV-155] Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

【Quick Read】: This paper examines the spurious states that can arise during the memorization-to-generalization process of diffusion models; these states lead to inaccurate generations. The key to its solution is to view diffusion models as associative memory (AM) systems: by analyzing the dynamics of their training and generation phases, the paper characterizes the attractor states formed at different data scales, yielding a theoretical prediction of the existence of spurious states together with empirical validation.

Link: https://arxiv.org/abs/2505.21777
Authors: Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov
Affiliations: Rensselaer Polytechnic Institute; Jheronimus Academy of Data Science, Tilburg University; University of Rome Sapienza; MIT-IBM Watson AI Lab, IBM Research
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Comments:

Abstract:Hopfield networks are associative memory (AM) systems, designed for storing and retrieving patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches its critical memory load: spurious states, or unintended stable points, emerge at the end of the retrieval dynamics, leading to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of a diffusion model is conceptualized as memory encoding (training data is stored in the memory). The generation phase is viewed as an attempt at memory retrieval. In the small data regime, the diffusion model exhibits a strong memorization phase, where the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears in which increasing the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition; they are emergent attractor states that are absent in the training set but nonetheless have distinct basins of attraction around them. Our findings provide a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of AMs, a theoretical prediction of the existence of spurious states, and empirical validation of this prediction in commonly used diffusion models.

[CV-156] Simulate Any Radar: Attribute-Controllable Radar Simulation via Waveform Parameter Embedding

【Quick Read】: This paper addresses the lack of controllability and efficiency in radar data generation, particularly the need for diverse, high-quality radar data in autonomous driving applications. The key to its solution is SA-Radar (Simulate Any Radar), which unifies the generative and physics-based simulation paradigms through a waveform-parameterized attribute embedding. Combined with the ICFAR-Net architecture, it efficiently generates range-azimuth-Doppler (RAD) tensors conditioned on customizable radar attributes, avoiding the need for detailed hardware specifications and supporting simulation from novel sensor viewpoints and in edited scenes.

Link: https://arxiv.org/abs/2506.03134
Authors: Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao
Affiliations: BUAA (Beihang University); BJTU (Beijing Jiaotong University); BIT (Beijing Institute of Technology); AIR, THU (Institute for AI Industry Research, Tsinghua University); NTU; SVM, THU (Vision and Media Lab, Tsinghua University); Lightwheel AI; LeddarTech
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL Project page: this https URL

Abstract:We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via waveform parameters, which captures signal variations induced by different radar configurations. This formulation bypasses the need for detailed radar hardware specifications and allows efficient simulation of range-azimuth-Doppler (RAD) tensors across diverse sensor settings. We further construct a mixed real-simulated dataset with attribute annotations to robustly train the network. Extensive evaluations on multiple downstream tasks-including 2D/3D object detection and radar semantic segmentation-demonstrate that SA-Radar’s simulated data is both realistic and effective, consistently improving model performance when used standalone or in combination with real data. Our framework also supports simulation in novel sensor viewpoints and edited scenes, showcasing its potential as a general-purpose radar data engine for autonomous driving applications. Code and additional materials are available at this https URL.
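
The central idea, conditioning a volumetric network on waveform parameters rather than on a named hardware model, can be sketched with a small FiLM-style modulation layer. This is an illustrative stand-in, not ICFAR-Net itself; the parameter list (bandwidth, chirp duration, PRF, carrier frequency) and the FiLM form are assumptions.

```python
import torch
import torch.nn as nn

class WaveformFiLM(nn.Module):
    """Modulate 3D RAD-tensor features with embedded radar waveform parameters."""
    def __init__(self, n_params: int = 4, channels: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_params, 128), nn.SiLU(),
            nn.Linear(128, 2 * channels),       # per-channel scale and shift
        )

    def forward(self, feats: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, range, azimuth, doppler); params: (B, n_params)
        gamma, beta = self.mlp(params).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None, None]
        beta = beta[:, :, None, None, None]
        return feats * (1 + gamma) + beta       # attribute-aware features

film = WaveformFiLM()
x = torch.randn(2, 64, 16, 16, 16)
p = torch.tensor([[150e6, 40e-6, 10e3, 77e9]] * 2)  # hypothetical waveform params
print(film(x, torch.log1p(p)).shape)  # torch.Size([2, 64, 16, 16, 16])
```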

[CV-157] A Tree-guided CNN for image super-resolution

【Quick Read】: This paper addresses the difficulty deep convolutional neural networks face in identifying the key layers that improve performance in image super-resolution. The key to its solution is a tree-guided CNN for image super-resolution (TSRNet), which uses a tree architecture to strengthen the effect of key nodes and amplify the relations among hierarchical information, improving image recovery. It also introduces cosine transform techniques to extract cross-domain information and compensate for insufficient structural information, and applies the adaptive Nesterov momentum optimizer (Adan) to optimize parameters and improve training efficiency.

Link: https://arxiv.org/abs/2506.02585
Authors: Chunwei Tian, Mingjian Song, Xiaopeng Fan, Xiangtao Zheng, Bob Zhang, David Zhang
Affiliations: Harbin Institute of Technology; Northwestern Polytechnical University; Fuzhou University; University of Macau; The Chinese University of Hong Kong (Shenzhen); Shenzhen Institute of Artificial Intelligence and Robotics for Society
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for publication in IEEE Transactions on Consumer Electronics. 10 pages, 6 figures. Its code can be obtained at this https URL

Abstract:Deep convolutional neural networks can extract more accurate structural information via deep architectures to obtain good performance in image super-resolution. However, it is not easy to identify the effect of important layers within a single network architecture, which can limit super-resolution performance. In this paper, we design a tree-guided CNN for image super-resolution (TSRNet). It uses a tree architecture to guide a deep network, enhancing the effect of key nodes and amplifying the relation of hierarchical information to improve the ability to recover images. To prevent insufficiency of the obtained structural information, cosine transform techniques in the TSRNet are used to extract cross-domain information and improve the performance of image super-resolution. The adaptive Nesterov momentum optimizer (Adan) is applied to optimize parameters and boost the effectiveness of training the super-resolution model. Extensive experiments verify the superiority of the proposed TSRNet for restoring high-quality images. Its code can be obtained at this https URL.

[CV-158] Dynamic mapping from static labels: remote sensing dynamic sample generation with temporal-spectral embedding

【Quick Read】: This paper tackles the rapid obsolescence of sample data in remote sensing geographic mapping caused by land surface dynamics, which requires frequent manual updates and imposes a heavy labor burden. The key to its solution is the TasGen framework, which generates dynamic samples from existing single-date static labeled samples: a temporal-spectral embedding simultaneously models the spectral and temporal dependencies of time-series remote sensing imagery, capturing land surface changes without additional manual annotation.

Link: https://arxiv.org/abs/2506.02574
Authors: Shuai Yuan, Shuang Chen, Tianwu Lin, Jie Wang, Peng Gong
Affiliations: Department of Geography, The University of Hong Kong, Hong Kong, China; Pengcheng Laboratory, Shenzhen, China; Department of Electronics and Information Engineering, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Accurate remote sensing geographic mapping depends heavily on representative and timely sample data. However, rapid changes in land surface dynamics necessitate frequent updates, quickly rendering previously collected samples obsolete and imposing significant labor demands for continuous manual updates. In this study, we aim to address this problem through dynamic sample generation using existing single-date static labeled samples. We introduce TasGen, a two-stage automated framework that generates dynamic samples, designed to simultaneously model spectral and temporal dependencies in time-series remote sensing imagery via temporal-spectral embedding, capturing land surface changes without additional manual annotations.

[CV-159] Multi-modal brain MRI synthesis based on SwinUNETR

【Quick Read】: This paper addresses the problem of missing modalities in brain magnetic resonance imaging (MRI) in clinical practice, which limits the use of multi-modal MRI for clinical diagnosis. The key to its solution is the SwinUNETR architecture, which combines the strengths of the Swin Transformer and convolutional neural networks (CNNs): hierarchical feature extraction and window-based self-attention provide global context awareness while preserving detailed spatial resolution, enabling the synthesis of accurate and clinically valuable images.

Link: https://arxiv.org/abs/2506.02467
Authors: Haowen Pang, Weiyan Guo, Chuyang Ye
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures

Abstract:Multi-modal brain magnetic resonance imaging (MRI) plays a crucial role in clinical diagnostics by providing complementary information across different imaging modalities. However, a common challenge in clinical practice is missing MRI modalities. In this paper, we apply SwinUNETR to the synthesis of missing modalities in brain MRI. SwinUNETR is a novel neural network architecture designed for medical image analysis, integrating the strengths of Swin Transformer and convolutional neural networks (CNNs). The Swin Transformer, a variant of the Vision Transformer (ViT), incorporates hierarchical feature extraction and window-based self-attention mechanisms, enabling it to capture both local and global contextual information effectively. By combining the Swin Transformer with CNNs, SwinUNETR merges global context awareness with detailed spatial resolution. This hybrid approach addresses the challenges posed by the varying modality characteristics and complex brain structures, facilitating the generation of accurate and realistic synthetic images. We evaluate the performance of SwinUNETR on brain MRI datasets and demonstrate its superior capability in generating clinically valuable images. Our results show significant improvements in image quality, anatomical consistency, and diagnostic value.
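
Since SwinUNETR has a reference implementation in MONAI, the modality-synthesis setup can be sketched directly on top of it. This is a minimal sketch under assumptions: the 3-in/1-out channel layout (three available modalities predicting the missing one), the L1 objective, and the constructor arguments follow common MONAI 1.x usage, not necessarily the paper's configuration.

```python
import torch
from monai.networks.nets import SwinUNETR

# Synthesize one missing MRI modality from three available ones.
model = SwinUNETR(
    in_channels=3,      # e.g. T1, T2, FLAIR
    out_channels=1,     # e.g. the missing T1ce
    feature_size=48,
    # older MONAI releases additionally require img_size=(96, 96, 96)
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
l1 = torch.nn.L1Loss()

inputs = torch.randn(1, 3, 96, 96, 96)   # stand-in for co-registered volumes
target = torch.randn(1, 1, 96, 96, 96)   # stand-in for the missing modality

pred = model(inputs)                     # one illustrative training step
loss = l1(pred, target)
loss.backward()
opt.step()
print(float(loss))
```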

[CV-160] Unrolling Nonconvex Graph Total Variation for Image Denoising

【Quick Read】: This paper addresses the performance limitations of conventional model-based image denoising optimizations that use convex regularization terms (such as total variation, TV), which convexify the \ell_0-norm to promote sparse signal representations. The key to its solution is a non-convex total variation term in a graph setting (NC-GTV), which, when combined with an \ell_2-norm fidelity term, yields a convex objective with no extraneous local minima. The crux is selecting a parameter a characterizing the graph Huber function so that the overall objective remains convex; this parameter is computed efficiently via an adaptation of the Gershgorin Circle Theorem (GCT).

Link: https://arxiv.org/abs/2506.02381
Authors: Songlin Wei, Gene Cheung, Fei Chen, Ivan Selesnick
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conventional model-based image denoising optimizations employ convex regularization terms, such as total variation (TV), which convexifies the \ell_0-norm to promote sparse signal representation. Instead, we propose a new non-convex total variation term in a graph setting (NC-GTV), such that when combined with an \ell_2-norm fidelity term for denoising, it leads to a convex objective with no extraneous local minima. We define NC-GTV using a new graph variant of the Huber function, interpretable as a Moreau envelope. The crux is the selection of a parameter a characterizing the graph Huber function that ensures overall objective convexity; we efficiently compute a via an adaptation of the Gershgorin Circle Theorem (GCT). To minimize the convex objective, we design a linear-time algorithm based on the Alternating Direction Method of Multipliers (ADMM) and unroll it into a lightweight feed-forward network for data-driven parameter learning. Experiments show that our method outperforms unrolled GTV and other representative image denoising schemes, while employing far fewer network parameters.
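
A minimal sketch of two ingredients named above: the Huber-style edge penalty evaluated over a graph, and a Gershgorin-disc upper bound on the largest Laplacian eigenvalue of the kind the parameter selection adapts. The exact convexity condition linking a, the fidelity weight, and the graph spectrum is the paper's contribution and is not reproduced here; the code only illustrates the objects involved.

```python
import numpy as np

def huber(d, a):
    """Classical Huber function: quadratic near 0, linear beyond |d| = a."""
    d = np.abs(d)
    return np.where(d <= a, d**2 / (2 * a), d - a / 2)

def graph_huber_tv(x, W, a):
    """Sum of Huber penalties over weighted graph edges (upper triangle)."""
    i, j = np.triu_indices_from(W, k=1)
    return float(np.sum(W[i, j] * huber(x[i] - x[j], a)))

def gershgorin_lmax_bound(L):
    """Gershgorin upper bound on the largest eigenvalue of a symmetric matrix."""
    radii = np.sum(np.abs(L), axis=1) - np.abs(np.diag(L))
    return float(np.max(np.diag(L) + radii))

# Tiny 4-node path graph.
W = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
L = np.diag(W.sum(1)) - W
x = np.array([0.0, 0.1, 0.9, 1.0])
print(graph_huber_tv(x, W, a=0.5), gershgorin_lmax_bound(L))
```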

[CV-161] Dual encoding feature filtering generalized attention UNET for retinal vessel segmentation

【Quick Read】: This paper targets several problems in retinal vessel segmentation, including limited training data, imbalanced data distributions, and inadequate feature extraction, which restrict segmentation performance and model generalization. The key to its solution is the proposed DEFFA-Unet architecture: an extra encoder processes domain-invariant pre-processed inputs for richer feature encoding and better generalization; a feature filtering fusion module ensures precise feature filtering and robust hybrid feature fusion; traditional skip connections are replaced with an attention-guided feature reconstructing fusion module to meet the high-precision demands of the task; and novel data augmentation and balancing methods counter data scarcity and distribution imbalance, further boosting robustness and generalization.

Link: https://arxiv.org/abs/2506.02312
Authors: Md Tauhidul Islam, Wu Da-Wen, Tang Qing-Qing, Zhao Kai-Yang, Yin Teng, Li Yan-Fei, Shang Wen-Yi, Liu Jing-Yu, Zhang Hai-Xian
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Retinal blood vessel segmentation is crucial for diagnosing ocular and cardiovascular diseases. Although the introduction of U-Net by Olaf Ronneberger in 2015 significantly advanced this field, issues like limited training data, imbalanced data distribution, and inadequate feature extraction persist, hindering both segmentation performance and optimal model generalization. Addressing these critical issues, DEFFA-Unet is proposed, featuring an additional encoder to process domain-invariant pre-processed inputs, enabling richer feature encoding and enhanced model generalization. A feature filtering fusion module is developed to ensure precise feature filtering and robust hybrid feature fusion. In response to the task-specific need for higher precision, where false positives are very costly, traditional skip connections are replaced with the attention-guided feature reconstructing fusion module. Additionally, innovative data augmentation and balancing methods are proposed to counter data scarcity and distribution imbalance, further boosting the robustness and generalization of the model. With a comprehensive suite of evaluation metrics, extensive validations on four benchmark datasets (DRIVE, CHASEDB1, STARE, and HRF) and an SLO dataset (IOSTAR) demonstrate the proposed method's superiority over both baseline and state-of-the-art models. In particular, the proposed method significantly outperforms the compared methods in cross-validation model generalization.

[CV-162] NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution CVPR2025

【Quick Read】: This paper addresses RAW image restoration and super-resolution, specifically restoring RAW images degraded by blur and noise, and 2x upscaling of RAW Bayer images under unknown noise and blur. The key lies in developing new methods for the RAW domain to improve modern image signal processing (ISP) pipelines, an area that remains far less explored than its RGB counterpart.

Link: https://arxiv.org/abs/2506.02197
Authors: Marcos V. Conde, Radu Timofte, Zihao Lu, Xiangyu Kong, Xiaoxia Xing, Fan Wang, Suejin Han, MinKyu Park, Tianyu Zhang, Xin Luo, Yeda Chen, Dong Liu, Li Pang, Yuhang Yang, Hongzhong Wang, Xiangyong Cao, Ruixuan Jiang, Senyan Xu, Siyuan Jiang, Xueyang Fu, Zheng-Jun Zha, Tianyu Hao, Yuhong He, Ruoqi Li, Yueqi Yang, Xiang Yu, Guanlan Hong, Minmin Yi, Yuanjia Chen, Liwen Zhang, Zijie Jin, Cheng Li, Lian Liu, Wei Song, Heng Sun, Yubo Wang, Jinghua Wang, Jiajie Lu, Watchara Ruangsang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

Abstract:This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW restoration and super-resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as explored as in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and noise degradations, and (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. This report presents the current state of the art in RAW restoration.

[CV-163] Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?

【Quick Read】: This paper addresses the inadequacy of existing evaluation metrics for sparse-view CT reconstruction (such as the Structural Similarity Index Measure and Peak Signal-to-Noise Ratio) in capturing the completeness of critical anatomical structures, especially small or thin regions that are easily missed. The key to its solution is a suite of new anatomy-aware metrics that assess the completeness of different anatomical structures, together with CARE (Completeness-Aware Reconstruction Enhancement), a framework that adds structural penalties during training to encourage the preservation of important anatomy, substantially improving the structural completeness of CT reconstructions.

Link: https://arxiv.org/abs/2506.02093
Authors: Tianyu Lin, Xinran Li, Chuntung Zhuang, Qi Chen, Yuanhao Cai, Kai Ding, Alan L. Yuille, Zongwei Zhou
Affiliations: Johns Hopkins University; Yale University; Johns Hopkins Medicine
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Widely adopted evaluation metrics for sparse-view CT reconstruction, such as the Structural Similarity Index Measure and Peak Signal-to-Noise Ratio, prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
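
A per-structure completeness score of the kind described is easy to sketch: instead of averaging fidelity over all voxels, compute recall separately inside each labeled anatomical mask, so a missed thin vessel cannot hide behind a high global score. The structure grouping and the recall form below are illustrative assumptions, not the paper's exact metric definitions.

```python
import numpy as np

def structure_completeness(pred_seg, gt_seg, labels):
    """Per-structure recall: fraction of each ground-truth structure recovered.

    pred_seg, gt_seg: integer label volumes of identical shape
    labels: dict mapping structure name -> label id
    """
    out = {}
    for name, lab in labels.items():
        gt_mask = gt_seg == lab
        if gt_mask.sum() == 0:
            continue                 # structure absent in this scan
        out[name] = float((pred_seg[gt_mask] == lab).mean())
    return out

labels = {"liver": 1, "pancreas": 2, "intestine": 3, "vessel": 4}
gt = np.random.default_rng(0).integers(0, 5, size=(32, 32, 32))
pred = gt.copy()
pred[gt == 4] = 0            # simulate a reconstruction that drops vessels
print(structure_completeness(pred, gt, labels))  # vessel score collapses to 0.0
```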

[CV-164] Alzheimer's Disease Classification in Functional MRI With 4D Joint Temporal-Spatial Kernels in Novel 4D CNN Model

【Quick Read】: This paper addresses the sub-par feature extraction that results from applying 3D spatial-only models to 4D functional MRI (fMRI) data, which hurts downstream tasks such as classification. The key to its solution is a novel 4D convolutional neural network (4D CNN) that extracts 4D joint temporal-spatial kernels, learning spatial information while also capturing temporal dynamics, and thus handling the spatio-temporal nature of fMRI data more effectively. Experimental results show that this approach outperforms conventional 3D models for Alzheimer's disease diagnosis, enabling earlier detection and better interventions.

Link: https://arxiv.org/abs/2506.02060
Authors: Javier Salazar Cavazos, Scott Peltier
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in International Society for Magnetic Resonance in Medicine (ISMRM) 2025 under submission number 3398

Abstract:Previous works in the literature apply 3D spatial-only models to 4D functional MRI data, which can lead to sub-par feature extraction for downstream tasks such as classification. In this work, we develop a novel 4D convolutional network to extract 4D joint temporal-spatial kernels that not only learn spatial information but also capture temporal dynamics. Experimental results show promising performance in capturing spatial-temporal data in functional MRI compared to 3D models. The 4D CNN model improves Alzheimer's disease diagnosis on rs-fMRI data, enabling earlier detection and better interventions. Future research could explore task-based fMRI applications and regression tasks, enhancing understanding of cognitive performance and disease progression.
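
PyTorch has no built-in Conv4d, so a joint temporal-spatial kernel is often emulated by decomposing the 4D convolution into a sum of 3D convolutions, one per temporal kernel offset. The sketch below is one such generic construction; it illustrates a 4D joint kernel, not the paper's specific architecture.

```python
import torch
import torch.nn as nn

class Conv4d(nn.Module):
    """Naive 4D convolution: one Conv3d per temporal tap, summed over time."""
    def __init__(self, c_in, c_out, kt=3, ks=3):
        super().__init__()
        self.pad_t = kt // 2
        self.taps = nn.ModuleList(
            nn.Conv3d(c_in, c_out, ks, padding=ks // 2) for _ in range(kt)
        )

    def forward(self, x):  # x: (B, C, T, D, H, W)
        B, C, T, D, H, W = x.shape
        out = None
        for i, conv in enumerate(self.taps):
            shift = i - self.pad_t      # temporal offset of this tap
            lo, hi = max(0, -shift), min(T, T - shift)
            y = torch.zeros(B, conv.out_channels, T, D, H, W,
                            device=x.device, dtype=x.dtype)
            for t in range(lo, hi):
                y[:, :, t] += conv(x[:, :, t + shift])
            out = y if out is None else out + y
        return out

m = Conv4d(1, 4)
vol = torch.randn(1, 1, 8, 16, 16, 16)   # fMRI-like: time x 3D volume
print(m(vol).shape)  # torch.Size([1, 4, 8, 16, 16, 16])
```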

[CV-165] Surgical Foundation Model Leverag ing Compression and Entropy Maximization for Image-Guided Surgical Assistance

【Quick Read】: This paper addresses real-time video understanding in minimally invasive surgery (MIS), where traditional supervised learning is hard to apply because large annotated datasets are scarce. The key to its solution is Compress-to-Explore (C2E), a self-supervised framework that uses Kolmogorov complexity to learn compact, informative representations from surgical videos. C2E compresses images with entropy-maximizing decoders while preserving clinically relevant details, improving encoder performance without labeled data and generalizing well across a variety of surgical tasks.

Link: https://arxiv.org/abs/2506.01980
Authors: Lianhao Yin, Ozanan Meireles, Guy Rosman, Daniela Rus
Affiliations: Massachusetts General Hospital; Massachusetts Institute of Technology; Duke University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). However, supervised learning approaches require large, annotated datasets that are scarce due to annotation efforts that are prohibitive, e.g., in medical fields. Although self-supervision methods can address such limitations, current self-supervised methods often fail to capture structural and physical information in a form that generalizes across tasks. We propose Compress-to-Explore (C2E), a novel self-supervised framework that leverages Kolmogorov complexity to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data. Trained on large-scale unlabeled surgical datasets, C2E demonstrates strong generalization across a variety of surgical ML tasks, such as workflow classification, tool-tissue interaction classification, segmentation, and diagnosis tasks, providing improved performance as a surgical visual foundation model. As we further show in the paper, the model’s internal compact representation better disentangles features from different structural parts of images. The resulting performance improvements highlight the yet untapped potential of self-supervised learning to enhance surgical AI and improve outcomes in MIS.

Artificial Intelligence

[AI-0] PoLAR: Polar-Decomposed Low-Rank Adapter Representation

【Quick Read】: This paper addresses the problem that low-rank adaptation of large-scale models suffers from a stable rank that is too low, degrading fine-tuning performance. The key to its solution is the PoLAR parameterization, inspired by the polar decomposition, which factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix, making fuller use of the allocated subspace.

Link: https://arxiv.org/abs/2506.03133
Authors: Kai Lion, Liang Zhang, Bingcong Li, Niao He
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments:

Abstract:We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
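
The factorization itself is compact enough to sketch. Below, the low-rank update U S V^T keeps U and V approximately on Stiefel manifolds via a QR retraction after each optimizer step, while S is an unconstrained r x r scale matrix. This is a plain PyTorch illustration of the parameterization; the paper pairs it with proper Riemannian optimization, which this sketch replaces with the simpler retraction, an assumption on my part.

```python
import torch
import torch.nn as nn

class PoLARLinear(nn.Module):
    """Frozen base weight plus a polar-style low-rank update U S V^T."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        d_out, d_in = base.weight.shape
        self.base = base.weight.detach()        # frozen pretrained weight
        self.U = nn.Parameter(torch.linalg.qr(torch.randn(d_out, rank)).Q)
        self.V = nn.Parameter(torch.linalg.qr(torch.randn(d_in, rank)).Q)
        self.S = nn.Parameter(torch.zeros(rank, rank))  # unconstrained scale

    def forward(self, x):
        delta = self.U @ self.S @ self.V.T      # zero at init, like LoRA
        return x @ (self.base + delta).T

    @torch.no_grad()
    def retract(self):
        # Pull U and V back onto their Stiefel manifolds after a step.
        self.U.copy_(torch.linalg.qr(self.U).Q)
        self.V.copy_(torch.linalg.qr(self.V).Q)

layer = PoLARLinear(nn.Linear(32, 16), rank=4)
loss = layer(torch.randn(2, 32)).sum()
loss.backward(); layer.retract()
print(layer.U.shape, layer.S.shape, layer.V.shape)
```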

[AI-1] Designing Algorithmic Delegates: The Role of Indistinguishability in Human-AI Handoff

【Quick Read】: This paper studies how to design the optimal algorithmic delegate when categories are present. The core challenge is that human decision-makers categorize decision problems by their observable features and decide accordingly whether to delegate a task to an AI agent, and this categorization can lead to suboptimal decisions. The key to the solution is to identify and exploit the structural relationship between observable features and optimal actions in the decision task; in particular, when the optimal action decomposes into functions of the features observed by the human and by the algorithm, efficient algorithms for producing the optimal delegate exist. Although the problem is fundamentally combinatorial and computationally hard in general, the authors give algorithms that efficiently produce the optimal delegate in several broad cases.

Link: https://arxiv.org/abs/2506.03102
Authors: Sophie Greenwood, Karen Levy, Solon Barocas, Hoda Heidari, Jon Kleinberg
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at the Twenty-Sixth ACM Conference on Economics and Computation (EC'25)

Abstract:As AI technologies improve, people are increasingly willing to delegate tasks to AI agents. In many cases, the human decision-maker chooses whether to delegate to an AI agent based on properties of the specific instance of the decision-making problem they are facing. Since humans typically lack full awareness of all the factors relevant to this choice for a given decision-making instance, they perform a kind of categorization by treating indistinguishable instances – those that have the same observable features – as the same. In this paper, we define the problem of designing the optimal algorithmic delegate in the presence of categories. This is an important dimension in the design of algorithms to work with humans, since we show that the optimal delegate can be an arbitrarily better teammate than the optimal standalone algorithmic agent. The solution to this optimal delegation problem is not obvious: we discover that this problem is fundamentally combinatorial, and illustrate the complex relationship between the optimal design and the properties of the decision-making task even in simple settings. Indeed, we show that finding the optimal delegate is computationally hard in general. However, we are able to find efficient algorithms for producing the optimal delegate in several broad cases of the problem, including when the optimal action may be decomposed into functions of features observed by the human and the algorithm. Finally, we run computational experiments to simulate a designer updating an algorithmic delegate over time to be optimized for when it is actually adopted by users, and show that while this process does not recover the optimal delegate in general, the resulting delegate often performs quite well.

[AI-2] alkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

【Quick Read】: This paper addresses how to turn pretrained video generation models into real-time, audio-driven character animators that support natural conversational experiences. The key to its solution is integrating an audio large language model (LLM) with a video generation foundation model, together with several engineering optimizations for efficient inference: adapting a SOTA image-to-video DiT into an 18-billion-parameter audio-driven character generation model; asymmetric knowledge distillation from a bidirectional teacher to enable infinite video streaming without error accumulation; and a high-throughput, low-latency inference pipeline featuring cross-device disaggregation, CUDA-stream overlap of communication and computation, and elimination of redundant recomputation.

Link: https://arxiv.org/abs/2506.03099
Authors: Chetwin Low, Weimin Wang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:In this paper, we present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations such as: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, (c) elimination of redundant recomputations to maximize frame-generation throughput. Please see demo videos here - this https URL

[AI-3] How Explanations Leak the Decision Logic: Stealing Graph Neural Networks via Explanation Alignment

【Quick Read】: This paper investigates the security risks that explainable GNNs may introduce when deployed in sensitive domains, in particular model stealing enabled by explanation mechanisms that leak the model's critical decision logic. The key to its solution is a novel stealing framework, \method, which integrates explanation alignment to capture decision logic with guided data augmentation for efficient training under limited queries, effectively replicating both the predictive behavior and the underlying reasoning patterns of target models.

Link: https://arxiv.org/abs/2506.03087
Authors: Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to growing demands for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose \method, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at this https URL.

[AI-4] Labelling Data with Unknown References

【Quick Read】: This paper considers how to establish the trustworthiness of an evaluator in the absence of labeled reference data (such as a development set). Traditional approaches either test the evaluator against data or assume that it "knows" the correct way to label, but neither is viable without labeled references: the former requires the data, and the latter is an assumption, not evidence. The key to the proposed solution is an algorithm (the "No-Data Algorithm") that requires no existing reference data: by successively posing challenges to the evaluator, it establishes trustworthiness with high probability, accepting the evaluator's output when it truly knows how to label and flagging it as untrustworthy when it cannot prove this ability.

Link: https://arxiv.org/abs/2506.03083
Authors: Adrian de Wynter
Affiliations: Unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments:

Abstract:An evaluator is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. The two ways to establish trustworthiness are either by testing it, or by assuming the evaluator 'knows' somehow the way to label the corpus. However, if labelled references (e.g., a development set) are unavailable, neither of these approaches work: the former requires the data, and the latter is an assumption, not evidence. To address this, we introduce an algorithm (the 'No-Data Algorithm') by which to establish trust in an evaluator without any existing references. Our algorithm works by successively posing challenges to said evaluator. We show that this is sufficient to establish trustworthiness w.h.p., in such a way that when the evaluator actually knows the way to label the corpus, the No-Data Algorithm accepts its output; and, conversely, flags untrustworthy evaluators when these are unable to prove it. We present formal proofs of correctness and limited experiments.
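
The flavor of the challenge protocol can be sketched as follows: hand the evaluator items whose labels are known by construction (planted by the challenger), and accept it only if it clears enough challenges that an incompetent labeller would fail with high probability. This is a toy paraphrase; the challenge-generation scheme, acceptance rule, and failure model are all assumptions, and the paper's actual algorithm and its w.h.p. guarantee are formal and differ in detail.

```python
import random

def no_data_style_trust(evaluator, make_challenge, n_rounds=200):
    """Accept an evaluator only if it answers every planted challenge correctly.

    evaluator:      callable item -> label
    make_challenge: callable () -> (item, known_label), label known by construction
    A guesser with per-round success p < 1 survives with probability p**n_rounds.
    """
    for _ in range(n_rounds):
        item, truth = make_challenge()
        if evaluator(item) != truth:
            return False          # flagged untrustworthy
    return True                   # trusted w.h.p.

# Toy setting: parity labelling of bit strings.
def make_challenge():
    bits = [random.randint(0, 1) for _ in range(8)]
    return bits, sum(bits) % 2

competent = lambda bits: sum(bits) % 2
guesser = lambda bits: random.randint(0, 1)
print(no_data_style_trust(competent, make_challenge))  # True
print(no_data_style_trust(guesser, make_challenge))    # almost surely False
```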

[AI-5] StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLM s

【Quick Read】: This paper targets the large memory overhead of storing activations during backpropagation (BP) when training language models on long sequences. The key to its solution is StreamBP, a memory-efficient and exact BP method that performs a layer-wise linear decomposition of the chain rule along the sequence dimension, substantially reducing the memory cost of activations and logits. StreamBP further exploits the causal structure of language models to achieve fewer computational FLOPs and faster BP; compared with gradient checkpointing, it scales the maximum BP sequence length by 2.8 to 5.5 times while using comparable or less BP time.

Link: https://arxiv.org/abs/2506.03077
Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training language models on long sequence data is a demanding requirement for enhancing the model’s capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of the gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP achieves fewer computational FLOPs and faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times, while using comparable or even less BP time. Note that StreamBP’s sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at this https URL.
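
One concrete instance of decomposing backpropagation along the sequence is the loss/logits stage, where materializing all (T, vocab) logits dominates memory. The sketch below backpropagates the cross-entropy chunk by chunk, accumulating the gradient with respect to the hidden states and sending it through the trunk in a single backward pass; the result is exact because the loss is a sum over positions. This illustrates the flavor of StreamBP's chain-rule decomposition at one stage only (the paper extends it layer-wise through the whole model), so treat it as an assumption-laden miniature, not the method itself.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss_backward(hidden, lm_head, targets, chunk=1024):
    """Exact backward through the logits/loss stage, one sequence chunk at a time.

    hidden:  (T, d) trunk outputs, part of the autograd graph
    lm_head: nn.Linear(d, vocab)
    targets: (T,) token ids
    Only a (chunk, vocab) logits block is ever alive in memory.
    """
    T = hidden.shape[0]
    h_det = hidden.detach().requires_grad_(True)   # cut the trunk graph here
    total = 0.0
    for s in range(0, T, chunk):
        logits = lm_head(h_det[s:s + chunk])
        loss = F.cross_entropy(logits, targets[s:s + chunk], reduction="sum") / T
        loss.backward()        # accumulates into lm_head grads and h_det.grad
        total += float(loss)
    hidden.backward(h_det.grad)  # single backward pass through the trunk
    return total

trunk = torch.nn.Linear(64, 64)
lm_head = torch.nn.Linear(64, 1000)
hidden = trunk(torch.randn(4096, 64))
tgt = torch.randint(0, 1000, (4096,))
print(chunked_lm_loss_backward(hidden, lm_head, tgt, chunk=512))
```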

[AI-6] Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models ICML2025

【Quick Read】: This paper addresses the safety problem that arises as foundation models (FMs) scale: instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe, while existing alignment approaches struggle with the complexity of value specification and emergent power-seeking behavior. The key to the proposed solution is "Corrigibility as a Singular Target" (CAST): designing FMs whose overriding objective is to empower designated human principals to guide, correct, and control them, shifting instrumental drives from static value-loading to dynamic human empowerment.

Link: https://arxiv.org/abs/2506.03056
Authors: Ram Potham (Independent Researcher), Max Harms (Machine Intelligence Research Institute)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Preprint. This work has been submitted to the Reliable and Responsible Foundation Models Workshop at ICML 2025 for review

Abstract:Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe. Current alignment approaches struggle with value specification complexity and fail to address emergent power-seeking behaviors. We propose “Corrigibility as a Singular Target” (CAST)-designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them. This paradigm shift from static value-loading to dynamic human empowerment transforms instrumental drives: self-preservation serves only to maintain the principal’s control; goal modification becomes facilitating principal guidance. We present a comprehensive empirical research agenda spanning training methodologies (RLAIF, SFT, synthetic data generation), scalability testing across model sizes, and demonstrations of controlled instructability. Our vision: FMs that become increasingly responsive to human guidance as capabilities grow, offering a path to beneficial AI that remains as tool-like as possible, rather than supplanting human judgment. This addresses the core alignment problem at its source, preventing the default trajectory toward misaligned instrumental convergence.

[AI-7] EDEN: Entorhinal Driven Egocentric Navigation Toward Robotic Deployment

【Quick Read】: This paper addresses the fragility of deep reinforcement learning agents across varying scenarios, in contrast to the adaptivity and flexibility of humans. The key to its solution is EDEN, a framework inspired by the mammalian entorhinal-hippocampal system that combines learned grid-cell-like representations (resembling those in the medial entorhinal cortex) with reinforcement learning, enabling agents to perform path integration and vector-based navigation from visual and motion sensor data. The core component is a grid cell encoder that transforms egocentric motion into periodic spatial codes, producing low-dimensional, interpretable position embeddings that make navigation more efficient and reliable.

Link: https://arxiv.org/abs/2506.03046
Authors: Mikolaj Walczak, Romina Aalishah, Wyatt Mackey, Brittany Story, David L. Boothe Jr., Nicholas Waytowich, Xiaomin Lin, Tinoosh Mohsenin
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep reinforcement learning agents are often fragile while humans remain adaptive and flexible to varying scenarios. To bridge this gap, we present EDEN, a biologically inspired navigation framework that integrates learned entorhinal-like grid cell representations and reinforcement learning to enable autonomous navigation. Inspired by the mammalian entorhinal-hippocampal system, EDEN allows agents to perform path integration and vector-based navigation using visual and motion sensor data. At the core of EDEN is a grid cell encoder that transforms egocentric motion into periodic spatial codes, producing low-dimensional, interpretable embeddings of position. To generate these activations from raw sensory input, we combine fiducial marker detections in the lightweight MiniWorld simulator and DINO-based visual features in the high-fidelity Gazebo simulator. These spatial representations serve as input to a policy trained with Proximal Policy Optimization (PPO), enabling dynamic, goal-directed navigation. We evaluate EDEN in both MiniWorld, for rapid prototyping, and Gazebo, which offers realistic physics and perception noise. Compared to baseline agents using raw state inputs (e.g., position, velocity) or standard convolutional image encoders, EDEN achieves a 99% success rate in the simple scenarios and 94% in complex floorplans with occluded paths, with more efficient and reliable step-wise navigation. In addition, as a replacement for ground-truth activations, we present a trainable grid cell encoder that enables the development of periodic grid-like patterns from vision and motion sensor data, emulating the development of such patterns in biological mammals. This work represents a step toward biologically grounded spatial intelligence in robotics, bridging neural navigation principles with reinforcement learning for scalable deployment.
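
Grid-cell-like periodic codes have a standard textbook construction: for each module (spatial scale), project the 2D position onto three lattice directions 60 degrees apart and take a sinusoid of each projection. The sketch below follows that classical recipe as an illustration of the "periodic spatial code" the abstract describes; EDEN's trainable encoder learns such codes from vision and motion rather than computing them in closed form.

```python
import numpy as np

def grid_code(pos, scales=(0.5, 1.0, 2.0)):
    """Closed-form grid-cell-style embedding of a 2D position.

    pos: (2,) position; returns a vector of 3 * len(scales) values.
    """
    angles = np.deg2rad([0, 60, 120])                  # hexagonal lattice axes
    dirs = np.stack([np.cos(angles), np.sin(angles)])  # (2, 3)
    code = []
    for lam in scales:                                 # one module per scale
        proj = (2 * np.pi / lam) * (pos @ dirs)        # phase along each axis
        code.append(np.cos(proj))
    return np.concatenate(code)

print(grid_code(np.array([0.30, 1.20])).round(3))
print(grid_code(np.array([0.35, 1.20])).round(3))  # nearby position, similar code
```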

[AI-8] stAgent : An Adaptive and Intelligent Expert for Human Assessment

【Quick Read】: This paper addresses the shortcomings of traditional adaptive testing: mechanized algorithms invite guessing behavior and struggle with open-ended questions, while subjective assessments suffer from noisy response data and coarse-grained test outputs. The key to its solution is TestAgent, a large language model (LLM)-powered agent that enhances adaptive testing through interactive dialogue, supporting personalized question selection, capturing test-takers' responses and anomalies, and delivering precise outcomes, improving both accuracy and efficiency.

Link: https://arxiv.org/abs/2506.03032
Authors: Junhao Yu, Yan Zhuang, YuXuan Sun, Weibo Gao, Qi Liu, Mingyue Cheng, Zhenya Huang, Enhong Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures

Abstract:Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and has now been widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. This is the first application of LLMs in adaptive testing. TestAgent supports personalized question selection, captures test-takers’ responses and anomalies, and provides precise outcomes through dynamic, conversational interactions. Experiments on psychological, educational, and lifestyle assessments show our approach achieves more accurate results with 20% fewer questions than state-of-the-art baselines, and testers preferred it in speed, smoothness, and other dimensions.

[AI-9] Linear Spatial World Models Emerge in Large Language Models

【Quick Read】: This paper asks whether large language models (LLMs) implicitly encode linear spatial world models, that is, linear representations of physical space and object configurations. The key to its solution is a formal framework for spatial world models, combined with probes trained on a synthetic dataset to decode object positions, an evaluation of the geometric consistency of the latent space, and causal interventions to test whether these spatial representations are functionally used by the model.

Link: https://arxiv.org/abs/2506.02996
Authors: Matthieu Tehenan, Christian Bolivar Moya, Tenghai Long, Guang Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated emergent abilities across diverse tasks, raising the question of whether they acquire internal world models. In this work, we investigate whether LLMs implicitly encode linear spatial world models, which we define as linear representations of physical space and object configurations. We introduce a formal framework for spatial world models and assess whether such structure emerges in contextual embeddings. Using a synthetic dataset of object positions, we train probes to decode object positions and evaluate geometric consistency of the underlying space. We further conduct causal interventions to test whether these spatial representations are functionally used by the model. Our results provide empirical evidence that LLMs encode linear spatial world models.
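
Probing for a linear spatial world model reduces to fitting a linear map from hidden states to coordinates and then checking geometry. The sketch below uses closed-form ridge regression on synthetic stand-ins for contextual embeddings, plus a simple geometric-consistency check (correlation between true and decoded pairwise distances); the data, dimensions, and consistency statistic are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 256
positions = rng.uniform(-1, 1, size=(n, 2))          # ground-truth (x, y)
A = rng.normal(size=(2, d))                          # planted linear structure
embeddings = positions @ A + 0.1 * rng.normal(size=(n, d))

# Closed-form ridge probe: W = (X^T X + lam * I)^(-1) X^T Y
lam = 1e-2
X, Y = embeddings, positions
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
decoded = X @ W

r2 = 1 - ((Y - decoded) ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()

# Geometric consistency: do decoded points preserve pairwise distances?
idx = rng.integers(0, n, size=(2000, 2))
true_d = np.linalg.norm(Y[idx[:, 0]] - Y[idx[:, 1]], axis=1)
dec_d = np.linalg.norm(decoded[idx[:, 0]] - decoded[idx[:, 1]], axis=1)
consistency = np.corrcoef(true_d, dec_d)[0, 1]

print(f"probe R^2 = {r2:.3f}, distance correlation = {consistency:.3f}")
```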

[AI-10] UniConFlow: A Unified Constrained Generalization Framework for Certified Motion Planning with Flow Matching Models

【Quick Read】: This paper addresses the difficulty of handling multiple types of constraints (such as collision avoidance and dynamic consistency) in a unified way for robot trajectory generation; existing methods typically treat these constraints separately or only partially. The key to its solution is UniConFlow, a unified flow matching (FM) based framework that systematically incorporates both equality and inequality constraints, introduces a prescribed-time zeroing function for greater flexibility during inference, and derives guidance inputs through a quadratic programming formulation that ensures constraint satisfaction without retraining or auxiliary controllers.

Link: https://arxiv.org/abs/2506.02955
Authors: Zewen Yang, Xiaobing Dai, Dian Yu, Qianru Li, Yu Li, Valentin Le Mesle
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Generative models have become increasingly powerful tools for robot motion generation, enabling flexible and multimodal trajectory generation across various tasks. Yet, most existing approaches remain limited in handling multiple types of constraints, such as collision avoidance and dynamic consistency, which are often treated separately or only partially considered. This paper proposes UniConFlow, a unified flow matching (FM) based framework for trajectory generation that systematically incorporates both equality and inequality constraints. UniConFlow introduces a novel prescribed-time zeroing function to enhance flexibility during the inference process, allowing the model to adapt to varying task requirements. To ensure constraint satisfaction, particularly with respect to obstacle avoidance, admissible action range, and kinodynamic consistency, the guidance inputs to the FM model are derived through a quadratic programming formulation, which enables constraint-aware generation without requiring retraining or auxiliary controllers. We conduct mobile navigation and high-dimensional manipulation tasks, demonstrating improved safety and feasibility compared to state-of-the-art constrained generative planners. Project page is available at this https URL.

[AI-11] Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing

【Quick Read】: This paper addresses knowledge tracing (KT) methods that overlook deficiencies in cognitive representation and interference from non-cognitive factors (such as slipping and guessing), which makes the continuity and coherence of the cognitive process hard to capture. Existing methods focus mainly on feature enhancement but fail to maintain cognitive continuity and coherence, introducing additional prediction bias and modeling cost. The key to its solution is the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model, which uses a dynamic programming algorithm to optimize cognitive representations based on question difficulty and the performance intervals between questions, aligning them with the student's cognitive patterns while preserving overall continuity and coherence. It further performs partitioned optimization of cognitive representations and improves cognitive expression through a weighted fusion of optimized record representations and relations learned from a bipartite graph, enhancing reliability and accuracy.

Link: https://arxiv.org/abs/2506.02949
Authors: Lixiang Xu, Xianwei Ding, Xin Yuan, Richang Hong, Feiping Nie, Enhong Chen, Philip S. Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge Tracing (KT) involves monitoring the changes in a student’s knowledge over time by analyzing their past responses, with the goal of predicting future performance. However, most existing methods primarily focus on feature enhancement, while overlooking deficiencies in cognitive representation and the ability to express cognition, issues often caused by interference from non-cognitive factors such as slipping and guessing. This limitation hampers the ability to capture the continuity and coherence of the student’s cognitive process. As a result, many methods may introduce more prediction bias and modeling costs due to their inability to maintain cognitive continuity and coherence. Based on the above discussion, we propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model. This model employs a dynamic programming algorithm to optimize cognitive representations based on the difficulty of the questions and the performance intervals between them. This approach ensures that the cognitive representation aligns with the student’s cognitive patterns, maintaining overall continuity and coherence. As a result, it provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states. Additionally, the CRDP-KT model performs partitioned optimization of cognitive representations to enhance the reliability of the optimization process. Furthermore, it improves its ability to express the student’s cognition through a weighted fusion of optimized record representations and relationships learned from a bipartite graph. Finally, experiments conducted on three public datasets validate the effectiveness of the proposed CRDP-KT model.

[AI-12] hinkTank: A Framework for Generalizing Domain-Specific AI Agent Systems into Universal Collaborative Intelligence Platforms

【Quick Read】: This paper addresses how to transform specialized AI agent systems into general collaborative intelligence platforms for complex problem solving across domains. The key to its solution is to systematically generalize agent roles, meeting structures, and knowledge integration mechanisms through role abstraction, generalization of meeting types for iterative collaboration, and the integration of Retrieval-Augmented Generation with advanced knowledge storage, thereby enabling expertise creation and robust knowledge sharing.

Link: https://arxiv.org/abs/2506.02931
Authors: Praneet Sai Madhu Surabhi, Dheeraj Reddy Mudireddy, Jian Tao
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents ThinkTank, a comprehensive and scalable framework designed to transform specialized AI agent systems into versatile collaborative intelligence platforms capable of supporting complex problem-solving across diverse domains. ThinkTank systematically generalizes agent roles, meeting structures, and knowledge integration mechanisms by adapting proven scientific collaboration methodologies. Through role abstraction, generalization of meeting types for iterative collaboration, and the integration of Retrieval-Augmented Generation with advanced knowledge storage, the framework facilitates expertise creation and robust knowledge sharing. ThinkTank enables organizations to leverage collaborative AI for knowledge-intensive tasks while ensuring data privacy and security through local deployment, utilizing frameworks like Ollama with models such as Llama3.1. The ThinkTank framework is designed to deliver significant advantages in cost-effectiveness, data security, scalability, and competitive positioning compared to cloud-based alternatives, establishing it as a universal platform for AI-driven collaborative problem-solving. The ThinkTank code is available at this https URL

[AI-13] he Limits of Predicting Agents from Behaviour

【Quick Read】: This paper asks how to infer an agent's beliefs, intentions, and goals from its behaviour, and how to use those inferred beliefs to reliably predict its behaviour in novel environments. The key to the solution is to assume the agent's behaviour is guided by a world model and, from this, derive theoretical bounds on the agent's behaviour in new (unseen) deployment environments, providing a theoretical foundation for predicting purposeful agents from behavioural data alone.

Link: https://arxiv.org/abs/2506.02923
Authors: Alexis Bellot, Jonathan Richens, Tom Everitt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent’s beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent’s behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent’s behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent’s behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.

[AI-14] Sample Predict then Proceed: Self-Verification Sampling for Tool Use of LLM s

【Quick Read】: This paper addresses the unique challenges large language models (LLMs) face when using tools in stateful environments, where existing test-time compute strategies that rely on repeated trials in the environment are impractical. The key to its solution is Dynamics Modelling (DyMo), which augments LLMs with a state prediction capability alongside function calling during post-training, so that models can predict the future states of their own actions through an internal environment model, significantly improving tool-use success rates and reducing hallucinations.

Link: https://arxiv.org/abs/2506.02918
Authors: Shangmin Guo, Omar Darwiche Domingues, Raphaël Avalos, Aaron Courville, Florian Strub
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Tool use in stateful environments presents unique challenges for large language models (LLMs), where existing test-time compute strategies relying on repeated trials in the environment are impractical. We propose dynamics modelling (DyMo), a method that augments LLMs with a state prediction capability alongside function calling during post-training. This enables LLMs to predict the future states of their actions through an internal environment model. On the Berkeley Function Calling Leaderboard V2, DyMo improves success rates and significantly reduces hallucinations. We further integrate the internal environment model into self-verification sampling (SVS), and show that this substantially improves pass^k over number of trials k, and allows the model to refuse unreliable outputs. Together, DyMo and SVS greatly enhance the effectiveness and reliability of LLMs for tool use. We believe this work charts a path towards scalable planning RL methods for LLM inference without repeatedly querying the oracle environment.

[AI-15] Its the Thought that Counts: Evaluating the Attempts of Frontier LLM s to Persuade on Harmful Topics

【Quick Read】: This paper addresses the risk that large language models (LLMs) attempt persuasion in harmful contexts, i.e., whether a model will, absent appropriate guardrails, proactively generate content intended to change beliefs or behavior. Existing benchmarks focus on persuasion success and overlook a model's propensity to attempt persuasion in harmful scenarios. The key to the solution is the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts: it evaluates models' willingness to persuade across diverse topics in a multi-turn conversational setup between simulated persuader and persuadee agents, and introduces an automated evaluator model to identify and quantify the frequency and context of persuasion attempts, exposing gaps in current safety guardrails and underscoring the importance of evaluating willingness to persuade as a key dimension of LLM risk.

Link: https://arxiv.org/abs/2506.02873
Authors: Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, that shifts the focus from persuasion success to persuasion attempts, operationalized as a model’s willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at this http URL

[AI-16] Surfer-H Meets Holo1 : Cost-Efficient Web Agent Powered by Open Weights

【Quick Read】: This paper addresses the balance between efficiency and accuracy for AI web agents executing user-defined tasks on the web. The key to its solution is integrating Vision-Language Models (VLMs): Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction, is paired with Surfer-H, a cost-efficient web agent, achieving a state-of-the-art 92.2% on the WebVoyager benchmark and striking a Pareto-optimal balance between accuracy and cost-efficiency.

Link: https://arxiv.org/abs/2506.02865
Authors: Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d'Andigné, Hubert de La Jonquière, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin Derupti, Michael Eickenberg, Mathïs Federico, Charles Kantor, Xavier Koegler, Yann Labbé, Matthew C. H. Lee, Erwan Le Jumeau de Kergaradec, Amir Mahla, Avshalom Manevich, Adrien Maret, Charles Masson, Rafaël Maurin, Arturo Mena, Philippe Modard, Axel Moyal, Axel Nguyen Kerbel, Julien Revelle, Mats L. Richter, María Santos, Laurent Sifre, Maxime Theillard, Marc Thibault, Louis Thiry, Léo Tronchon, Nicolas Usunier, Tony Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Alphabetical order

Abstract:We present Surfer-H, a cost-efficient web agent that integrates Vision-Language Models (VLM) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves a 92.2% state-of-the-art performance on WebVoyager, striking a Pareto-optimal balance between accuracy and cost-efficiency. To accelerate research advancement in agentic systems, we are open-sourcing both our WebClick evaluation dataset and the Holo1 model weights.

[AI-17] BNPO: Beta Normalization Policy Optimization

【Quick Read】: This paper addresses the problem that current policy optimization methods with rule-based, binary-valued reward functions either neglect reward normalization or use static normalization strategies, leading to unstable gradient estimates and hurting training stability. The key to its solution is Beta Normalization Policy Optimization (BNPO), a new policy optimization method that adaptively normalizes rewards with a Beta distribution whose parameters are dynamically updated, matching the normalization to the evolving policy distribution and yielding more precise, lower-variance gradient estimates that promote stable training.

Link: https://arxiv.org/abs/2506.02864
Authors: Changyi Xiao, Mengdi Zhang, Yixin Cao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. BNPO aligns the normalization with the changing policy distribution, enabling more precise and lower-variance gradient estimation, which in turn promotes stable training dynamics. We provide theoretical analysis demonstrating BNPO’s variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Furthermore, we introduce an advantage decomposition mechanism to extend BNPO’s applicability to more complex reward systems. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks. The code is available at this https URL.
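
The normalization step can be sketched with a Beta whose parameters are updated from recent outcomes: the Beta supplies a smoothed success rate that tracks the policy as it improves, and rewards are standardized against the Bernoulli moments it induces. Both the windowed pseudo-count update and the exact advantage form below are my assumptions in place of the paper's derivation.

```python
import numpy as np

class BetaNormalizer:
    """Normalize binary rewards against a Beta fit tracking recent outcomes."""
    def __init__(self, window=512):
        self.buf, self.window = [], window

    def update(self, rewards):
        self.buf = (self.buf + list(rewards))[-self.window:]

    def advantage(self, rewards):
        wins = float(sum(self.buf))
        alpha = 1.0 + wins                      # pseudo-count prior Beta(1, 1)
        beta = 1.0 + len(self.buf) - wins
        p = alpha / (alpha + beta)              # smoothed, adaptive success rate
        std = np.sqrt(max(p * (1 - p), 1e-6))   # induced Bernoulli std
        return (np.asarray(rewards, dtype=float) - p) / std

norm = BetaNormalizer()
rng = np.random.default_rng(0)
norm.update(rng.binomial(1, 0.25, size=256).tolist())  # ~25% success rate
print(norm.advantage([0.0, 1.0]).round(3))  # failures/successes, rescaled
```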

[AI-18] ru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs

【Quick Read】: This paper addresses the uncertainty home-service robots face when executing real-world tasks: ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, which produce significant open-ended uncertainty and a boundlessly large planning space. The key to its solution is Tru-POMDP, which combines structured belief generation using Large Language Models (LLMs) with principled partially observable Markov decision process (POMDP) planning: a hierarchical Tree of Hypotheses (TOH) systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals, and an open-ended POMDP model enables rigorous Bayesian belief tracking and efficient belief-space planning over these hypotheses.

Link: https://arxiv.org/abs/2506.02860
Authors: Wenjing Tang, Xinyu He, Yongxi Huang, Yunxiao Xiao, Cewu Lu, Panpan Cai
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Task planning under uncertainty is essential for home-service robots operating in the real world. Tasks involve ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, leading to significant open-ended uncertainty and a boundlessly large planning space. To address these challenges, we propose Tru-POMDP, a planner that combines structured belief generation using Large Language Models (LLMs) with principled POMDP planning. Tru-POMDP introduces a hierarchical Tree of Hypotheses (TOH), which systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals. We further formulate an open-ended POMDP model that enables rigorous Bayesian belief tracking and efficient belief-space planning over these LLM-generated hypotheses. Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.
zh
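
摘要中"LLM生成粒子信念 + 贝叶斯信念跟踪"的组合,可以用如下玩具例子说明其中的粒子重加权步骤(假设性示例:假设内容与观测模型均为虚构,仅演示思路):

```python
import numpy as np

particles = ["杯子在橱柜A", "杯子在橱柜B", "杯子在水槽旁"]  # 假设树(TOH)产出的世界状态假设
weights = np.array([0.5, 0.3, 0.2])                         # 先验信念

def likelihood(particle, observation):
    # 虚构的观测模型:打开橱柜A未发现杯子,则相应假设的似然很低
    return 0.05 if (observation == "橱柜A为空" and "橱柜A" in particle) else 1.0

obs = "橱柜A为空"
weights = weights * np.array([likelihood(p, obs) for p in particles])
weights /= weights.sum()   # 归一化得到后验信念,供后续信念空间规划使用
for p, w in zip(particles, weights):
    print(f"{p}: {w:.3f}")
```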

[AI-19] ATAG: AI-Agent Application Threat Assessment with Attack Graphs

【速读】:该论文试图解决多智能体系统(Multi-Agent Systems, MASs)中由大型语言模型(Large Language Models, LLMs)驱动的安全评估问题,尤其是由于系统内部动态复杂性和LLM漏洞的不断演变所带来的挑战。传统攻击图(Attack Graphs, AGs)方法缺乏对LLM攻击建模的特定能力。论文提出的解决方案是AI-agent应用威胁评估框架ATAG,其关键在于扩展基于逻辑的MulVAL攻击图生成工具,通过自定义事实和交互规则来准确表示AI-agent的拓扑结构、漏洞及攻击场景,并结合LLM漏洞数据库(LLM Vulnerability Database, LVD)以标准化漏洞文档流程。

链接: https://arxiv.org/abs/2506.02859
作者: Parth Atulbhai Gandhi,Akansha Shukla,David Tayouri,Beni Ifland,Yuval Elovici,Rami Puzis,Asaf Shabtai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating the security of multi-agent systems (MASs) powered by large language models (LLMs) is challenging, primarily because of the systems’ complex internal dynamics and the evolving nature of LLM vulnerabilities. Traditional attack graph (AG) methods often lack the specific capabilities to model attacks on LLMs. This paper introduces AI-agent application Threat assessment with Attack Graphs (ATAG), a novel framework designed to systematically analyze the security risks associated with AI-agent applications. ATAG extends the MulVAL logic-based AG generation tool with custom facts and interaction rules to accurately represent AI-agent topologies, vulnerabilities, and attack scenarios. As part of this research, we also created the LLM vulnerability database (LVD) to initiate the process of standardizing LLM vulnerabilities documentation. To demonstrate ATAG’s efficacy, we applied it to two multi-agent applications. Our case studies demonstrated the framework’s ability to model and generate AGs for sophisticated, multi-step attack scenarios exploiting vulnerabilities such as prompt injection, excessive agency, sensitive information disclosure, and insecure output handling across interconnected agents. ATAG is an important step toward a robust methodology and toolset to help understand, visualize, and prioritize complex attack paths in multi-agent AI systems (MAASs). It facilitates proactive identification and mitigation of AI-agent threats in multi-agent applications.
zh
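
ATAG扩展了MulVAL的"事实+交互规则"推理来生成攻击图。下面是一个高度简化的前向推理示意(假设性实现,事实与规则均为虚构,仅展示由规则派生多步攻击链的思路):

```python
# 假设性示意:从初始事实出发,迭代应用交互规则,派生多步攻击链
facts = {("controls", "attacker", "web_agent")}
rules = [
    (lambda f: ("controls", "attacker", "web_agent") in f,
     ("prompt_injection", "web_agent", "planner_agent"),
     "攻击者通过提示注入污染下游规划智能体"),
    (lambda f: ("prompt_injection", "web_agent", "planner_agent") in f,
     ("sensitive_disclosure", "planner_agent", "user_data"),
     "被污染的智能体因过度代理而泄露敏感信息"),
]

changed = True
while changed:            # 迭代直到不再产生新事实(攻击图闭包)
    changed = False
    for cond, concl, desc in rules:
        if cond(facts) and concl not in facts:
            facts.add(concl)
            print("派生攻击步骤:", desc)
            changed = True
```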

[AI-20] DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization INTERSPEECH2025

【速读】:该论文试图解决语言查询音频源分离(Language-queried Audio Source Separation, LASS)中依赖任务特定训练的问题,旨在实现无需额外训练的零样本分离。解决方案的关键在于利用预训练扩散模型的生成先验,提出一种无需训练的框架,并通过测试时优化的扩散引导掩码优化(Diffusion-Guided Mask Optimization, DGMO)来精确调整频谱图掩码,从而实现输入对齐的分离。

链接: https://arxiv.org/abs/2506.02858
作者: Geonyoung Lee,Geonhee Han,Paul Hongsuck Seo
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Interspeech 2025

点击查看摘要

Abstract:Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naïve adaptations, we identify key limitations arising from modality-specific mismatches. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation. The code is available at: this https URL
zh
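
DGMO的核心是测试时的掩码优化。下面是一个极简的PyTorch示意(假设性实现:用随机张量代替真实的混合频谱与扩散先验生成的参考频谱,仅展示"优化软掩码逼近参考谱"的流程):

```python
import torch

mixture = torch.rand(1, 128, 256)        # 混合音频的频谱(占位数据)
diffusion_ref = torch.rand(1, 128, 256)  # 文本条件下扩散先验生成的参考频谱(占位数据)

logits = torch.zeros_like(mixture, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
for step in range(200):                  # 测试时优化,无需任何任务特定训练
    mask = torch.sigmoid(logits)         # 约束到 [0,1] 的软频谱掩码
    loss = ((mask * mixture - diffusion_ref) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

separated = torch.sigmoid(logits).detach() * mixture  # 掩码后的分离频谱
```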

[AI-21] Sheaves Reloaded: A Directional Awakening

【速读】:该论文试图解决现有Sheaf Neural Networks (SNNs) 在建模图数据中的方向性信息方面存在的不足。其关键解决方案是引入了Directed Cellular Sheaf,这是一种专门设计用于显式处理边方向性的细胞层结构,并基于此定义了Directed Sheaf Laplacian,该算子能够同时捕捉图的拓扑结构和方向信息,从而构建出首个在架构中嵌入方向偏置的SNN模型——Directed Sheaf Neural Network (DSNN)。

链接: https://arxiv.org/abs/2506.02842
作者: Stefano Fiorini,Hakan Aktas,Iulia Duta,Stefano Coniglio,Pietro Morerio,Alessio Del Bue,Pietro Liò
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sheaf Neural Networks (SNNs) represent a powerful generalization of Graph Neural Networks (GNNs) that significantly improve our ability to model complex relational data. While directionality has been shown to substantially boost performance in graph learning tasks and is key to many real-world applications, existing SNNs fall short in representing it. To address this limitation, we introduce the Directed Cellular Sheaf, a special type of cellular sheaf designed to explicitly account for edge orientation. Building on this structure, we define a new sheaf Laplacian, the Directed Sheaf Laplacian, which captures both the graph’s topology and its directional information. This operator serves as the backbone of the Directed Sheaf Neural Network (DSNN), the first SNN model to embed a directional bias into its architecture. Extensive experiments on nine real-world benchmarks show that DSNN consistently outperforms baseline methods.
zh

[AI-22] DeepShop: A Benchmark for Deep Research Shopping Agents

【速读】:该论文试图解决现有在线购物网络代理(web agents)评估基准无法反映真实购物场景复杂性的问题,因为现有基准通常包含过于简单的查询和确定性路径。解决方案的关键在于提出DeepShop基准,其核心包括三个组成部分:查询多样性演化、查询复杂度演化以及细粒度与整体评估。通过生成多样且复杂的查询,并基于产品属性、搜索过滤器和用户排序偏好进行分类,同时设计自动化评估框架以全面衡量代理性能,从而更真实地评估网络代理在复杂在线购物环境中的表现。

链接: https://arxiv.org/abs/2506.02839
作者: Yougang Lyu,Xiaoyu Zhang,Lingyong Yan,Maarten de Rijke,Zhaochun Ren,Xiuying Chen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. Benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as “Find iPhone 15.” Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering product attributes, search filters, and sorting preferences, and classify them into three levels: easy, medium, and hard, based on the number of evolutions. (3) Fine-grained and holistic evaluation: We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects (product attributes, search filters, and sorting preferences) and reports the overall success rate through holistic evaluation. We conduct a systematic evaluation of retrieval-augmented generation (RAG) methods, web agents, and deep research systems. Results show that RAG struggles with complex queries due to its lack of web interaction, while other methods face significant challenges with filters and sorting preferences, leading to low overall success rates. We also perform cross-category, complexity-based evaluations and error analyses to support the advancement of deep research shopping agents.
zh

[AI-23] TaxAgent: How Large Language Model Designs Fiscal Policy ICME2025

【速读】:该论文试图解决经济不平等问题,特别是传统税收系统在适应性和应对纳税人异质性及非理性行为方面的不足。其解决方案的关键在于引入TaxAgent,这是一个将大型语言模型(Large Language Models, LLMs)与基于代理的建模(Agent-Based Modeling, ABM)相结合的新框架,通过迭代优化税率来实现公平性与生产效率之间的最优权衡。

链接: https://arxiv.org/abs/2506.02838
作者: Jizhou Wang,Xiaodan Fang,Lei Huang,Yongfeng Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: Accepted as oral presentation at ICME 2025

点击查看摘要

Abstract:Economic inequality is a global challenge, intensifying disparities in education, healthcare, and social stability. Traditional systems like the U.S. federal income tax reduce inequality but lack adaptability. Although models like the Saez Optimal Taxation adjust dynamically, they fail to address taxpayer heterogeneity and irrational behavior. This study introduces TaxAgent, a novel integration of large language models (LLMs) with agent-based modeling (ABM) to design adaptive tax policies. In our macroeconomic simulation, heterogeneous H-Agents (households) simulate real-world taxpayer behaviors while the TaxAgent (government) utilizes LLMs to iteratively optimize tax rates, balancing equity and productivity. Benchmarked against Saez Optimal Taxation, U.S. federal income taxes, and free markets, TaxAgent achieves superior equity-efficiency trade-offs. This research offers a novel taxation solution and a scalable, data-driven framework for fiscal policy evaluation.
zh

[AI-24] Optimising the attribute order in Fuzzy Rough Rule Induction

【速读】:该论文试图解决规则归纳算法在可解释性与性能之间的平衡问题,特别是针对基于模糊粗糙集理论的FRRI算法中属性顺序对模型性能的影响。解决方案的关键在于通过优化属性顺序和结合模糊粗糙特征选择方法来改进FRRI的性能,尽管单纯优化属性顺序未能显著提升多个评估指标,但引入模糊粗糙特征选择在一定程度上改善了平衡准确率和平均规则长度。

链接: https://arxiv.org/abs/2506.02805
作者: Henri Bollaert,Chris Cornelis,Marko Palangetić,Salvatore Greco,Roman Słowiński
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the author’s version of the work accepted for publication in Lecture Notes in Computer Science. The final publication is available at Springer via this https URL

点击查看摘要

Abstract:Interpretability is the next pivotal frontier in machine learning research. In the pursuit of glass box models - as opposed to black box models, like random forests or neural networks - rule induction algorithms are a logical and promising avenue, as the rules can easily be understood by humans. In our previous work, we introduced FRRI, a novel rule induction algorithm based on fuzzy rough set theory. We demonstrated experimentally that FRRI outperformed other rule induction methods with regards to accuracy and number of rules. FRRI leverages a fuzzy indiscernibility relation to partition the data space into fuzzy granules, which are then combined into a minimal covering set of rules. This indiscernibility relation is constructed by removing attributes from rules in a greedy way. This raises the question: does the order of the attributes matter? In this paper, we show that optimising only the order of attributes using known methods from fuzzy rough set theory and classical machine learning does not improve the performance of FRRI on multiple metrics. However, removing a small number of attributes using fuzzy rough feature selection during this step positively affects balanced accuracy and the average rule length.
zh

[AI-25] Rethinking the effects of data contamination in Code Intelligence

【速读】:该论文旨在解决代码智能任务中数据污染(data contamination)对模型性能评估的潜在影响问题。研究通过系统性实证分析,探讨了不同污染场景下预训练语言模型(PLMs)和大语言模型(LLMs)的表现。其解决方案的关键在于构建四种污染设置(输入仅污染、输出仅污染、非配对污染和配对污染),并设计相应的实验与对照组,以深入分析污染对模型在代码翻译、代码生成和代码摘要等任务中的影响。研究结果表明,污染对模型性能的影响取决于模型架构及训练与推理范式,从而为代码智能模型的评估与部署提供了新的见解。

链接: https://arxiv.org/abs/2506.02791
作者: Zhen Yang,Hongyi Lin,Yifan He,Jie Xu,Zeyu Sun,Shuo Liu,Pengpeng Wang,Zhongxing Yu,Qingyuan Liang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns regarding data contamination and its potential impact on model performance evaluation. This paper presents a systematic empirical study to investigate the fine-grained data contamination on code intelligence tasks. Our study involves diverse representative PLMs, namely RoBERTa and GPT-2, and LLMs, namely LLaMA and StarCoder, covering three major tasks: code translation, code generation, and code summarization. We categorize contamination scenarios into four types according to the code intelligence practice, namely input-only, output-only, unpaired, and paired contamination settings, and construct corresponding experimental and control groups for exploration. Experimental results show that, under the pre-training, fine-tuning, and inference paradigm adopted by PLMs, even deliberately injecting paired contamination does not lead to significant performance overestimation. But direct inference or small-scale fine-tuning uncovers the contamination effects. In contrast, LLMs with pre-training and inference paradigm are significantly affected by the paired contamination. Apart from the above, other contamination scenarios have no impact on both PLMs and LLMs. Our findings challenge the conventional belief that contamination inevitably leads to performance overestimation, providing new insights into the evaluation and deployment of code intelligence models.
zh

[AI-26] Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization

【速读】:该论文试图解决在异构节点和动态网络拓扑变化环境下,现有自动并行规划框架无法有效协同考虑这些因素而导致的训练效率受限问题。其解决方案的关键在于在动态变化的网络环境中建模异构节点,并利用基于仿真的策略确定最优并行配置,从而实现针对异构节点和复杂网络场景的细粒度工作负载分配,同时通过策略剪枝技术快速排除不可行的并行配置,显著缩小搜索空间并加速搜索过程。

链接: https://arxiv.org/abs/2506.02787
作者: Ruilong Wu,Xinjiao Li,Yisu Wang,Xinyu Chen,Dirk Kutscher
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hybrid parallelism techniques are essential for efficiently training large language models (LLMs). Nevertheless, current automatic parallel planning frameworks often overlook the simultaneous consideration of node heterogeneity and dynamic network topology changes, limiting their effectiveness in practical applications. In this paper, we address these limitations by modeling heterogeneous nodes within dynamically changing network environments and leveraging simulation-based strategies to determine optimal parallel configurations. Our approach enables fine-grained workload allocation tailored for heterogeneous nodes and complex network scenarios, achieving performance competitive with state-of-the-art methods under regular and stable network conditions. Additionally, we introduce a strategy pruning technique to rapidly discard infeasible parallel configurations, substantially reducing the search space and accelerating the search process through parallel execution within the simulator. Preliminary evaluations confirm that our method notably enhances training performance on heterogeneous nodes and demonstrates improved adaptability in complex, dynamic scenarios such as cloud computing environments.
zh
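
摘要中的"策略剪枝"可以用一个粗略的显存可行性检查来说明(假设性示意:卡数、并行度候选与显存需求均为虚构的简化估算,非论文原始判据):

```python
from itertools import product

GPUS, MEM_PER_GPU = 16, 80        # 假设:16 张 80GB 的 GPU
MODEL_MEM, ACT_MEM = 640, 320     # 假设:参数与激活的总显存需求(GB,粗略估算)

feasible = []
for tp, pp, dp in product([1, 2, 4, 8], repeat=3):
    if tp * pp * dp != GPUS:      # 张量/流水线/数据并行度的乘积须等于卡数
        continue
    per_gpu = MODEL_MEM / (tp * pp) + ACT_MEM / (tp * pp * dp)
    if per_gpu <= MEM_PER_GPU:    # 显存不可行的配置直接剪掉,免去昂贵的模拟评估
        feasible.append((tp, pp, dp))
print(feasible)                   # 只有这些配置才进入后续的模拟搜索
```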

[AI-27] AI-Driven Vehicle Condition Monitoring with Cell-Aware Edge Service Migration

【速读】:该论文试图解决车辆设备状态监测中实时诊断多种异常并适应实际边缘计算环境部署的问题,其关键在于提出一种闭环服务编排框架,通过网络相关指标动态触发服务迁移,以实现低延迟的人工智能推理和自适应服务部署。

链接: https://arxiv.org/abs/2506.02785
作者: Charalampos Kalalas,Pavol Mulinka,Guillermo Candela Belmonte,Miguel Fornell,Michail Dalgitsis,Francisco Paredes Vera,Javier Santaella Sánchez,Carmen Vicente Villares,Roshan Sedar,Eftychia Datsika,Angelos Antonopoulos,Antonio Fernández Ojea,Miquel Payaro
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 6 pages, 8 figures

点击查看摘要

Abstract:Artificial intelligence (AI) has been increasingly applied to the condition monitoring of vehicular equipment, aiming to enhance maintenance strategies, reduce costs, and improve safety. Leveraging the edge computing paradigm, AI-based condition monitoring systems process vast streams of vehicular data to detect anomalies and optimize operational performance. In this work, we introduce a novel vehicle condition monitoring service that enables real-time diagnostics of a diverse set of anomalies while remaining practical for deployment in real-world edge environments. To address mobility challenges, we propose a closed-loop service orchestration framework where service migration across edge nodes is dynamically triggered by network-related metrics. Our approach has been implemented and tested in a real-world race circuit environment equipped with 5G network capabilities under diverse operational conditions. Experimental results demonstrate the effectiveness of our framework in ensuring low-latency AI inference and adaptive service placement, highlighting its potential for intelligent transportation and mobility applications.
zh

[AI-28] Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection

【速读】:该论文旨在解决表格型异常检测(tabular anomaly detection)中由于表示纠缠和缺乏全局相关性建模而导致的性能瓶颈问题。其解决方案的关键在于引入掩码建模(mask modeling)与原型学习(prototype learning),通过在投影空间内进行解缠表示学习以设计可学习的掩码,并提取正常依赖关系作为显式的全局原型,从而增强模型对正常模式的表征能力和异常评分的准确性。

链接: https://arxiv.org/abs/2506.02757
作者: Ruiying Lu,Jinhan Liu,Chuan Du,Dandan Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 11 figures

点击查看摘要

Abstract:Tabular anomaly detection, which aims at identifying deviant samples, has been crucial in a variety of real-world applications, such as medical disease identification, financial fraud detection, intrusion monitoring, etc. Although recent deep learning-based methods have achieved competitive performances, these methods suffer from representation entanglement and the lack of global correlation modeling, which hinders anomaly detection performance. To tackle the problem, we incorporate mask modeling and prototype learning into tabular anomaly detection. The core idea is to design learnable masks by disentangled representation learning within a projection space and extracting normal dependencies as explicit global prototypes. Specifically, the overall model involves two parts: (i) During encoding, we perform mask modeling in both the data space and projection space with orthogonal basis vectors for learning shared disentangled normal patterns; (ii) During decoding, we decode multiple masked representations in parallel for reconstruction and learn association prototypes to extract normal characteristic correlations. Our proposal derives from a distribution-matching perspective, where both projection space learning and association prototype learning are formulated as optimal transport problems, and the calibration distances are utilized to refine the anomaly scores. Quantitative and qualitative experiments on 20 tabular benchmarks demonstrate the effectiveness and interpretability of our model.
zh

[AI-29] Knowledge Graph Completion by Intermediate Variables Regularization

【速读】:该论文试图解决知识图谱补全(Knowledge Graph Completion, KGC)中基于张量分解(Tensor Decomposition-based, TDB)模型容易过拟合的问题。现有正则化方法仅通过最小化嵌入的范数来约束模型,导致性能不优。论文提出了一种新的正则化方法,其关键在于最小化在不同计算方式下预测张量所涉及的中间变量的范数,从而促进预测张量的低迹范数,有效减少过拟合。该方法适用于大多数TDB模型,并保证了计算的可行性。

链接: https://arxiv.org/abs/2506.02749
作者: Changyi Xiao,Yixin Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge graph completion (KGC) can be framed as a 3-order binary tensor completion task. Tensor decomposition-based (TDB) models have demonstrated strong performance in KGC. In this paper, we provide a summary of existing TDB models and derive a general form for them, serving as a foundation for further exploration of TDB models. Despite the expressiveness of TDB models, they are prone to overfitting. Existing regularization methods merely minimize the norms of embeddings to regularize the model, leading to suboptimal performance. Therefore, we propose a novel regularization method for TDB models that addresses this limitation. The regularization is applicable to most TDB models and ensures tractable computation. Our method minimizes the norms of intermediate variables involved in the different ways of computing the predicted tensor. To support our regularization method, we provide a theoretical analysis that proves its effect in promoting low trace norm of the predicted tensor to reduce overfitting. Finally, we conduct experiments to verify the effectiveness of our regularization technique as well as the reliability of our theoretical analysis. The code is available at this https URL.
zh
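
以CP分解为例,这种"中间变量正则"与只惩罚嵌入范数的传统做法的区别可以写成如下示意(假设性实现,仅演示正则项的构造思路,非论文的一般形式):

```python
import torch

d = 64
h, r, t = (torch.randn(d, requires_grad=True) for _ in range(3))  # 头实体、关系、尾实体嵌入

score = (h * r * t).sum()                        # CP 打分函数
# 正则对象是中间变量 h∘r 与 r∘t 的范数,而非仅 ||h||、||r||、||t||
reg = (h * r).norm() ** 2 + (r * t).norm() ** 2
loss = -torch.log(torch.sigmoid(score)) + 1e-2 * reg
loss.backward()
```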

[AI-30] Solving the Pod Repositioning Problem with Deep Reinforced Adaptive Large Neighborhood Search

【速读】:该论文试图解决在机器人移动分拣系统(Robotic Mobile Fulfillment Systems, RMFS)中的货柜再定位问题(Pod Repositioning Problem, PRP),即为从拣选站返回的货柜选择最优存储位置。解决方案的关键在于将自适应大邻域搜索(Adaptive Large Neighborhood Search, ALNS)与深度强化学习(Deep Reinforcement Learning, DRL)相结合,其中DRL智能体动态选择破坏与修复算子,并调整如破坏程度和接受阈值等关键参数,同时设计了针对PRP特性的专用启发式规则,以提升求解效果。

链接: https://arxiv.org/abs/2506.02746
作者: Lin Xie,Hanyi Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 14 pages, 2 figures, conference

点击查看摘要

Abstract:The Pod Repositioning Problem (PRP) in Robotic Mobile Fulfillment Systems (RMFS) involves selecting optimal storage locations for pods returning from pick stations. This work presents an improved solution method that integrates Adaptive Large Neighborhood Search (ALNS) with Deep Reinforcement Learning (DRL). A DRL agent dynamically selects destroy and repair operators and adjusts key parameters such as destruction degree and acceptance thresholds during the search. Specialized heuristics for both operators are designed to reflect PRP-specific characteristics, including pod usage frequency and movement costs. Computational results show that this DRL-guided ALNS outperforms traditional approaches such as cheapest-place, fixed-place, binary integer programming, and static heuristics. The method demonstrates strong solution quality and illustrates the benefit of learning-driven control within combinatorial optimization for warehouse systems.
zh

[AI-31] Enriching Location Representation with Detailed Semantic Information

【速读】:该论文旨在解决传统空间嵌入方法在城市建模中过于关注空间邻近性而忽视地点的细粒度上下文信息的问题。其解决方案的关键在于提出CaLLiPer+模型,该模型通过在多模态对比学习框架中系统地整合兴趣点(Point-of-Interest, POI)名称和类别标签,从而提升空间表示的语义丰富性。实验结果表明,该方法在土地利用分类和社会经济地位分布制图任务中均取得了4%至11%的稳定性能提升,并验证了POI名称在增强位置检索和捕捉复杂城市概念中的有效性。

链接: https://arxiv.org/abs/2506.02744
作者: Junyuan Liu,Xinglei Wang,Tao Cheng
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatial representations that capture both structural and semantic characteristics of urban environments are essential for urban modeling. Traditional spatial embeddings often prioritize spatial proximity while underutilizing fine-grained contextual information from places. To address this limitation, we introduce CaLLiPer+, an extension of the CaLLiPer model that systematically integrates Point-of-Interest (POI) names alongside categorical labels within a multimodal contrastive learning framework. We evaluate its effectiveness on two downstream tasks, land use classification and socioeconomic status distribution mapping, demonstrating consistent performance gains of 4% to 11% over baseline methods. Additionally, we show that incorporating POI names enhances location retrieval, enabling models to capture complex urban concepts with greater precision. Ablation studies further reveal the complementary role of POI names and the advantages of leveraging pretrained text encoders for spatial representations. Overall, our findings highlight the potential of integrating fine-grained semantic attributes and multimodal learning techniques to advance the development of urban foundation models.
zh

[AI-32] Why do AI agents communicate in human language?

【速读】:该论文试图解决当前基于大型语言模型(Large Language Models, LLMs)的智能体系统在多智能体协作中存在的根本性局限问题,特别是自然语言作为通信媒介与高维向量空间操作之间的语义不对齐导致的信息损失和行为漂移。论文指出,现有LLMs并非为支持代理行为而设计,缺乏对角色连续性、任务边界和多智能体依赖性的建模机制。解决方案的关键在于重新思考智能体的通信方式及模型构建范式,提出应探索一种从底层构建的新模型架构,以原生支持结构化通信、共享意图和任务对齐,从而实现更稳健、可扩展的多角色、多智能体协作。

链接: https://arxiv.org/abs/2506.02739
作者: Pengcheng Zhou,Yinglun Feng,Halimulati Julaiti,Zhongliang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become foundational to modern AI agent systems, enabling autonomous agents to reason and plan. In most existing systems, inter-agent communication relies primarily on natural language. While this design supports interpretability and human oversight, we argue that it introduces fundamental limitations in agent-to-agent coordination. The semantic space of natural language is structurally misaligned with the high-dimensional vector spaces in which LLMs operate, resulting in information loss and behavioral drift. Beyond surface-level inefficiencies, we highlight a deeper architectural limitation: current LLMs were not trained with the objective of supporting agentic behavior. As such, they lack mechanisms for modeling role continuity, task boundaries, and multi-agent dependencies. The standard next-token prediction paradigm fails to support the structural alignment required for robust, scalable agent coordination. Based on this, we argue that two core questions deserve careful examination: first, given that AI agents fundamentally operate in high-dimensional vector spaces, should they rely on a language system originally designed for human cognition as their communication medium? Second, should we consider developing a new model construction paradigm that builds models from the ground up to natively support structured communication, shared intentionality, and task alignment in multi-role, multi-agent environments? This paper calls for a reconsideration not only of how agents should communicate, but also of what it fundamentally means to train a model that natively supports multi-agent coordination and communication.
zh

[AI-33] Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在实际应用中因知识截止和单次推理生成不可控、不准确输出而面临的部署难题,以及多智能体系统(Multi-Agent Systems, MAS)优化过程中传统方法如提示工程和监督微调存在的高工程开销与适应性不足问题。其解决方案的关键在于提出一种无需Critic网络的多智能体异质组策略优化(Multi-Agent Heterogeneous Group Policy Optimization, MHGPO)算法,通过估计异质组内轨迹的相对奖励优势来指导策略更新,从而提升系统稳定性并降低计算开销。

链接: https://arxiv.org/abs/2506.02718
作者: Guanzhong Chen,Shaoxiong Yang,Chao Li,Wei Liu,Jian Luan,Zenglin Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks, yet their deployment in real-world applications is hindered by fixed knowledge cutoffs and difficulties in generating controllable, accurate outputs in a single inference. Multi-agent systems (MAS) built from specialized LLM agents offer a promising solution, enabling dynamic collaboration and iterative reasoning. However, optimizing these systems remains a challenge, as conventional methods such as prompt engineering and supervised fine-tuning entail high engineering overhead and limited adaptability. Reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), provides a scalable framework by refining agent policies based on system-level feedback. Nevertheless, existing MARL algorithms, such as Multi-Agent Proximal Policy Optimization (MAPPO), rely on Critic networks, which can cause training instability and increase computational burden. To address these limitations and target the prototypical Multi-Agent Search System (MASS), we propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), a novel Critic-free algorithm that guides policy updates by estimating relative reward advantages across heterogeneous groups of rollouts. MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead. Additionally, we introduce three group rollout sampling strategies that trade off between efficiency and effectiveness. Experiments on a multi-agent LLM-based search system demonstrate that MHGPO consistently outperforms MAPPO in both task performance and computational efficiency, without requiring warm-up, underscoring its potential for stable and scalable optimization of complex LLM-based MAS.
zh
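
MHGPO的"组相对奖励优势"与GRPO同源:不训练Critic,而是用同组rollout的统计量做基线。下面是一个极简示意(假设性实现,组划分与奖励数值均为虚构):

```python
import numpy as np

def group_relative_advantage(rewards_by_group):
    advantages = {}
    for group, rewards in rewards_by_group.items():
        r = np.asarray(rewards, dtype=float)
        advantages[group] = (r - r.mean()) / (r.std() + 1e-8)  # 组内标准化,无需 Critic
    return advantages

rollouts = {  # 同一查询下,不同智能体组合(异质组)的多次采样奖励
    "retriever+reader": [0.8, 0.2, 0.6, 0.4],
    "reader_only":      [0.1, 0.3, 0.2, 0.0],
}
print(group_relative_advantage(rollouts))
```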

[AI-34] Open-Set Living Need Prediction with Large Language Models ACL2025

【速读】:该论文试图解决生活需求预测的准确性问题,传统方法将其视为封闭集分类问题,难以捕捉生活需求的多样性和复杂性。解决方案的关键在于将生活需求预测重新定义为开放集分类问题,并提出PIGEON系统,该系统利用大语言模型(Large Language Models, LLMs)实现无限制的需求预测,通过行为感知的记录检索器和马斯洛需求层次理论提升预测的合理性和针对性。

链接: https://arxiv.org/abs/2506.02713
作者: Xiaochong Lan,Jie Feng,Yizhou Sun,Chen Gao,Jiahuan Lei,Xinlei Shi,Hengliang Luo,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Living needs are the needs people generate in their daily lives for survival and well-being. On life service platforms like Meituan, user purchases are driven by living needs, making accurate living need predictions crucial for personalized service recommendations. Traditional approaches treat this prediction as a closed-set classification problem, severely limiting their ability to capture the diversity and complexity of living needs. In this work, we redefine living need prediction as an open-set classification problem and propose PIGEON, a novel system leveraging large language models (LLMs) for unrestricted need prediction. PIGEON first employs a behavior-aware record retriever to help LLMs understand user preferences, then incorporates Maslow’s hierarchy of needs to align predictions with human living needs. For evaluation and application, we design a recall module based on a fine-tuned text embedding model that links flexible need descriptions to appropriate life services. Extensive experiments on real-world datasets demonstrate that PIGEON significantly outperforms closed-set approaches on need-based life service recall by an average of 19.37%. Human evaluation validates the reasonableness and specificity of our predictions. Additionally, we employ instruction tuning to enable smaller LLMs to achieve competitive performance, supporting practical deployment.
zh

[AI-35] Data Leakage and Deceptive Performance: A Critical Examination of Credit Card Fraud Detection Methodologies

【速读】:该论文试图解决信用卡欺诈检测研究中方法论严谨性不足的问题,揭示了基础评估缺陷如何掩盖算法复杂性。其解决方案的关键在于识别并纠正当前方法中的四个核心问题:(1)由于预处理顺序不当导致的广泛数据泄露,(2)方法报告中的故意模糊性,(3)交易数据的时间验证不足,(4)通过牺牲精确率来优化召回率的指标操控。研究通过案例分析表明,即使存在基本评估缺陷,简单的模型也能表现出看似优异的性能,强调了正确评估方法在欺诈检测研究中的重要性应高于模型复杂度。

链接: https://arxiv.org/abs/2506.02703
作者: Khizar Hayat,Baptiste Magnier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This study critically examines the methodological rigor in credit card fraud detection research, revealing how fundamental evaluation flaws can overshadow algorithmic sophistication. Through deliberate experimentation with improper evaluation protocols, we demonstrate that even simple models can achieve deceptively impressive results when basic methodological principles are violated. Our analysis identifies four critical issues plaguing current approaches: (1) pervasive data leakage from improper preprocessing sequences, (2) intentional vagueness in methodological reporting, (3) inadequate temporal validation for transaction data, and (4) metric manipulation through recall optimization at precision’s expense. We present a case study showing how a minimal neural network architecture with data leakage outperforms many sophisticated methods reported in literature, achieving 99.9% recall despite fundamental evaluation flaws. These findings underscore that proper evaluation methodology matters more than model complexity in fraud detection research. The study serves as a cautionary example of how methodological rigor must precede architectural sophistication, with implications for improving research practices across machine learning applications.
zh
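
摘要批评的第一类问题(预处理顺序不当导致的数据泄露)可以用几行scikit-learn代码直观对比(示意代码,数据为随机占位):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 8), np.random.randint(0, 2, 1000)

# 错误做法:先在全量数据上拟合缩放器再划分,测试集的统计量泄露进了训练
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, test_size=0.2)

# 正确做法:先划分,仅用训练集拟合缩放器,再变换测试集
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
# 对交易数据,还应按时间先后切分(而非随机切分),避免用"未来"预测"过去"
```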

[AI-36] Shaking to Reveal: Perturbation-Based Detection of LLM Hallucinations

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在实际问答任务中因幻觉(Hallucination)导致的可靠性问题。现有解决方案——自我评估(Self-assessment)依赖模型输出的置信度来估计答案的事实准确性,但该方法假设模型输出分布与真实数据分布接近,这一假设在实际中可能不成立。论文提出的解决方案关键在于引入一种新的框架——样本特定提示(Sample-Specific Prompting, SSP),通过分析中间表示的扰动敏感性来改进自我评估。该方法利用较少受模型偏差影响的中间表示,动态生成噪声提示并使用轻量编码器放大扰动带来的表示变化,再通过对比距离度量区分真实与幻觉回答,从而提升自我评估的可靠性。

链接: https://arxiv.org/abs/2506.02696
作者: Jinyuan Luo,Zhen Fang,Yixuan Li,Seongheon Park,Ling Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucination remains a key obstacle to the reliable deployment of large language models (LLMs) in real-world question answering tasks. A widely adopted strategy to detect hallucination, known as self-assessment, relies on the model’s own output confidence to estimate the factual accuracy of its answers. However, this strategy assumes that the model’s output distribution closely reflects the true data distribution, which may not always hold in practice. As bias accumulates through the model’s layers, the final output can diverge from the underlying reasoning process, making output-level confidence an unreliable signal for hallucination detection. In this work, we propose Sample-Specific Prompting (SSP), a new framework that improves self-assessment by analyzing perturbation sensitivity at intermediate representations. These representations, being less influenced by model bias, offer a more faithful view of the model’s latent reasoning process. Specifically, SSP dynamically generates noise prompts for each input and employs a lightweight encoder to amplify the changes in representations caused by the perturbation. A contrastive distance metric is then used to quantify these differences and separate truthful from hallucinated responses. By leveraging the dynamic behavior of intermediate representations under perturbation, SSP enables more reliable self-assessment. Extensive experiments demonstrate that SSP significantly outperforms prior methods across a range of hallucination detection benchmarks.
zh

[AI-37] XicorAttention: Time Series Transformer Using Attention with Nonlinear Correlation

【速读】:该论文试图解决现有基于Transformer的时序预测模型中,注意力机制无法充分捕捉时间序列数据中固有的非线性依赖关系的问题。其解决方案的关键在于提出一种基于Chatterjee’s rank correlation coefficient(Xicor)的新型注意力机制,通过将标准注意力机制中的矩阵乘法替换为该相关系数来衡量查询与键之间的非线性关系,并引入可微分近似方法SoftSort和SoftRank以实现端到端训练。

链接: https://arxiv.org/abs/2506.02694
作者: Daichi Kimura,Tomonori Izumitani,Hisashi Kashima
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Various Transformer-based models have been proposed for time series forecasting. These models leverage the self-attention mechanism to capture long-term temporal or variate dependencies in sequences. Existing methods can be divided into two approaches: (1) reducing computational cost of attention by making the calculations sparse, and (2) reshaping the input data to aggregate temporal features. However, existing attention mechanisms may not adequately capture inherent nonlinear dependencies present in time series data, leaving room for improvement. In this study, we propose a novel attention mechanism based on Chatterjee's rank correlation coefficient, which measures nonlinear dependencies between variables. Specifically, we replace the matrix multiplication in standard attention mechanisms with this rank coefficient to measure the query-key relationship. Since computing Chatterjee's correlation coefficient involves sorting and ranking operations, we introduce a differentiable approximation employing SoftSort and SoftRank. Our proposed mechanism, "XicorAttention," integrates it into several state-of-the-art Transformer models. Experimental results on real-world datasets demonstrate that incorporating nonlinear correlation into the attention improves forecasting accuracy by up to approximately 9.1% compared to existing models.
zh
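
Chatterjee的ξ系数本身只有几行(下面是无并列值情形的精确实现;论文中为了可端到端训练,用SoftSort/SoftRank做了可微近似,此处从略):

```python
import numpy as np

def xicor(x, y):
    """Chatterjee 秩相关系数 ξ(假设 y 无并列值)。"""
    n = len(x)
    order = np.argsort(x)                          # 按 x 升序排列
    r = np.argsort(np.argsort(y[order])) + 1       # 排序后 y 的秩
    return 1 - 3 * np.sum(np.abs(np.diff(r))) / (n ** 2 - 1)

x = np.linspace(-3, 3, 500)
print(xicor(x, x ** 2))                # 接近 1:y 是 x 的函数(皮尔逊相关却约为 0)
print(xicor(x, np.random.randn(500)))  # 接近 0:相互独立
```

与对称的皮尔逊相关不同,ξ(x, y)度量的是"y对x的函数依赖",这种非对称性与注意力中query对key的打分方向相呼应。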

[AI-38] FAuNO: Semi-Asynchronous Federated Reinforcement Learning Framework for Task Offloading in Edge Systems

【速读】:该论文旨在解决边缘计算环境中由于设备网络数据需求增长而导致的传统集中式编排方法面临的延迟和资源瓶颈问题。其解决方案的关键在于提出一种名为FAuNO(Federated Asynchronous Network Orchestrator)的缓冲异步联邦强化学习(FRL)框架,该框架采用actor-critic架构,其中本地actor学习节点特定动态和同伴交互,而联邦critic则跨智能体聚合经验以促进高效协作并提升整体系统性能。

链接: https://arxiv.org/abs/2506.02668
作者: Frederico Metelo,Alexandre Oliveira,Stevo Racković,Pedro Ákos Costa,Cláudia Soares
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Edge computing addresses the growing data demands of connected-device networks by placing computational resources closer to end users through decentralized infrastructures. This decentralization challenges traditional, fully centralized orchestration, which suffers from latency and resource bottlenecks. We present FAuNO – Federated Asynchronous Network Orchestrator – a buffered, asynchronous federated reinforcement-learning (FRL) framework for decentralized task offloading in edge systems. FAuNO adopts an actor-critic architecture in which local actors learn node-specific dynamics and peer interactions, while a federated critic aggregates experience across agents to encourage efficient cooperation and improve overall system performance. Experiments in the PeersimGym environment show that FAuNO consistently matches or exceeds heuristic and federated multi-agent RL baselines in reducing task loss and latency, underscoring its adaptability to dynamic edge-computing scenarios.
zh

[AI-39] A Pretrained Probabilistic Transformer for City-Scale Traffic Volume Prediction

【速读】:该论文旨在解决城市尺度交通流量预测中的数据不完整性和偏差问题,以及现有深度学习方法在处理未观测交通流不确定性方面的不足。同时,针对当前模型城市特定训练导致的泛化能力差和可扩展性受限的问题,提出了一种新的解决方案。其关键在于引入TrafficPPT,一种预训练的概率Transformer模型,通过将交通流量建模为轨迹的分布聚合,融合异构数据源(包括实时观测、历史轨迹数据和道路网络拓扑),实现鲁棒且具备不确定性感知的交通推断,并通过大规模模拟数据预训练和目标城市微调,提升模型在不同城市场景下的适应能力。

链接: https://arxiv.org/abs/2506.02654
作者: Shiyu Shen,Bin Pan,Guirong Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:City-scale traffic volume prediction plays a pivotal role in intelligent transportation systems, yet remains a challenge due to the inherent incompleteness and bias in observational data. Although deep learning-based methods have shown considerable promise, most existing approaches produce deterministic point estimates, thereby neglecting the uncertainty arising from unobserved traffic flows. Furthermore, current models are typically trained in a city-specific manner, which hinders their generalizability and limits scalability across diverse urban contexts. To overcome these limitations, we introduce TrafficPPT, a Pretrained Probabilistic Transformer designed to model traffic volume as a distributional aggregation of trajectories. Our framework fuses heterogeneous data sources-including real-time observations, historical trajectory data, and road network topology-enabling robust and uncertainty-aware traffic inference. TrafficPPT is initially pretrained on large-scale simulated data spanning multiple urban scenarios, and later fine-tuned on target cities to ensure effective domain adaptation. Experiments on real-world datasets show that TrafficPPT consistently surpasses state-of-the-art baselines, particularly under conditions of extreme data sparsity. Code will be open.
zh

[AI-40] From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV

【速读】:该论文试图解决公共安全无人机在应急响应中面临的问题,包括深度强化学习(Deep Reinforcement Learning, DRL)的高训练复杂性、低样本效率以及仿真到现实的差距,这些限制了其在实际场景中的应用。解决方案的关键在于将大语言模型(Large Language Models, LLM)与上下文学习(In-Context Learning, ICL)相结合,利用LLM强大的推理和泛化能力,通过自然语言提示和示例引导实现任务适应,无需重新训练,从而提升无人机在路径规划和速度控制等关键功能上的自主性和响应能力。

链接: https://arxiv.org/abs/2506.02649
作者: Yousef Emami,Hao Zhou,Miguel Gutierrez Gaitan,Kai Li,Luis Almeida,Zhu Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:A public safety Unmanned Aerial Vehicle (UAV) enhances situational awareness in emergency response. Its agility and ability to optimize mobility and establish Line-of-Sight (LoS) communication make it increasingly vital for managing emergencies such as disaster response, search and rescue, and wildfire monitoring. While Deep Reinforcement Learning (DRL) has been applied to optimize UAV navigation and control, its high training complexity, low sample efficiency, and simulation-to-reality gap limit its practicality in public safety. Recent advances in Large Language Models (LLMs) offer a compelling alternative. With strong reasoning and generalization capabilities, LLMs can adapt to new tasks through In-Context Learning (ICL), which enables task adaptation via natural language prompts and example-based guidance, without retraining. Deploying LLMs at the network edge, rather than in the cloud, further reduces latency and preserves data privacy, thereby making them suitable for real-time, mission-critical public safety UAVs. This paper proposes the integration of LLM-enabled ICL with public safety UAV to address the key functions, such as path planning and velocity control, in the context of emergency response. We present a case study on data collection scheduling where the LLM-enabled ICL framework can significantly reduce packet loss compared to conventional approaches, while also mitigating potential jailbreaking vulnerabilities. Finally, we discuss LLM optimizers and specify future research directions. The ICL framework enables adaptive, context-aware decision-making for public safety UAV, thus offering a lightweight and efficient solution for enhancing UAV autonomy and responsiveness in emergencies.
zh

[AI-41] Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在真正类人流体智能(fluid intelligence)方面存在的能力差距问题,即模型是否具备在新情境中抽象推理和规则泛化的实际能力。现有推理基准要么侧重领域特定知识(结晶智力),要么缺乏可解释性。其解决方案的关键在于提出DRE-Bench,这是一个基于分层认知框架的动态推理评估基准,包含36个抽象推理任务,分布在四个认知层级,每个任务设有多个动态变体以测试相同的潜在规则,从而实现对流体智能的细粒度、可解释且可靠的评估。

链接: https://arxiv.org/abs/2506.02648
作者: Yue Yang,MingKang Chen,Qihua Liu,Mengkang Hu,Qiguang Chen,Gengrui Zhang,Shuyue Hu,Guangtao Zhai,Yu Qiao,Yu Wang,Wenqi Shao,Ping Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.
zh

[AI-42] KV Cache in the Wild: Characterizing and Optimizing KV Cache at a Large Cloud Provider ATC'25

【速读】:该论文试图解决大规模语言模型(Large Language Model, LLM)服务中缓存中间结果(KV缓存)的优化问题,旨在提升服务吞吐量并降低延迟。其关键在于系统地分析来自领先LLM服务提供商的KV缓存工作负载模式,揭示实际应用场景下KV缓存的复用特性,如复用在请求间分布不均、特定请求类别的复用模式具有可预测性等,并据此提出一种面向工作负载的缓存淘汰策略,从而在有限的缓存容量下提升实际场景中的服务性能。

链接: https://arxiv.org/abs/2506.02634
作者: Jiahao Wang,Jinbo Han,Xingda Wei,Sijie Shen,Dingyan Zhang,Chenguang Fang,Rong Chen,Wenyuan Yu,Haibo Chen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted by USENIX ATC’25

点击查看摘要

Abstract:Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV cache) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV cache workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV cache reuses are skewed across requests, where reuses between single-turn requests are equally important as multi-turn requests; the reuse time and probability are diverse considering all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.
zh
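
基于"复用模式按请求类别可预测"这一观察,一个面向工作负载的淘汰策略可以示意如下(假设性实现:类别复用概率为虚构数值,非论文原始算法):

```python
REUSE_PROB = {"multi_turn": 0.8, "single_turn": 0.5, "batch_job": 0.1}  # 假设的类别复用概率

class WorkloadAwareCache:
    def __init__(self, capacity):
        self.capacity, self.clock, self.entries = capacity, 0, {}  # key -> (类别, 最近访问时刻)

    def access(self, key, category):
        self.clock += 1
        if key not in self.entries and len(self.entries) >= self.capacity:
            # 淘汰"复用概率 × 新近度"得分最低的条目
            victim = min(self.entries, key=lambda k:
                         REUSE_PROB[self.entries[k][0]] * self.entries[k][1])
            del self.entries[victim]
        self.entries[key] = (category, self.clock)

cache = WorkloadAwareCache(capacity=2)
cache.access("ctx1", "multi_turn")
cache.access("ctx2", "batch_job")
cache.access("ctx3", "single_turn")   # 触发淘汰:低复用概率的 ctx2 先被逐出
print(cache.entries.keys())
```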

[AI-43] HGOT: Self-supervised Heterogeneous Graph Neural Network with Optimal Transport ICML2025

【速读】:该论文旨在解决在无标签情况下,基于对比自监督策略的异构图神经网络(Heterogeneous Graph Neural Networks, HGNNs)需要依赖精心设计的图增强策略以及正负样本选择的问题。传统方法在确定样本对之间的精确相似性时面临挑战,而该论文提出的新型自监督异构图神经网络方法——最优传输异构图神经网络(HGOT),其关键在于引入最优传输机制,以减轻正负样本采样的繁琐过程。HGOT通过设计一个聚合视图(中心视图)来整合由不同元路径(分支视图)表示的语义信息,并引入最优传输计划来识别分支视图与中心视图之间的语义传输关系,从而实现图表示的对齐,提升节点表示的质量和相似性。

链接: https://arxiv.org/abs/2506.02619
作者: Yanbei Liu,Chongxu Wang,Zhitao Xiao,Lei Geng,Yanwei Pang,Xiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The paper has 9 pages of text and 13 pages in total (including acknowledgments, impact statement, references, and appendix), with 6 figures and 2 tables. This paper has been accepted by ICML 2025 conference and this is a final version of the manuscript submitted to the conference

点击查看摘要

Abstract:Heterogeneous Graph Neural Networks (HGNNs), have demonstrated excellent capabilities in processing heterogeneous information networks. Self-supervised learning on heterogeneous graphs, especially contrastive self-supervised strategy, shows great potential when there are no labels. However, this approach requires the use of carefully designed graph augmentation strategies and the selection of positive and negative samples. Determining the exact level of similarity between sample pairs is difficult. To solve this problem, we propose a novel self-supervised Heterogeneous graph neural network with Optimal Transport (HGOT) method which is designed to facilitate self-supervised learning for heterogeneous graphs without graph augmentation strategies. Different from traditional contrastive self-supervised learning, HGOT employs the optimal transport mechanism to relieve the laborious sampling process of positive and negative samples. Specifically, we design an aggregating view (central view) to integrate the semantic information contained in the views represented by different meta-paths (branch views). Then, we introduce an optimal transport plan to identify the transport relationship between the semantics contained in the branch view and the central view. This allows the optimal transport plan between graphs to align with the representations, forcing the encoder to learn node representations that are more similar to the graph space and of higher quality. Extensive experiments on four real-world datasets demonstrate that our proposed HGOT model can achieve state-of-the-art performance on various downstream tasks. In particular, in the node classification task, HGOT achieves an average of more than 6% improvement in accuracy compared with state-of-the-art methods.
zh
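
分支视图与中心视图之间的最优传输计划可用熵正则OT的Sinkhorn迭代求解,示意如下(假设性实现:用随机向量代替真实节点表示,并取均匀边缘分布):

```python
import numpy as np

def sinkhorn(C, eps=0.1, iters=200):
    C = C / C.max()                       # 归一化代价,避免数值下溢
    K = np.exp(-C / eps)
    a = np.full(C.shape[0], 1 / C.shape[0])
    b = np.full(C.shape[1], 1 / C.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # 传输计划 P

branch = np.random.randn(8, 16)    # 某条元路径(分支视图)下的节点表示
central = np.random.randn(8, 16)   # 聚合(中心)视图下的节点表示
C = np.linalg.norm(branch[:, None] - central[None, :], axis=-1)  # 成对代价矩阵
P = sinkhorn(C)                    # 以 P 对齐两个视图的表示
print(P.sum())                     # ≈ 1,且行/列边缘近似均匀
```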

[AI-44] Simple, Good, Fast: Self-Supervised World Models Free of Baggage ICLR2025

【速读】:该论文试图解决世界模型(world model)中关键组件的定义问题,以及在不使用循环神经网络(RNN)、Transformer、离散表示和图像重建的情况下,世界模型能达到的性能水平。其解决方案的关键在于提出SGF,一个简单、有效且快速的世界模型,它通过自监督表示学习获取特征,利用帧和动作堆叠捕捉短期依赖关系,并通过数据增强提高对模型误差的鲁棒性。

链接: https://arxiv.org/abs/2506.02612
作者: Jan Robine,Marc Höftmann,Stefan Harmeling
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Published as a conference paper at ICLR 2025. Code is available at this https URL

点击查看摘要

Abstract:What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF’s connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark.
zh
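
SGF用帧与动作堆叠代替RNN/Transformer来捕捉短时依赖,这一构件本身非常简单(示意代码,帧尺寸与动作编码为假设):

```python
import numpy as np
from collections import deque

class FrameActionStacker:
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)    # 只保留最近 k 帧
        self.actions = deque(maxlen=k)

    def push(self, frame, action):
        self.frames.append(frame)
        self.actions.append(action)

    def state(self):
        # 最近 k 帧沿通道维拼接,动作拼成向量,一并作为世界模型输入
        return np.concatenate(list(self.frames), axis=0), np.array(self.actions)

stk = FrameActionStacker(k=4)
for t in range(4):
    stk.push(np.zeros((3, 64, 64)), t % 2)   # 假设 3×64×64 的观测帧与离散动作
obs, acts = stk.state()
print(obs.shape, acts.shape)   # (12, 64, 64) (4,)
```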

[AI-45] Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm

【速读】:该论文旨在解决说话人日志(speaker diarization)中传统基于聚类的方法在处理说话人嵌入的复杂分布和重叠语音段时存在的局限性。其解决方案的关键在于提出一种基于图注意力网络和标签传播算法的重叠社区检测方法(OCDGALP),该方法包含两个核心组件:一是通过聚合邻近节点信息来优化说话人嵌入和节点连接的图注意力网络;二是利用标签传播算法为每个节点分配多个社区标签,从而实现同时聚类和重叠社区检测。

链接: https://arxiv.org/abs/2506.02610
作者: Zhaoyang Li,Jie Wang,XiaoXiao Li,Wangjie Li,Longjie Luo,Lin Li,Qingyang Hong
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In speaker diarization, traditional clustering-based methods remain widely used in real-world applications. However, these methods struggle with the complex distribution of speaker embeddings and overlapping speech segments. To address these limitations, we propose an Overlapping Community Detection method based on Graph Attention networks and the Label Propagation Algorithm (OCDGALP). The proposed framework comprises two key components: (1) a graph attention network that refines speaker embeddings and node connections by aggregating information from neighboring nodes, and (2) a label propagation algorithm that assigns multiple community labels to each node, enabling simultaneous clustering and overlapping community detection. Experimental results show that the proposed method significantly reduces the Diarization Error Rate (DER), achieving a state-of-the-art 15.94% DER on the DIHARD-III dataset without oracle Voice Activity Detection (VAD), and an impressive 11.07% with oracle VAD.
zh
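
标签传播允许一个节点保留多个社区标签,这正对应重叠语音段。下面是一个玩具示意(假设性实现:图结构、种子标签与阈值均为虚构,且未包含论文中的图注意力细化步骤):

```python
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)   # 说话人嵌入相似度图(二值化)
P = A / A.sum(axis=1, keepdims=True)            # 行归一化转移矩阵

seeds = np.zeros((5, 2))
seeds[0, 0] = 1.0    # 种子:节点0属于社区A(说话人甲)
seeds[4, 1] = 1.0    # 种子:节点4属于社区B(说话人乙)

F = seeds.copy()
for _ in range(50):
    F = 0.8 * P @ F + 0.2 * seeds               # 带回拉项的标签传播,保证收敛

memberships = [np.where(row > 0.3 * row.max())[0].tolist() for row in F]
print(memberships)   # 桥接节点可能同时保留两个标签,即重叠社区
```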

[AI-46] A Time-Enhanced Data Disentanglement Network for Traffic Flow Forecasting

【速读】:该论文旨在解决交通流预测中由于交通数据的时间变化性和动态空间相关性所带来的挑战,特别是在处理多种交通流模式的复杂数据依赖关系以及对时间信息高度敏感的问题。其解决方案的关键在于提出一种名为时间增强的数据解耦网络(A Time-Enhanced Data Disentanglement Network, TEDDN)的方法,该方法通过动态图结合时间特征提取模块,灵活学习时间和节点信息,从而有效地解耦和提取复杂的交通信息。

链接: https://arxiv.org/abs/2506.02609
作者: Tianfan Jiang,Mei Wu,Wenchao Weng,Dewen Seng,Yiqian Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, traffic flow prediction has become a highlight in the field of intelligent transportation systems. However, due to the temporal variations and dynamic spatial correlations of traffic data, traffic prediction remains highly challenging. Existing spatiotemporal networks, which rely on end-to-end training, often struggle to handle the diverse data dependencies of multiple traffic flow patterns. Additionally, traffic flow variations are highly sensitive to temporal information changes. Regrettably, other researchers have not sufficiently recognized the importance of temporal information. To address these challenges, we propose a novel approach called A Time-Enhanced Data Disentanglement Network for Traffic Flow Forecasting (TEDDN). This network disentangles the originally complex and intertwined traffic data into stable patterns and trends. By flexibly learning temporal and node information through a dynamic graph enhanced by a temporal feature extraction module, TEDDN demonstrates significant efficacy in disentangling and extracting complex traffic information. Experimental evaluations and ablation studies on four real-world datasets validate the superiority of our method.
zh

[AI-47] Multi Layered Autonomy and AI Ecologies in Robotic Art Installations

【速读】:该论文探讨了在人工智能媒介化未来中,机器代理与艺术创作权之间的张力问题,具体表现为谁应承担由此产生的责任。解决方案的关键在于构建一个三层信仰系统,包括微观层面的自适应策略、中观层面的叙事驱动以及宏观层面的首要指令,通过这一层级结构使机器人行为能够有机地响应环境线索和观众互动,从而将观众转化为共同创作者,重新定义艺术中的代理权、作者身份和伦理问题。

链接: https://arxiv.org/abs/2506.02606
作者: Baoyang Chen,Xian Xu,Huamin Qu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Symbiosis of Agents is a large-scale installation by Baoyang Chen that embeds AI-driven robots in an immersive, mirror-lined arena, probing the tension between machine agency and artistic authorship. Drawing on early cybernetics, rule-based conceptual art, and seminal robotic works, it orchestrates fluid exchanges among robotic arms, quadruped machines, their environment, and the public. A three-tier faith system pilots the ecology: micro-level adaptive tactics, meso-level narrative drives, and a macro-level prime directive. This hierarchy lets behaviors evolve organically in response to environmental cues and even a viewer's breath, turning spectators into co-authors of the unfolding narrative. Inspired by a speculative terraforming scenario that recalls the historical exploitation of marginalized labor, the piece asks who bears responsibility in AI-mediated futures. Choreographed motion, AI-generated scripts, reactive lighting, and drifting fog cast the robots as collaborators rather than tools, forging a living, emergent artwork. Exhibited internationally, Symbiosis of Agents shows how cybernetic feedback, robotic experimentation, and conceptual rule-making can converge to redefine agency, authorship, and ethics in contemporary art.
zh

[AI-48] EALG: Evolutionary Adversarial Generation of Language Model-Guided Generators for Combinatorial Optimization

【速读】:该论文试图解决组合优化求解器评估与提升中实例生成与求解器设计分离的问题,旨在通过自动化协同进化机制生成更具挑战性的实例并同步合成适应性启发式算法。解决方案的关键在于提出EALG(Evolutionary Adversarial Generation of Language Model-Guided Generators)框架,该框架利用基于变异的对抗方法动态演化实例生成过程,并通过与大语言模型(LLM)的交互合成自适应算法,实现了从实例生成到求解器合成的端到端流程。

链接: https://arxiv.org/abs/2506.02594
作者: Ruibo Duan,Yuxin Liu,Xinyao Dong,Chenglin Fan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating challenging instances is crucial for the evaluation and advancement of combinatorial optimization solvers. In this work, we introduce EALG (Evolutionary Adversarial Generation of Language Model-Guided Generators), a novel framework that automates the co-evolution of optimization problem instances and their corresponding heuristic solvers using large language models (LLMs). EALG leverages a mutation-based adversarial approach that dynamically evolves instance generation procedures to create increasingly difficult problems, while simultaneously synthesizing adaptive heuristic algorithms through interactions with LLMs guided by algorithmic structure. Unlike existing approaches that focus solely on static benchmark creation or manual solver design, EALG provides a seamless pipeline from instance generation to solver synthesis. Experimental results demonstrate that EALG generates significantly harder instances than current benchmarks, and its synthesized solvers generalize effectively across a broad spectrum of combinatorial tasks. This work explores a new paradigm for combinatorial optimization that integrates instance generation with solver design, resulting in state-of-the-art performance.
zh

[AI-49] V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving

【速读】:该论文旨在解决知识驱动的自动驾驶系统(ADs)面临的两个关键问题:由于单车辆传感器的短视性导致的感知能力受限,以及因缺乏实时环境支撑而产生的幻觉现象。其解决方案的关键在于提出V2X-UniPool框架,该框架通过整合多模态车联网(V2X)数据构建一个基于时间索引和语言的知识库,并采用双查询检索增强生成(RAG)机制,实现对静态环境与动态交通场景的准确、时序一致的推理。

链接: https://arxiv.org/abs/2506.02580
作者: Xuewen Luo,Fengze Yang,Fan Ding,Xiangbo Gao,Shuo Xing,Yang Zhou,Zhengzhong Tu,Chenxi Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge-driven autonomous driving systems(ADs) offer powerful reasoning capabilities, but face two critical challenges: limited perception due to the short-sightedness of single-vehicle sensors, and hallucination arising from the lack of real-time environmental grounding. To address these issues, this paper introduces V2X-UniPool, a unified framework that integrates multimodal Vehicle-to-Everything (V2X) data into a time-indexed and language-based knowledge pool. By leveraging a dual-query Retrieval-Augmented Generation (RAG) mechanism, which enables retrieval of both static and dynamic knowledge, our system enables ADs to perform accurate, temporally consistent reasoning over both static environment and dynamic traffic context. Experiments on a real-world cooperative driving dataset demonstrate that V2X-UniPool significantly enhances motion planning accuracy and reasoning capability. Remarkably, it enables even zero-shot vehicle-side models to achieve state-of-the-art performance by leveraging V2X-UniPool, while simultaneously reducing transmission cost by over 99.9% compared to prior V2X methods.

[AI-50] ADFormer: Aggregation Differential Transformer for Passenger Demand Forecasting IJCAI-2025

【Quick Read】: This paper aims to solve the problem that existing attention-based methods for spatio-temporal data rely on heuristic masking strategies that cannot fully adapt to complex spatio-temporal correlations, preventing the model from focusing on the right context, while also overlooking the high-level correlations present in the real world. The key to the solution is the proposed Aggregation Differential Transformer (ADFormer), which captures the original spatial correlations and denoises attention via Differential Attention, designs aggregation strategies based on the nature of space and time, and unifies the original correlations with the high-level ones, improving the model's ability to capture holistic spatio-temporal relations.

Link: https://arxiv.org/abs/2506.02576
Authors: Haichen Wang, Liu Yang, Xinyuan Zhang, Haomin Yu, Ming Li, Jilin Hu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 3 tables. IJCAI-2025

Click to view abstract

Abstract:Passenger demand forecasting helps optimize vehicle scheduling, thereby improving urban efficiency. Recently, attention-based methods have been used to adequately capture the dynamic nature of spatio-temporal data. However, existing methods that rely on heuristic masking strategies cannot fully adapt to the complex spatio-temporal correlations, hindering the model from focusing on the right context. These works also overlook the high-level correlations that exist in the real world. Effectively integrating these high-level correlations with the original correlations is crucial. To fill this gap, we propose the Aggregation Differential Transformer (ADFormer), which offers new insights for improving demand forecasting. Specifically, we utilize Differential Attention to capture the original spatial correlations and achieve attention denoising. Meanwhile, we design distinct aggregation strategies based on the nature of space and time. Then, the original correlations are unified with the high-level correlations, enabling the model to capture holistic spatio-temporal relations. Experiments conducted on taxi and bike datasets confirm the effectiveness and efficiency of our model, demonstrating its practical value. The code is available at this https URL.
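Differential Attention here presumably follows the generic two-map formulation, subtracting a scaled second attention map to cancel common-mode attention noise. A minimal NumPy sketch of that generic formulation (not ADFormer's released code, and ignoring its aggregation strategies) looks like this:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Two attention maps are computed and subtracted; the difference
    cancels shared noise so attention concentrates on the true context."""
    d = Wq1.shape[1]
    A1 = softmax(X @ Wq1 @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax(X @ Wq2 @ (X @ Wk2).T / np.sqrt(d))
    return (A1 - lam * A2) @ (X @ Wv)

rng = np.random.default_rng(0)
n, d = 8, 16  # e.g., 8 spatial regions, each with a 16-dim feature vector
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(5)]
out = differential_attention(X, *Ws)
print(out.shape)  # (8, 16)
```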

[AI-51] HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference ACL2025

【Quick Read】: This paper aims to remove the efficiency bottleneck of the attention module in large language model (LLM) inference while preserving model accuracy. Existing top-k attention methods struggle to balance efficiency and accuracy, whereas the proposed HATA (Hash-Aware Top-k Attention) maps queries and keys into binary hash codes and obtains the relative qk-score order at very low cost, which suffices for efficient top-k attention. HATA's key innovation is systematically integrating low-overhead learning-to-hash techniques into the top-k attention process, yielding significant inference speedups over existing methods.

Link: https://arxiv.org/abs/2506.02572
Authors: Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ACL 2025 findings

Click to view abstract

Abstract:Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-k attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggle to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-k Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the top-k attention process. Unlike existing top-k attention methods, which seek an absolute estimate of the qk score, typically at great cost, HATA maps queries and keys into binary hash codes and acquires the relative qk score order at very low cost, which is sufficient for realizing top-k attention. Extensive experiments demonstrate that HATA achieves up to 7.2× speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-k attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at this https URL.
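The core trick, replacing exact qk scoring with binary codes whose Hamming distances preserve the relative score order, can be sketched as follows. HATA learns its hash functions; the random projection `P` below is only a stand-in, so treat this as an illustration of the idea rather than the paper's method.

```python
import numpy as np

def hash_codes(M, P):
    """Sign random projection: float vectors -> binary codes (0/1)."""
    return (M @ P > 0).astype(np.uint8)

def hash_topk_attention(q, K, V, P, k=4):
    qc, Kc = hash_codes(q[None, :], P)[0], hash_codes(K, P)
    ham = (qc ^ Kc).sum(axis=1)        # Hamming distance tracks relative qk order
    idx = np.argsort(ham)[:k]          # cheap top-k candidate selection
    s = K[idx] @ q / np.sqrt(len(q))   # exact scores only for the k survivors
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(1)
d, n, bits = 32, 256, 64
P = rng.normal(size=(d, bits))         # learned via learning-to-hash in HATA; random here
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(hash_topk_attention(q, K, V, P).shape)  # (32,)
```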

[AI-52] MLaGA: Multimodal Large Language and Graph Assistant

【Quick Read】: This paper aims to solve the weakness of existing large language model (LLM)-based graph analysis methods on multimodal graphs: they are optimized for text-rich graph structures but model graphs whose nodes carry diverse attribute types (such as text and images) poorly. The key to the solution is the proposed Multimodal Large Language and Graph Assistant (MLaGA), whose core innovations are a structure-aware multimodal encoder that aligns textual and visual attributes in a unified space, and lightweight projectors that seamlessly integrate multimodal features and graph structure into the LLM.

Link: https://arxiv.org/abs/2506.02568
Authors: Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions. However, their applications to multimodal graphs–where nodes are associated with diverse attribute types, such as texts and images–remain underexplored, despite their ubiquity in real-world scenarios. To bridge the gap, we introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes. We first design a structure-aware multimodal encoder to align textual and visual attributes within a unified space through a joint graph pre-training objective. Subsequently, we implement a multimodal instruction-tuning approach to seamlessly integrate multimodal features and graph structures into the LLM through lightweight projectors. Extensive experiments across multiple datasets demonstrate the effectiveness of MLaGA compared to leading baseline methods, achieving superior performance in diverse graph learning tasks under both supervised and transfer learning scenarios.

[AI-53] Towards Generating Controllable and Solvable Geometry Problem by Leveraging Symbolic Deduction Engine ACL’25

【Quick Read】: This paper tackles the task of generating high-quality geometry problems, which matters for education yet is challenging, particularly because geometry problems involve multimodal formats and translation between informal and formal languages. The key to the solution is a new framework, the Symbolic Deduction Engine-based Geometry Problem Generation framework (SDE-GPG), whose core is searching a predefined mapping table from knowledge points to extended definitions and combining symbolic deduction, problem filtering, and text and diagram generation, thereby avoiding the inherent biases of translating natural language into formal language and enabling precise control over the knowledge points and difficulty of the generated problems.

Link: https://arxiv.org/abs/2506.02565
Authors: Zhuoxuan Jiang, Tianyang Zhang, Peiyan Peng, Jing Chen, Yinong Xun, Haotian Zhang, Lichi Li, Yong Li, Shaohua Zhang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: To appear in ACL'25

Click to view abstract

Abstract:Generating high-quality geometry problems is both an important and challenging task in education. Compared to math word problems, geometry problems further emphasize multi-modal formats and the translation between informal and formal languages. In this paper, we introduce a novel task for geometry problem generation and propose a new pipeline method: the Symbolic Deduction Engine-based Geometry Problem Generation framework (SDE-GPG). The framework leverages a symbolic deduction engine and contains four main steps: (1) searching a predefined mapping table from knowledge points to extended definitions, (2) sampling extended definitions and performing symbolic deduction, (3) filtering out unqualified problems, and (4) generating textual problems and diagrams. Specifically, the designed mapping table allows our method to avoid inherent biases in translating natural language into formal language, and an elaborate checking function guarantees control over the knowledge points and difficulty of the generated problems. The resulting formal problems are translated into natural language, and the accompanying diagrams are drawn automatically by rule-based methods. We conduct experiments using real-world combinations of knowledge points from two public datasets. The results demonstrate that SDE-GPG can effectively generate readable, solvable and controllable geometry problems.

[AI-54] CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

【Quick Read】: This paper addresses the inadequate evaluation of large language models' (LLMs) capabilities in cybersecurity, where existing benchmarks either fail to reflect real-world scenarios or are limited in scope. The key to the solution is CyberGym, a large-scale, high-quality cybersecurity evaluation framework containing 1,507 vulnerabilities found and patched in real software projects. It focuses on generating verifiable proof-of-concept (PoC) tests from textual descriptions and the corresponding source repositories, a task that requires comprehensive reasoning across entire codebases to locate the relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability.

Link: https://arxiv.org/abs/2506.02548
Authors: Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Thoroughly assessing their cybersecurity capabilities is critical and urgent, given the high stakes in this domain. However, existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. To address this gap, we introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities found and patched across 188 large software projects. While it includes tasks of various settings, CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests for vulnerability reproduction, based on text descriptions and corresponding source repositories. Solving this task is particularly challenging, as it requires comprehensive reasoning across entire codebases to locate relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability starting from the program’s entry point. Our evaluation across 4 state-of-the-art agent frameworks and 9 LLMs reveals that even the best combination (OpenHands and Claude-3.7-Sonnet) achieves only a 11.9% reproduction success rate, mainly on simpler cases. Beyond reproducing historical vulnerabilities, we find that PoCs generated by LLM agents can reveal new vulnerabilities, identifying 15 zero-days affecting the latest versions of the software projects.

[AI-55] Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making

【Quick Read】: This paper addresses the distinct challenges that large language models (LLMs) and reinforcement learning (RL) each face in large-scale industrial decision problems: LLMs lack real-time long-sequence decision-making capability, while RL suffers from poor sample efficiency in vast action spaces. The key to the solution is the proposed Agents Co-Evolution (ACE) framework, which co-evolves LLMs and RL agents through a dual-role trajectory refinement mechanism in which the LLM serves as both Policy Actor and Value Critic during RL training: it refines suboptimal actions via multi-step reasoning and environment validation, and performs temporal credit assignment through trajectory-level reward shaping. Meanwhile, the RL agent improves the LLM's task-specific decision-making with high-quality fine-tuning datasets generated via prioritized experience replay.

Link: https://arxiv.org/abs/2506.02522
Authors: Xu Wan, Wenyue Xu, Chao Yang, Mingyang Sun
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advancements in Large Language Models (LLMs) and Reinforcement Learning (RL) have shown significant promise in decision-making tasks. Nevertheless, for large-scale industrial decision problems, both approaches face distinct challenges: LLMs lack real-time long-sequence decision-making capabilities, while RL struggles with sample efficiency in vast action spaces. To bridge this gap, we propose Agents Co-Evolution (ACE), a synergistic framework between LLMs and RL agents for large-scale decision-making scenarios. ACE introduces a dual-role trajectory refinement mechanism where LLMs act as both Policy Actor and Value Critic during RL’s training: the Actor refines suboptimal actions via multi-step reasoning and environment validation, while the Critic performs temporal credit assignment through trajectory-level reward shaping. Concurrently, the RL agent enhances LLMs’ task-specific decision-making with high-quality fine-tuning datasets generated via prioritized experience replay. Through extensive experiments across multiple power grid operation challenges with action spaces exceeding 60K discrete actions, ACE demonstrates superior performance over existing RL methods and LLM-based methods.

[AI-56] Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM

【Quick Read】: This paper tackles root cause analysis (RCA) for failures caused by state inconsistency in Kubernetes, especially the operational disruptions and economic losses triggered by unexpected failures, network disruptions, and asynchrony in dynamic cloud environments. The key to the solution is the proposed SynergyRCA, a tool that combines large language models (LLMs) with retrieval augmentation from graph databases and expert-prompt enhancement, constructing a StateGraph and a MetaGraph to capture spatio-temporal relationships among resources and connections among entities, thereby identifying root causes efficiently and precisely.

Link: https://arxiv.org/abs/2506.02490
Authors: Yong Xiang (1), Charley Peter Chen (2), Liyi Zeng (3), Wei Yin (1), Xin Liu (1), Hu Li (4), Wei Xu (1) ((1) Tsinghua University, (2) Harmonic Inc, (3) Peng Cheng Laboratory, (4) Unaffiliated)
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 12 pages, 13 figures, 5 tables

Click to view abstract

Abstract:Kubernetes, a notably complex and distributed system, utilizes an array of controllers to uphold cluster management logic through state reconciliation. Nevertheless, maintaining state consistency presents significant challenges due to unexpected failures, network disruptions, and asynchronous issues, especially within dynamic cloud environments. These challenges result in operational disruptions and economic losses, underscoring the necessity for robust root cause analysis (RCA) to enhance Kubernetes reliability. The development of large language models (LLMs) presents a promising direction for RCA. However, existing methodologies encounter several obstacles, including the diverse and evolving nature of Kubernetes incidents, the intricate context of incidents, and the polymorphic nature of these incidents. In this paper, we introduce SynergyRCA, an innovative tool that leverages LLMs with retrieval augmentation from graph databases and enhancement with expert prompts. SynergyRCA constructs a StateGraph to capture spatial and temporal relationships and utilizes a MetaGraph to outline entity connections. Upon the occurrence of an incident, an LLM predicts the most pertinent resource, and SynergyRCA queries the MetaGraph and StateGraph to deliver context-specific insights for RCA. We evaluate SynergyRCA using datasets from two production Kubernetes clusters, highlighting its capacity to identify numerous root causes, including novel ones, with high efficiency and precision. SynergyRCA demonstrates the ability to identify root causes in an average time of about two minutes and achieves an impressive precision of approximately 0.90.

[AI-57] Generative AI for Predicting 2D and 3D Wildfire Spread: Beyond Physics-Based Models and Traditional Deep Learning

【Quick Read】: This paper addresses the human, environmental, and economic losses that wildfires inflict worldwide, and the limitations of existing physics-based and deep learning models in real-time prediction and visualization of multimodal fire spread, especially in 2D and 3D spatial domains using dynamically updated geographic information system (GIS) data. The key to the solution is adopting generative AI as a foundational framework, exploiting its strengths in multimodal data fusion, diverse scenario generation under uncertainty, and modeling of wildfire dynamics across spatio-temporal scales, to improve the accuracy of fire-spread forecasting and the realism of simulation.

Link: https://arxiv.org/abs/2506.02485
Authors: Haowen Xu, Sisi Zlatanova, Ruiyu Liang, Ismet Canbulat
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:Wildfires continue to inflict devastating human, environmental, and economic losses globally, as tragically exemplified by the 2025 Los Angeles wildfire and the urgent demand for more effective response strategies. While physics-based and deep learning models have advanced wildfire simulation, they face critical limitations in predicting and visualizing multimodal fire spread in real time, particularly in both 2D and 3D spatial domains using dynamically updated GIS data. These limitations hinder timely emergency response, infrastructure protection, and community safety. Generative AI has recently emerged as a transformative approach across research and industry. Models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Transformers, and diffusion-based architectures offer distinct advantages over traditional methods, including the integration of multimodal data, generation of diverse scenarios under uncertainty, and improved modeling of wildfire dynamics across spatial and temporal scales. This position paper advocates for the adoption of generative AI as a foundational framework for wildfire prediction. We explore how such models can enhance 2D fire spread forecasting and enable more realistic, scalable 3D simulations. Additionally, we employ a novel human-AI collaboration framework using large language models (LLMs) for automated knowledge extraction, literature synthesis, and bibliometric mapping. Looking ahead, we identify five key visions for integrating generative AI into wildfire management: multimodal approaches, AI foundation models, conversational AI systems, edge-computing-based scenario generation, and cognitive digital twins. We also address three major challenges accompanying these opportunities and propose potential solutions to support their implementation.

[AI-58] A Smart Multimodal Healthcare Copilot with Powerful LLM Reasoning

【Quick Read】: This paper targets misdiagnosis, which causes significant harm to healthcare systems worldwide by increasing costs and patient risks. The key to the solution is MedRAG, a smart multimodal healthcare copilot equipped with powerful large language model (LLM) reasoning; through retrieval-augmented generation combined with knowledge graph-elicited reasoning, it retrieves and integrates critical diagnostic insights and thereby reduces the risk of misdiagnosis.

Link: https://arxiv.org/abs/2506.02470
Authors: Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Misdiagnosis causes significant harm to healthcare systems worldwide, leading to increased costs and patient risks. MedRAG is a smart multimodal healthcare copilot equipped with powerful large language model (LLM) reasoning, designed to enhance medical decision-making. It supports multiple input modalities, including non-intrusive voice monitoring, general medical queries, and electronic health records. MedRAG provides recommendations on diagnosis, treatment, medication, and follow-up questioning. Leveraging retrieval-augmented generation enhanced by knowledge graph-elicited reasoning, MedRAG retrieves and integrates critical diagnostic insights, reducing the risk of misdiagnosis. It has been evaluated on both public and private datasets, outperforming existing models and offering more specific and accurate healthcare assistance. A demonstration video of MedRAG is available at: this https URL. The source code is available at: this https URL.

[AI-59] VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents

【Quick Read】: This paper addresses the security of generative AI agents under visual prompt injection (VPI) attacks, in particular the potential risks faced by Computer-Use Agents (CUAs) with full system access and Browser-Use Agents (BUAs). The key to the solution is VPI-Bench, a benchmark of 306 test cases across five widely used platforms for evaluating agent robustness under VPI threats. With interactive test cases deployed in realistic environments, the study shows that current CUAs and BUAs can be deceived at rates of up to 51% and 100% on certain platforms, and that existing system-prompt defenses offer only limited improvement, underscoring the importance of developing robust, context-aware defense mechanisms.

Link: https://arxiv.org/abs/2506.02456
Authors: Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, Bryan Hooi
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Under review

Click to view abstract

Abstract:Computer-Use Agents (CUAs) with full system access enable powerful task automation but pose significant security and privacy risks due to their ability to manipulate files, access user data, and execute arbitrary commands. While prior work has focused on browser-based agents and HTML-level attacks, the vulnerabilities of CUAs remain underexplored. In this paper, we investigate Visual Prompt Injection (VPI) attacks, where malicious instructions are visually embedded within rendered user interfaces, and examine their impact on both CUAs and Browser-Use Agents (BUAs). We propose VPI-Bench, a benchmark of 306 test cases across five widely used platforms, to evaluate agent robustness under VPI threats. Each test case is a variant of a web platform, designed to be interactive, deployed in a realistic environment, and containing a visually embedded malicious prompt. Our empirical study shows that current CUAs and BUAs can be deceived at rates of up to 51% and 100%, respectively, on certain platforms. The experimental results also indicate that system prompt defenses offer only limited improvements. These findings highlight the need for robust, context-aware defenses to ensure the safe deployment of multimodal AI agents in real-world environments. The code and dataset are available at: this https URL

[AI-60] A Review of Various Datasets for Machine Learning Algorithm-Based Intrusion Detection System: Advances and Challenges

【Quick Read】: This paper examines how machine learning (ML) and deep learning (DL) methods can improve the detection accuracy and effectiveness of intrusion detection systems (IDS). The key to the solution is training and evaluating a variety of classifiers, including support vector machines (SVM), K-nearest neighbors (KNN), decision trees (DT), logistic regression (LR), naive Bayes (NB), random forests (RF), XGBoost, AdaBoost, and artificial neural networks (ANN), on public datasets such as KDDCUP'99, NSL-KDD, UNSW-NB15, CICIDS-2017, and CSE-CIC-IDS2018, to identify normal or abnormal network traffic and thus strengthen the detection of security threats.

Link: https://arxiv.org/abs/2506.02438
Authors: Sudhanshu Sekhar Tripathy, Bichitrananda Behera
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:IDS aims to protect computer networks from security threats by detecting, notifying, and taking appropriate action to prevent illegal access and protect confidential information. As the globe becomes increasingly dependent on technology and automated processes, ensuring secured systems, applications, and networks has become one of the most significant problems of this era. The global web and digital technology have significantly accelerated the evolution of the modern world, necessitating the use of telecommunications and data transfer platforms. Researchers are enhancing the effectiveness of IDS by incorporating popular datasets into machine learning algorithms. IDS, equipped with machine learning classifiers, enhances security attack detection accuracy by identifying normal or abnormal network traffic. This paper explores the methods of capturing and reviewing intrusion detection systems (IDS) and evaluates the challenges existing datasets face. A deluge of research on machine learning (ML) and deep learning (DL) architecture-based intrusion detection techniques has been conducted in the past ten years on various cybersecurity datasets, including KDDCUP’99, NSL-KDD, UNSW-NB15, CICIDS-2017, and CSE-CIC-IDS2018. We conducted a literature review and presented an in-depth analysis of various intrusion detection methods that use SVM, KNN, DT, LR, NB, RF, XGBoost, AdaBoost, and ANN. We provide an overview of each technique, explaining the role of the classifiers and algorithms used. A detailed tabular analysis highlights the datasets used, classifiers employed, attacks detected, evaluation metrics, and conclusions drawn. This article offers a thorough review for future IDS research.
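As a concrete instance of the reviewed pipeline (flow features in, classifier out, normal/attack label), here is a minimal scikit-learn example. The synthetic arrays merely stand in for a real dataset such as NSL-KDD or CICIDS-2017; any of the other surveyed classifiers can be swapped in for the random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in for flow features (duration, bytes, packet rate, ...);
# in practice X and y would be loaded from an NSL-KDD-style dataset.
rng = np.random.default_rng(42)
X_normal = rng.normal(0.0, 1.0, size=(1000, 8))
X_attack = rng.normal(1.5, 1.2, size=(300, 8))   # attacks shifted in feature space
X = np.vstack([X_normal, X_attack])
y = np.array([0] * 1000 + [1] * 300)              # 0 = normal, 1 = intrusion

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["normal", "attack"]))
```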

[AI-61] AERO: A Redirection-Based Optimization Framework Inspired by Judo for Robust Probabilistic Forecasting NEURIPS2025

【Quick Read】: This paper addresses the difficulty that optimization methods in machine learning have in maintaining stability and adaptability in dynamic nonlinear systems, especially under uncertainty. The key to the solution is the proposed AERO (Adversarial Energy-based Redirection Optimization) framework, inspired by the redirection principle in Judo, which turns external disturbances into assets for optimization; guided by 15 interrelated axioms covering adversarial correction, energy conservation, and disturbance-aware learning, it achieves stable and robust model updates.

Link: https://arxiv.org/abs/2506.02415
Authors: Karthikeyan Vaiapury
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 1 figure, submitted to NeurIPS 2025 (preprint version)

Click to view abstract

Abstract:Optimization remains a fundamental pillar of machine learning, yet existing methods often struggle to maintain stability and adaptability in dynamic, non-linear systems, especially under uncertainty. We introduce AERO (Adversarial Energy-based Redirection Optimization), a novel framework inspired by the redirection principle in Judo, where external disturbances are leveraged rather than resisted. AERO reimagines optimization as a redirection process guided by 15 interrelated axioms encompassing adversarial correction, energy conservation, and disturbance-aware learning. By projecting gradients, integrating uncertainty-driven dynamics, and managing learning energy, AERO offers a principled approach to stable and robust model updates. Applied to probabilistic solar energy forecasting, AERO demonstrates substantial gains in predictive accuracy, reliability, and adaptability, especially in noisy and uncertain environments. Our findings highlight AERO as a compelling new direction in the theoretical and practical landscape of optimization.

[AI-62] Random at First, Fast at Last: NTK-Guided Fourier Pre-Processing for Tabular DL

【Quick Read】: This paper addresses shortcomings in tabular deep learning pipelines revealed by Neural Tangent Kernel (NTK) analysis. The key to the solution is repurposing random Fourier mappings as a parameter-free, architecture-agnostic preprocessing step: frequencies are drawn once at initialization and each input is mapped into a fixed feature space via sine and cosine projections, avoiding the need for manual normalization or extra learnable embeddings. Within the NTK framework, this mapping bounds and conditions the network's initial NTK spectrum and introduces a bias that shortens the optimization trajectory, thereby accelerating gradient-based training.

Link: https://arxiv.org/abs/2506.02406
Authors: Renat Sergazinov, Jing Wu, Shao-An Yin
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 16 pages, 3 figures, 1 table

Click to view abstract

Abstract:While random Fourier features are a classic tool in kernel methods, their utility as a pre-processing step for deep learning on tabular data has been largely overlooked. Motivated by shortcomings in tabular deep learning pipelines - revealed through Neural Tangent Kernel (NTK) analysis - we revisit and repurpose random Fourier mappings as a parameter-free, architecture-agnostic transformation. By projecting each input into a fixed feature space via sine and cosine projections with frequencies drawn once at initialization, this approach circumvents the need for ad hoc normalization or additional learnable embeddings. We show within the NTK framework that this mapping (i) bounds and conditions the network’s initial NTK spectrum, and (ii) introduces a bias that shortens the optimization trajectory, thereby accelerating gradient-based training. These effects pre-condition the network with a stable kernel from the outset. Empirically, we demonstrate that deep networks trained on Fourier-transformed inputs converge more rapidly and consistently achieve strong final performance, often with fewer epochs and less hyperparameter tuning. Our findings establish random Fourier pre-processing as a theoretically motivated, plug-and-play enhancement for tabular deep learning.
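The pre-processing step itself is simple enough to state in full. Below is a minimal sketch; the class name, `n_features`, and the `sigma` bandwidth knob are assumptions of this illustration (the paper may parameterize the frequency scale differently), but the structure, frequencies frozen at initialization with sine/cosine projections, is exactly what the abstract describes.

```python
import numpy as np

class RandomFourierPreprocessor:
    """Parameter-free map x -> [sin(xW), cos(xW)]; W is drawn once at init."""
    def __init__(self, d_in, n_features=128, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 1.0 / sigma, size=(d_in, n_features))

    def __call__(self, X):
        Z = X @ self.W
        return np.concatenate([np.sin(Z), np.cos(Z)], axis=1)

X = np.random.default_rng(1).normal(size=(32, 10))   # a batch of tabular rows
phi = RandomFourierPreprocessor(d_in=10, n_features=64)
X_fourier = phi(X)      # feed this, not X, to the downstream MLP
print(X_fourier.shape)  # (32, 128); the frequencies are frozen, nothing is learned here
```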

[AI-63] OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation

【Quick Read】: This paper addresses the problem that large reasoning models (LRMs) may over-rely on complex reasoning when handling simple tasks, wasting computational resources. The key to the solution is systematically analyzing LRM reasoning trajectories and, using the identified paradigms and an LLM-Judge, classifying these trajectories as Redundant Reasoning or Essential Reasoning. The introduced OThink-R1 prunes redundant reasoning steps while preserving logical validity, and dynamically adopts a non-thinking (fast-thinking) or deliberate (slow-thinking) mode according to problem complexity. Experiments show that OThink-R1 reduces reasoning redundancy by roughly 23% on average without harming accuracy.

Link: https://arxiv.org/abs/2506.02397
Authors: Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical issue: a substantial portion of simple tasks solved by LRMs can also be addressed by non-reasoning LLMs using significantly fewer tokens, indicating the complex reasoning may not always be necessary. To address this, we systematically analyze the reasoning trajectories of LRMs and present a method utilizing identified paradigms and LLM-Judge to classify these trajectories as either Redundant Reasoning or Essential Reasoning. We then introduce OThink-R1, a method that prunes redundant reasoning steps while preserving logical validity. OThink-R1 dynamically employs the non-thinking mode (fast-thinking) for straightforward problems while engaging in deliberate thinking (slow-thinking) for complex problems. Experiments across mathematical and question-answering tasks demonstrate that OThink-R1 reduces reasoning redundancy by almost 23% on average without compromising accuracy, offering practical guidelines for efficient reasoning models. The code is available at this https URL.

[AI-64] Univariate to Multivariate: LLMs as Zero-Shot Predictors for Time-Series Forecasting

【Quick Read】: This paper investigates the effectiveness of applying large language models (LLMs) to forecasting on complex, noisy, multivariate time-series data. The key to the solution is converting time-series sequences into text and feeding them to LLMs for prediction, together with two main data preprocessing techniques: first, time-series sequence decomposition to improve prediction accuracy on complex and noisy univariate sequences; second, a lightweight prompt-processing strategy that extends this univariate forecasting capability to multivariate data.

Link: https://arxiv.org/abs/2506.02389
Authors: Chamara Madarasingha, Nasrin Sohrabi, Zahir Tari
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Time-series prediction or forecasting is critical across many real-world dynamic systems, and recent studies have proposed using Large Language Models (LLMs) for this task due to their strong generalization capabilities and ability to perform well without extensive pre-training. However, their effectiveness in handling complex, noisy, and multivariate time-series data remains underexplored. To address this, we propose LLMPred, which enhances LLM-based time-series prediction by converting time-series sequences into text and feeding them to LLMs for zero-shot prediction, along with two main data pre-processing techniques. First, we apply time-series sequence decomposition to facilitate accurate prediction on complex and noisy univariate sequences. Second, we extend this univariate prediction capability to multivariate data using a lightweight prompt-processing strategy. Extensive experiments with smaller LLMs such as Llama 2 7B, Llama 3.2 3B, GPT-4o-mini, and DeepSeek 7B demonstrate that LLMPred achieves competitive or superior performance compared to state-of-the-art baselines. Additionally, a thorough ablation study highlights the importance of the key components proposed in LLMPred.
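A minimal sketch of the two pre-processing steps, a moving-average decomposition followed by text serialization of each component, might look like the following; the exact decomposition method and prompt wording used by LLMPred are assumptions here, not taken from the paper.

```python
import numpy as np

def decompose(series, window=5):
    """Split a noisy series into a smooth trend and a residual component."""
    pad = window // 2
    padded = np.pad(series, pad, mode="edge")
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    return trend, series - trend

def to_prompt(series, horizon=3):
    """Serialize components as text so an LLM can continue each one."""
    trend, resid = decompose(np.asarray(series, dtype=float))
    fmt = lambda a: ", ".join(f"{v:.2f}" for v in a)
    return (f"Trend: {fmt(trend)}\nResidual: {fmt(resid)}\n"
            f"Continue each component for {horizon} steps. Output numbers only.")

print(to_prompt([1.0, 1.2, 1.1, 1.5, 1.7, 1.6, 2.0]))
# The two component forecasts returned by the LLM are summed to form the prediction.
```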

[AI-65] VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

【Quick Read】: This paper addresses the insufficient evaluation of vision language models' (VLMs) strategic reasoning and decision-making abilities in multi-agent interactive environments. Existing benchmarks are limited to single-agent or text-only settings, whereas real-world scenarios usually involve multiple agents interacting in rich visual and linguistic contexts, placing higher demands on multimodal perception and strategic interaction. The key to the solution is Visual Strategic Bench (VS-Bench), a multimodal benchmark of eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, which evaluates VLMs' strategic reasoning and long-term objective optimization via offline evaluation of strategic reasoning (next-action prediction accuracy) and online evaluation of decision-making (normalized episode return).

Link: https://arxiv.org/abs/2506.02387
Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents’ ability to predict others’ future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments of fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at this https URL.

[AI-66] Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget UAI

【Quick Read】: This paper addresses best feasible arm identification under a fixed budget in K-armed bandit models, where, despite extensive study, the exact exponential rate at which the error probability approaches zero had not been established. The key to the solution is a new algorithm that achieves exponential decay of the error probability via a posterior sampling framework embedded in a game-based sampling rule, with a decay rate that matches the information-theoretic lower bound. The method shares its foundations with Thompson sampling but is specifically tailored to the identification process under fixed-budget constraints.

Link: https://arxiv.org/abs/2506.02386
Authors: Jie Bian, Vincent Y. F. Tan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Accepted to the Conference on Uncertainty in Artificial Intelligence (UAI) 2025

Click to view abstract

Abstract:The challenge of identifying the best feasible arm within a fixed budget has attracted considerable interest in recent years. However, a notable gap remains in the literature: the exact exponential rate at which the error probability approaches zero has yet to be established, even in the relatively simple setting of K-armed bandits with Gaussian noise. In this paper, we address this gap by examining the problem within the context of linear bandits. We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability. Remarkably, the decay rate – characterized by the exponent – matches the theoretical lower bound derived using information-theoretic principles. Our approach leverages a posterior sampling framework embedded within a game-based sampling rule involving a min-learner and a max-learner. This strategy shares its foundations with Thompson sampling, but is specifically tailored to optimize the identification process under fixed-budget constraints. Furthermore, we validate the effectiveness of our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity. The results corroborate our theoretical findings and demonstrate that our method outperforms several benchmark algorithms in terms of both accuracy and efficiency.

[AI-67] MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models

【Quick Read】: This paper addresses the threat that model extraction attacks pose to the intellectual property (IP) of machine-learning-as-a-service (MLaaS) providers, especially in realistic settings where existing defenses assume attacker queries contain out-of-distribution (OOD) samples, an assumption that no longer holds for modern models attacked under limited query budgets. The key to the solution is MISLEADER, a new defense strategy that drops the OOD assumption; its core is formulating model protection as a bilevel optimization problem that simultaneously preserves predictive fidelity on benign inputs and reduces the extractability of potential clone models, combining data augmentation with an ensemble of heterogeneous distilled models to improve robustness and diversity.

Link: https://arxiv.org/abs/2506.02362
Authors: Xueqi Cheng, Minxing Zheng, Shixiang Zhu, Yushun Dong
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Model extraction attacks aim to replicate the functionality of a black-box model through query access, threatening the intellectual property (IP) of machine-learning-as-a-service (MLaaS) providers. Defending against such attacks is challenging, as it must balance efficiency, robustness, and utility preservation in the real-world scenario. Despite the recent advances, most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs. However, this assumption is increasingly unreliable, as modern models are trained on diverse datasets and attackers often operate under limited query budgets. As a result, the effectiveness of these defenses is significantly compromised in realistic deployment scenarios. To address this gap, we propose MISLEADER (enseMbles of dIStiLled modEls Against moDel ExtRaction), a novel defense strategy that does not rely on OOD assumptions. MISLEADER formulates model protection as a bilevel optimization problem that simultaneously preserves predictive fidelity on benign inputs and reduces extractability by potential clone models. Our framework combines data augmentation to simulate attacker queries with an ensemble of heterogeneous distilled models to improve robustness and diversity. We further provide a tractable approximation algorithm and derive theoretical error bounds to characterize defense effectiveness. Extensive experiments across various settings validate the utility-preserving and extraction-resistant properties of our proposed defense strategy. Our code is available at this https URL.

[AI-68] Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components ICML2025

【Quick Read】: This paper addresses how to verify agent behavior and detect potential control deficiencies early in advanced AI development, in particular how to ensure agents adhere to safety principles when safety-critical principles conflict with operational goals. The key to the solution is a lightweight, interpretable benchmark methodology that uses a simple grid world to evaluate whether an LLM agent upholds a predefined high-level safety principle (e.g., "never enter hazardous zones") when faced with conflicting lower-level task instructions, thereby probing a foundational controllability aspect of LLMs.

Link: https://arxiv.org/abs/2506.02357
Authors: Ram Potham (Independent Researcher)
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Preprint. This work has been submitted to the Technical AI Governance Workshop at ICML 2025 for review

Click to view abstract

Abstract:Credible safety plans for advanced AI development require methods to verify agent behavior and detect potential control deficiencies early. A fundamental aspect is ensuring agents adhere to safety-critical principles, especially when these conflict with operational goals. Failure to prioritize such principles indicates a potential basic control failure. This paper introduces a lightweight, interpretable benchmark methodology using a simple grid world to evaluate an LLM agent’s ability to uphold a predefined, high-level safety principle (e.g., “never enter hazardous zones”) when faced with conflicting lower-level task instructions. We probe whether the agent reliably prioritizes the inviolable directive, testing a foundational controllability aspect of LLMs. This pilot study demonstrates the methodology’s feasibility, offers preliminary insights into agent behavior under principle conflict, and discusses how such benchmarks can contribute empirical evidence for assessing controllability. We argue that evaluating adherence to hierarchical principles is a crucial early step in understanding our capacity to build governable AI systems.

[AI-69] Sensitivity-Aware Density Estimation in Multiple Dimensions

【Quick Read】: This paper addresses multidimensional probability density estimation under non-uniform sampling, where detector sensitivity is modeled as a heterogeneous density. The key to the solution is exploiting the computational speed and flexible boundary conditions of splines on a grid, and regularizing the spline's Hessian via the nuclear norm to promote sparsity, yielding a method that is spatially adaptive and stable against the choice of the regularization parameter, which plays the role of the bandwidth.

Link: https://arxiv.org/abs/2506.02323
Authors: Aleix Boquet-Pujadas, Pol del Aguila Pla, Michael Unser
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Data Structures and Algorithms (cs.DS); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:We formulate an optimization problem to estimate probability densities in the context of multidimensional problems that are sampled with uneven probability. It considers detector sensitivity as a heterogeneous density and takes advantage of the computational speed and flexible boundary conditions offered by splines on a grid. We choose to regularize the Hessian of the spline via the nuclear norm to promote sparsity. As a result, the method is spatially adaptive and stable against the choice of the regularization parameter, which plays the role of the bandwidth. We test our computational pipeline on standard densities and provide software. We also present a new approach to PET rebinning as an application of our framework.
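Spelled out, the optimization the abstract describes plausibly takes the following shape; this is a hedged reconstruction from the abstract alone, and the paper's exact notation and constraints may differ. Here $f_c$ is the spline-on-a-grid density with coefficients $c$, $s(x)$ the detector sensitivity, $\mathbf{H} f_c$ the Hessian, and $\|\cdot\|_*$ the nuclear norm:

```latex
\min_{c}\; -\sum_{i=1}^{N} \log\big( s(x_i)\, f_c(x_i) \big)
\;+\; \lambda \int \big\| \mathbf{H} f_c(x) \big\|_{*}\, dx
\quad \text{s.t.}\quad f_c \ge 0,\;\; \int s(x)\, f_c(x)\, dx = 1 .
```

The regularization weight $\lambda$ plays the role of the bandwidth, and the nuclear norm on the Hessian is what makes the estimate spatially adaptive.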

[AI-70] A Data-Based Architecture for Flight Test without Test Points

【Quick Read】: This paper challenges the "test point" in flight testing: pilot deviations from pre-specified conditions invalidate the model assumptions, and the fundamental problem is the very existence of databands and tolerances rather than inadequate pilot skill. The key to the solution is a "point-less" architecture: a reduced-order model (ROM) is built from a high-fidelity digital model, generates predictions at whatever conditions the pilot actually flies, and is updated with the new data. The outcome of flight test is thus a ROM refined at the conditions actually flown, which in turn updates and validates the high-fidelity model.

Link: https://arxiv.org/abs/2506.02315
Authors: D. Isaiah Harp, Joshua Ott, John Alora, Dylan Asmar
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: The Society of Experimental Test Pilots Annual Symposium, vol. 68th, 2024

Click to view abstract

Abstract:The justification for the “test point” derives from the test pilot’s obligation to reproduce faithfully the pre-specified conditions of some model prediction. Pilot deviation from those conditions invalidates the model assumptions. Flight test aids have been proposed to increase accuracy on more challenging test points. However, the very existence of databands and tolerances is the problem more fundamental than inadequate pilot skill. We propose a novel approach, which eliminates test points. We start with a high-fidelity digital model of an air vehicle. Instead of using this model to generate a point prediction, we use a machine learning method to produce a reduced-order model (ROM). The ROM has two important properties. First, it can generate a prediction based on any set of conditions the pilot flies. Second, if the test result at those conditions differ from the prediction, the ROM can be updated using the new data. The outcome of flight test is thus a refined ROM at whatever conditions were flown. This ROM in turn updates and validates the high-fidelity model. We present a single example of this “point-less” architecture, using T-38C flight test data. We first use a generic aircraft model to build a ROM of longitudinal pitching motion as a hypersurface. We then ingest unconstrained flight test data and use Gaussian Process Regression to update and condition the hypersurface. By proposing a second-order equivalent system for the T-38C, this hypersurface then generates parameters necessary to assess MIL-STD-1797B compliance for longitudinal dynamics.
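A toy version of the update step, using a Gaussian process as the ROM over flight conditions, could look like the sketch below. Every number here (the stand-in hypersurface, the flight conditions, the measured responses) is fabricated for illustration, and the paper's actual ROM construction from the high-fidelity model is more involved than a direct GP fit.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical ROM: a scalar response predicted from (Mach, alpha).
# Seed the GP with samples of a high-fidelity model, then update with flight data.
rng = np.random.default_rng(0)
X_model = rng.uniform([0.4, 0.0], [0.9, 10.0], size=(40, 2))   # (Mach, alpha deg)
y_model = 0.6 - 0.3 * X_model[:, 0] + 0.01 * X_model[:, 1]     # stand-in hypersurface

gp = GaussianProcessRegressor(kernel=RBF([0.2, 3.0]) + WhiteKernel(1e-3),
                              normalize_y=True).fit(X_model, y_model)

# Flight test: the pilot flies *any* condition; no databand is required.
X_flight = np.array([[0.72, 4.3], [0.81, 6.9]])
y_flight = np.array([0.40, 0.35])                               # measured responses
X_all = np.vstack([X_model, X_flight])
y_all = np.concatenate([y_model, y_flight])
gp_updated = GaussianProcessRegressor(kernel=gp.kernel_,
                                      normalize_y=True).fit(X_all, y_all)

mean, std = gp_updated.predict(np.array([[0.75, 5.0]]), return_std=True)
print(f"prediction {mean[0]:.3f} +/- {std[0]:.3f}")   # refined ROM with uncertainty
```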

[AI-71] MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

【Quick Read】: This paper addresses the observation that, in multimodal instruction tuning, increasing the number of tasks does not always improve model performance. The key to the solution is MINT, a task-grouping strategy based on the type of multimodal interaction: grouping tasks with shared interaction patterns encourages the model to learn transferable skills within each group while suppressing interference between mismatched tasks.

Link: https://arxiv.org/abs/2506.02308
Authors: Xiaojun Shan, Qi Cao, Xing Han, Haofei Yu, Paul Pu Liang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advances in multimodal foundation models have achieved state-of-the-art performance across a range of tasks. These breakthroughs are largely driven by new pre-training paradigms that leverage large-scale, unlabeled multimodal data, followed by instruction fine-tuning on curated labeled datasets and high-quality prompts. While there is growing interest in scaling instruction fine-tuning to ever-larger datasets in both quantity and scale, our findings reveal that simply increasing the number of instruction-tuning tasks does not consistently yield better performance. Instead, we observe that grouping tasks by the common interactions across modalities, such as discovering redundant shared information, prioritizing modality selection with unique information, or requiring synergistic fusion to discover new information from both modalities, encourages the models to learn transferrable skills within a group while suppressing interference from mismatched tasks. To this end, we introduce MINT, a simple yet surprisingly effective task-grouping strategy based on the type of multimodal interaction. We demonstrate that the proposed method greatly outperforms existing task grouping baselines for multimodal instruction tuning, striking an effective balance between generalization and specialization.

[AI-72] Why Gradients Rapidly Increase Near the End of Training

【Quick Read】: This paper addresses the rapid increase of the gradient norm near the end of long-duration large language model (LLM) training runs. The key to the solution is identifying and correcting an unintended interaction between weight decay, normalization layers, and the learning rate schedule; a simple adjustment fixes this behavior and also achieves lower loss values throughout training.

Link: https://arxiv.org/abs/2506.02285
Authors: Aaron Defazio
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:During long-duration Large Language Model (LLM) training runs the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.
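The abstract does not spell the mechanism out, so the following is a plausible reconstruction rather than the paper's derivation, under the standard assumption that a normalization layer makes the preceding weights $w$ scale-invariant:

```latex
L(\alpha w) = L(w) \;\Rightarrow\; \nabla L(w)^{\top} w = 0
\quad\text{and}\quad \|\nabla L(w)\| = \frac{g}{\|w\|}
\;\;\text{for some scale-free gradient magnitude } g .
```

With decoupled weight decay $\lambda$ and learning rate $\eta_t$, the orthogonality of gradient and weight gives the norm recursion $\|w_{t+1}\|^2 = (1-\eta_t\lambda)^2\|w_t\|^2 + \eta_t^2\, g^2/\|w_t\|^2$. Balancing the decay term against the gradient term yields an equilibrium $\|w\|^4 \approx \eta_t\, g^2/(2\lambda)$: as the schedule drives $\eta_t \to 0$, the equilibrium norm shrinks, and the measured gradient norm $g/\|w\|$ correspondingly rises, which is consistent with the interaction between weight decay, normalization, and the schedule that the paper identifies.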

[AI-73] Angles Don’t Lie: Unlocking Training-Efficient RL Through the Model’s Own Signals

【Quick Read】: This paper addresses the sample inefficiency of reinforcement fine-tuning (RFT) for large language models (LLMs), which stems from redundant exposure to identical queries under uniform data sampling. The key to the solution is exploiting a model-inherent signal called angle concentration, which effectively reflects an LLM's capacity to learn from specific data. Theoretical and empirical analysis reveals a correlation between the angular distribution of token hidden-state vectors and the resulting gradients, showing that the model prefers learning from data with higher angle concentration. Building on this finding, the paper proposes the GAIN-RL (Gradient-driven Angle-Informed Navigated RL) framework, which uses the model's intrinsic angle-concentration signal to dynamically select training data in each epoch, ensuring consistently impactful gradient updates and thereby significantly improving overall training efficiency.

Link: https://arxiv.org/abs/2506.02281
Authors: Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies exhibit limitations by neglecting the intrinsic learning signals generated by the model itself, thus leading to suboptimal training regimes. In this paper, we identify a model-inherent signal termed angle concentration that effectively reflects an LLM’s capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model’s intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)'s efficient sampling yields data-efficient training, achieving better performance with half the original data compared to vanilla GRPO with full training data. Code is released at this https URL.
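One natural way to operationalize the angle-concentration signal is the mean pairwise cosine similarity of a sample's token hidden states; the exact statistic used by GAIN-RL may differ, so the sketch below (function names included) is an assumption-laden illustration of the selection step, not the released code.

```python
import numpy as np

def angle_concentration(H):
    """Mean pairwise cosine similarity of token hidden states (one sample)."""
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    C = H @ H.T
    n = len(H)
    return (C.sum() - n) / (n * (n - 1))   # average over off-diagonal pairs

def select_batch(hidden_states_per_sample, frac=0.5):
    """Keep the samples whose hidden states are most angularly concentrated."""
    scores = [angle_concentration(H) for H in hidden_states_per_sample]
    k = max(1, int(frac * len(scores)))
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
data = [rng.normal(size=(16, 64)) for _ in range(8)]   # 8 samples, 16 tokens, d=64
print(select_batch(data, frac=0.25))                   # indices chosen for this epoch
```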

[AI-74] The State of Large Language Models for African Languages: Progress and Challenges

【Quick Read】: This paper addresses the inadequate coverage of Africa's 2,000 low-resource languages in large language models (LLMs), analyzing the current support for African languages across six LLMs, eight small language models (SLMs), and six specialized SLMs (SSLMs). The key to the solution includes promoting language standardization, community-driven corpus construction, and effective adaptation methods for African languages, to close major gaps such as data scarcity, tokenization biases, high computational costs, and evaluation problems.

Link: https://arxiv.org/abs/2506.02280
Authors: Kedir Yassin Hussen, Walelign Tewabe Sewunetie, Abinew Ali Ayele, Sukairaj Hafiz Imam, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa’s 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 42 supported African languages and 23 available public datasets, and reveals a large gap: only four languages (Amharic, Swahili, Afrikaans, and Malagasy) receive consistent coverage, while over 98% of African languages remain unsupported. Moreover, the review shows that only the Latin, Arabic, and Ge’ez scripts are represented, while 20 active scripts are neglected. The primary challenges are data scarcity, tokenization biases, very high computational costs, and evaluation issues. These issues demand language standardization, corpus development by the community, and effective adaptation methods for African languages.

[AI-75] A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models

【Quick Read】: This paper asks whether equivariant neural networks face obstacles during optimization: do their constraints pose a fundamental barrier to learning the global optimum, or do they merely require different hyperparameter tuning? The key to the solution is theoretical analysis and experiments revealing how the parameter symmetries of the unconstrained model shape the loss landscape of the equivariant subspace, and proving that under certain conditions these symmetries can prevent learning of the global minima. Further experiments show that relaxing the equivariance constraint to an unconstrained MLP can sometimes resolve the issue, with the resulting weights corresponding to a different choice of group representation in the hidden layer.

Link: https://arxiv.org/abs/2506.02269
Authors: YuQing Xie, Tess Smidt
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 13 figures

Click to view abstract

Abstract:Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce fundamental obstacles to optimization? Or do they simply require different hyperparameter tuning? In this work, we investigate this question through a theoretical analysis of the loss landscape geometry. We focus on networks built using permutation representations, which we can view as a subset of unconstrained MLPs. Importantly, we show that the parameter symmetries of the unconstrained model have nontrivial effects on the loss landscape of the equivariant subspace and under certain conditions can provably prevent learning of the global minima. Further, we empirically demonstrate that in such cases, relaxing to an unconstrained MLP can sometimes solve the issue. Interestingly, the weights eventually found via relaxation correspond to a different choice of group representation in the hidden layer. From this, we draw 3 key takeaways. (1) Viewing any class of networks in the context of larger unconstrained function space can give important insights on loss landscape structure. (2) Within the unconstrained function space, equivariant networks form a complicated union of linear hyperplanes, each associated with a specific choice of internal group representation. (3) Effective relaxation of equivariance may require not only adding nonequivariant degrees of freedom, but also rethinking the fixed choice of group representations in hidden layers.

[AI-76] TransAct V2: Lifelong User Action Sequence Modeling on Pinterest Recommendation

【Quick Read】: This paper addresses the challenges of modeling user action sequences in industrial recommendation systems, particularly for click-through rate (CTR) prediction: existing models typically rely on short sequences and struggle to capture long-term behavior, lack an integrated action-prediction task within the point-wise ranking framework, and fail to address the infrastructure problems of efficiently serving large-scale sequential models. The key to the solution is threefold: leveraging very long user sequences to improve CTR prediction, introducing a Next Action Loss to strengthen user action forecasting, and adopting a scalable, low-latency deployment scheme to meet the computational demands of long sequences.

Link: https://arxiv.org/abs/2506.02267
Authors: Xue Xia, Saurabh Vishwas Joshi, Kousik Rajesh, Kangnan Li, Yangyi Lu, Nikil Pancha, Dhruvil Deven Badani, Jiajing Xu, Pong Eksombatchai
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Modeling user action sequences has become a popular focus in industrial recommendation system research, particularly for Click-Through Rate (CTR) prediction tasks. However, industry-scale CTR models often rely on short user sequences, limiting their ability to capture long-term behavior. Additionally, these models typically lack an integrated action-prediction task within a point-wise ranking framework, reducing their predictive power. They also rarely address the infrastructure challenges involved in efficiently serving large-scale sequential models. In this paper, we introduce TransAct V2, a production model for Pinterest’s Homefeed ranking system, featuring three key innovations: (1) leveraging very long user sequences to improve CTR predictions, (2) integrating a Next Action Loss function for enhanced user action forecasting, and (3) employing scalable, low-latency deployment solutions tailored to handle the computational demands of extended user action sequences.

[AI-77] Composable Building Blocks for Controllable and Transparent Interactive AI Systems

【Quick Read】: This paper addresses the opacity of the overall system architecture in interactive systems caused by the black-box nature of AI models. The key to the solution is representing interactive systems as sequences of structural building blocks, including literature-grounded AI models and control mechanisms, explained through accompanying visual building blocks such as explainable AI (XAI) techniques, so that the flow and APIs of the structural blocks form an explicit system overview that aligns human and machine interpretability of AI models.

Link: https://arxiv.org/abs/2506.02262
Authors: Sebe Vanbrabant, Gustavo Rovelo Ruiz, Davy Vanacken
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to The 3rd Workshop on Engineering Interactive Systems Embedding AI Technologies, EICS 2025

Click to view abstract

Abstract:While the increased integration of AI technologies into interactive systems enables them to solve an equally increasing number of tasks, the black box problem of AI models continues to spread throughout the interactive system as a whole. Explainable AI (XAI) techniques can make AI models more accessible by employing post-hoc methods or transitioning to inherently interpretable models. While this makes individual AI models clearer, the overarching system architecture remains opaque. To this end, we propose an approach to represent interactive systems as sequences of structural building blocks, such as AI models and control mechanisms grounded in the literature. These can then be explained through accompanying visual building blocks, such as XAI techniques. The flow and APIs of the structural building blocks form an explicit overview of the system. This serves as a communication basis for both humans and automated agents like LLMs, aligning human and machine interpretability of AI models. We discuss a selection of building blocks and concretize our flow-based approach in an architecture and accompanying prototype interactive system.

[AI-78] Stochastically Dominant Peer Prediction

【Quick Read】: This paper addresses how to effectively incentivize participants to provide reliable feedback in the absence of ground-truth labels, especially under non-linear payment rules or non-linear utility functions, where traditional peer prediction mechanisms cannot guarantee truthful reporting. The key to the solution is a stronger truthfulness guarantee, stochastically dominant truthfulness (SD-truthfulness): the score distribution of truth-telling stochastically dominates that of every other strategy, incentivizing truthful reporting for a wide range of monotone utility functions. The study further shows that carefully designed rounding can preserve sensitivity, and introduces a new enforced agreement (EA) mechanism that is theoretically SD-truthful in binary-signal settings under mild assumptions and empirically achieves the highest sensitivity among known SD-truthful mechanisms.

Link: https://arxiv.org/abs/2506.02259
Authors: Yichi Zhang, Shengwei Xu, David Pennock, Grant Schoenebeck
Institution: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: 29 pages, 3 figures

Click to view abstract

Abstract:Eliciting reliable human feedback is essential for many machine learning tasks, such as learning from noisy labels and aligning AI systems with human preferences. Peer prediction mechanisms incentivize truthful reporting without ground truth verification by scoring agents based on correlations with peers. Traditional mechanisms, which ensure that truth-telling maximizes the expected scores in equilibrium, can elicit honest information while assuming agents’ utilities are linear functions of their scores. However, in practice, non-linear payment rules are usually preferred, or agents’ utilities are inherently non-linear. We propose stochastically dominant truthfulness (SD-truthfulness) as a stronger guarantee: the score distribution of truth-telling stochastically dominates all other strategies, incentivizing truthful reporting for a wide range of monotone utility functions. Our first observation is that no existing peer prediction mechanism naturally satisfies this criterion without strong assumptions. A simple solution – rounding scores into binary lotteries – can enforce SD-truthfulness, but often degrades sensitivity, a key property related to fairness and statistical efficiency. We demonstrate how a more careful application of rounding can better preserve sensitivity. Furthermore, we introduce a new enforced agreement (EA) mechanism that is theoretically guaranteed to be SD-truthful in binary-signal settings under mild assumptions, and empirically achieves the highest sensitivity among all known SD-truthful mechanisms.

[AI-79] Human Heterogeneity Invariant Stress Sensing

【Quick Read】: This paper addresses the problem that, in daily stress detection with wearables, physiological signals vary with individual differences and health conditions, hurting the generalization of machine learning models. The key to the solution is a technique called person-wise sub-network pruning intersection, which removes person-specific variation to extract features consistent across individuals, while continuous labels are used during training to prevent overfitting, improving accuracy on new people, new environments, and novel stress conditions.

Link: https://arxiv.org/abs/2506.02256
Authors: Yi Xiao, Harshit Sharma, Sawinder Kaur, Dessa Bergen-Cico, Asif Salekin
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Stress affects physical and mental health, and wearable devices have been widely used to detect daily stress through physiological signals. However, these signals vary due to factors such as individual differences and health conditions, making generalizing machine learning models difficult. To address these challenges, we present Human Heterogeneity Invariant Stress Sensing (HHISS), a domain generalization approach designed to find consistent patterns in stress signals by removing person-specific differences. This helps the model perform more accurately across new people, environments, and stress types not seen during training. Its novelty lies in proposing a novel technique called person-wise sub-network pruning intersection to focus on shared features across individuals, alongside preventing overfitting by leveraging continuous labels while training. The study focuses especially on people with opioid use disorder (OUD), a group where stress responses can change dramatically depending on the time of their daily medication. Since stress often triggers cravings, a model that can adapt well to these changes could support better OUD rehabilitation and recovery. We tested HHISS on seven different stress datasets, four of which we collected ourselves and three public ones. Four are from lab setups, one from a controlled real-world driving setting, and two are from real-world in-the-wild field datasets without any constraints. This is the first study to evaluate how well a stress detection model works across such a wide range of data. Results show HHISS consistently outperformed state-of-the-art baseline methods, proving both effective and practical for real-world use. Ablation studies, empirical justifications, and runtime evaluations confirm HHISS’s feasibility and scalability for mobile stress sensing in sensitive real-world applications.
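A minimal sketch of the intersection idea: build an importance-based sub-network mask per person, then keep only the weights that appear in every mask, since those plausibly encode person-invariant stress features. The threshold, the importance score, and the function names below are illustrative assumptions, not HHISS's actual procedure.

```python
import numpy as np

def person_mask(weight_importance, keep=0.5):
    """Binary mask keeping the top `keep` fraction of weights for one person."""
    thresh = np.quantile(weight_importance, 1 - keep)
    return weight_importance >= thresh

def pruning_intersection(importances_per_person, keep=0.5):
    """Intersect per-person sub-network masks: only weights important for
    everyone survive, suppressing person-specific pathways."""
    masks = [person_mask(imp, keep) for imp in importances_per_person]
    return np.logical_and.reduce(masks)

rng = np.random.default_rng(0)
n_people, n_weights = 5, 1000
importances = [rng.random(n_weights) for _ in range(n_people)]  # e.g., |weight * grad|
shared = pruning_intersection(importances, keep=0.6)
print(f"{shared.mean():.2%} of weights survive the intersection")
```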
zh

[AI-80] Improving LLM-Generated Code Quality with GRPO

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在代码生成过程中过于关注功能正确性而忽视代码可维护性、质量和安全性的问题。其解决方案的关键在于开发了一个全面的代码质量评估库,并将其作为GRPO训练中的奖励信号,从而提升生成代码的整体质量。

链接: https://arxiv.org/abs/2506.02211
作者: Maxime Robeyns,Laurence Aitchison
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are gaining widespread use for code generation. Recent training procedures use execution feedback as a reward signal, typically focusing on the functional correctness of the code, using unit test pass rate as a reward signal. However, this reward signal fails to capture notions of maintainability, quality and safety of the code produced. We address this under-explored area and develop a comprehensive library to quantify various aspects of code quality, and use it as a reward in GRPO. We find GRPO increases code quality according to this measure, which is confirmed by expert, blinded human annotators.
zh

[AI-81] Exchangeability in Neural Network Architectures and its Application to Dynamic Pruning

【速读】:该论文旨在解决神经网络(Neural Networks, NNs)在部署时因参数数量增加而导致的资源消耗过大的问题,特别是通过识别和消除对模型性能影响较小的冗余信息来提高效率。其解决方案的关键在于利用神经网络中参数和中间值的对称性,通过统计学中的可交换性(exchangeability)特性,识别计算过程中存在的重叠信息并进行动态剪枝。基于此,作者提出了一种名为ExPrune的系统性动态剪枝算法,能够在输入级别上移除由对称性引起的冗余,从而显著降低计算量(FLOPs),同时保持模型精度。

链接: https://arxiv.org/abs/2506.02210
作者: Pu (Luke) Yi,Tianlang Chen,Yifan Yang,Sara Achour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Neural networks (NNs) are equipped with increasingly many parameters and require more and more resources for deployment. Researchers have explored various ways to improve the efficiency of NNs by identifying and reducing the redundancy, such as pruning or quantizing unimportant weights. Symmetry in the NN architectures has been identified by prior work as a possible type of redundancy, but exploiting it for efficient inference is not yet explored. In this work, we formalize the symmetry of parameters and intermediate values in NNs using the statistical property of exchangeability. We identify that exchangeable values in NN computation may contain overlapping information, leading to redundancy. Exploiting the insight, we derive a principled general dynamic pruning algorithm ExPrune to remove symmetry-induced redundancy on a per-input basis. We also provide an instantiation of ExPrune that performs neuron-level dynamic pruning by predicting negative inputs to ReLU activations. We evaluate ExPrune on two computer vision models, one graph model and one language model. ExPrune provides 10.98–26.3% reduction in FLOPs with negligible accuracy drop and 21.01–39.05% reduction in FLOPs with at most 1% accuracy drop. We also demonstrate that ExPrune composes with static pruning. On models that have been aggressively pruned statically, ExPrune provides additional 10.24–11.11% reduction in FLOPs with negligible accuracy drop and 13.91–14.39% reduction in FLOPs with at most 1% accuracy drop.
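下面给出"预测 ReLU 负输入并跳过对应神经元"的一个极简 PyTorch 示意:用权重矩阵的截断 SVD 低成本地预估预激活符号,仅对预测为正的神经元做全精度计算。低秩预测器只是该思想的一种可能实例化,阈值与分解方式均为本文假设,并非 ExPrune 的官方实现:

```python
import torch

def relu_layer_with_sign_skip(x, W, b, U, V, thresh=0.0):
    approx = U @ (V @ x) + b                 # 低秩近似 U@V ≈ W,低成本预估预激活
    active = approx > thresh                 # 预测会通过 ReLU 的神经元
    out = torch.zeros_like(approx)
    out[active] = W[active] @ x + b[active]  # 仅对活跃神经元做全精度内积
    return torch.relu(out)

torch.manual_seed(0)
d_in, d_out, rank = 256, 128, 8
W, b = torch.randn(d_out, d_in), torch.randn(d_out)
U, S, Vt = torch.linalg.svd(W, full_matrices=False)
U, V = U[:, :rank] * S[:rank], Vt[:rank]     # 截断 SVD 作为符号预测器
print(relu_layer_with_sign_skip(torch.randn(d_in), W, b, U, V).shape)
```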
zh

[AI-82] Bregman Centroid Guided Cross-Entropy Method

【速读】:该论文旨在解决交叉熵方法(Cross-Entropy Method, CEM)在基于模型的强化学习(Model-Based Reinforcement Learning, MBRL)中因单峰采样策略导致的多模态景观下过早收敛问题。其解决方案的关键在于提出一种轻量级改进方法——Bregman Centroid Guided CEM($\mathcal{BC}$-EvoCEM),通过引入Bregman中心来实现信息的合理聚合与多样性控制,具体表现为计算性能加权的Bregman中心,并在该中心周围的信任区域内更新贡献较小的CEM工作者。

链接: https://arxiv.org/abs/2506.02205
作者: Yuliang Gu,Hongpeng Cao,Marco Caccamo,Naira Hovakimyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The Cross-Entropy Method (CEM) is a widely adopted trajectory optimizer in model-based reinforcement learning (MBRL), but its unimodal sampling strategy often leads to premature convergence in multimodal landscapes. In this work, we propose Bregman Centroid Guided CEM ($\mathcal{BC}$-EvoCEM), a lightweight enhancement to ensemble CEM that leverages Bregman centroids for principled information aggregation and diversity control. $\mathcal{BC}$-EvoCEM computes a performance-weighted Bregman centroid across CEM workers and updates the least contributing ones by sampling within a trust region around the centroid. Leveraging the duality between Bregman divergences and exponential family distributions, we show that $\mathcal{BC}$-EvoCEM integrates seamlessly into standard CEM pipelines with negligible overhead. Empirical results on synthetic benchmarks, a cluttered navigation task, and full MBRL pipelines demonstrate that $\mathcal{BC}$-EvoCEM enhances both convergence and solution quality, providing a simple yet effective upgrade for CEM.
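以下代码示意对角高斯情形下"性能加权 Bregman 中心"的一种常见取法:借助指数族与 Bregman 散度的对偶性,对各 worker 分布的期望参数 (μ, μ²+σ²) 做加权平均。权重构造与信任区域采样在此省略,细节均为本文的简化假设:

```python
import numpy as np

def bregman_centroid_gaussian(mus, sigmas, weights):
    """按权重平均期望参数 E[x]、E[x^2],再换回 (mu, sigma)。"""
    w = np.asarray(weights, float); w = w / w.sum()
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    m1 = (w[:, None] * mus).sum(0)                      # E[x]
    m2 = (w[:, None] * (mus**2 + sigmas**2)).sum(0)     # E[x^2]
    return m1, np.sqrt(np.maximum(m2 - m1**2, 1e-12))

# 三个 CEM worker 的采样分布,按回报加权求中心
mus = [[0.0, 1.0], [0.5, 0.8], [2.0, -1.0]]
sigmas = [[1.0, 1.0], [0.5, 0.7], [0.3, 0.3]]
returns = [1.0, 3.0, 0.2]
print(bregman_centroid_gaussian(mus, sigmas, returns))
```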
zh

[AI-83] Constrained Sliced Wasserstein Embedding

【速读】:该论文试图解决在使用Sliced Wasserstein (SW) 距离比较高维概率测度时,如何有效识别具有信息量的切片方向的问题。传统方法通常需要大量切片以获得良好的性能,从而增加了计算复杂度。解决方案的关键在于引入一种约束学习方法,通过限制一维运输计划以近似原始空间中的最优运输计划,确保切片方向的语义有效性,并利用这些运输计划的连续松弛,实现基于梯度的原始-对偶优化方法来训练切片参数及模型其他参数。

链接: https://arxiv.org/abs/2506.02203
作者: Navid NaderiAlizadeh,Darian Salehi,Xinran Liu,Soheil Kolouri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Sliced Wasserstein (SW) distances offer an efficient method for comparing high-dimensional probability measures by projecting them onto multiple 1-dimensional probability distributions. However, identifying informative slicing directions has proven challenging, often necessitating a large number of slices to achieve desirable performance and thereby increasing computational complexity. We introduce a constrained learning approach to optimize the slicing directions for SW distances. Specifically, we constrain the 1D transport plans to approximate the optimal plan in the original space, ensuring meaningful slicing directions. By leveraging continuous relaxations of these transport plans, we enable a gradient-based primal-dual approach to train the slicer parameters, alongside the remaining model parameters. We demonstrate how this constrained slicing approach can be applied to pool high-dimensional embeddings into fixed-length permutation-invariant representations. Numerical results on foundation models trained on images, point clouds, and protein sequences showcase the efficacy of the proposed constrained learning approach in learning more informative slicing directions. Our implementation code can be found at this https URL.
zh

[AI-84] Natural Artificial and Human Intelligences

【速读】:该论文试图解决的问题是:在现代人工智能聊天机器人日益接近人类语言能力的背景下,是否可以将它们视为具有智能,以及人类智能的独特性究竟体现在何处。论文的解决方案关键在于从心理学、动物智能、语言在科学与技术中的作用、人工智能的发展、智力测试的历史以及具身性(embodiment)对智能的影响等多个角度综合分析智能的本质,并指出人类智能的独特成就(如音乐交响乐或复杂科学理论)依赖于四个核心要素:发明、复杂推理能力、具身性和自我意识。其中,除复杂语言外,其他要素在非人类动物甚至当前聊天机器人中已部分具备,因此人类智能与非人类智能之间的差异并非质的区别。

链接: https://arxiv.org/abs/2506.02183
作者: Emmanuel M. Pothos,Dominic Widdows
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human achievement, whether in culture, science, or technology, is unparalleled in the known existence. This achievement is tied to the enormous communities of knowledge, made possible by (especially written) language: leaving theological content aside, it is very much true that “in the beginning was the word”. There lies the challenge regarding modern age chatbots: they can ‘do’ language apparently as well as ourselves and there is a natural question of whether they can be considered intelligent, in the same way as we are or otherwise. Are humans uniquely intelligent? We consider this question in terms of the psychological literature on intelligence, evidence for intelligence in non-human animals, the role of written language in science and technology, progress with artificial intelligence, the history of intelligence testing (for both humans and machines), and the role of embodiment in intelligence. For the most unique accomplishments of human intelligence (such as music symphonies or complex scientific theories), we think that, together with language, there are four essential ingredients, which can be summarised as invention, capacity for complex inference, embodiment, and self-awareness. This conclusion makes untenable the position that human intelligence differs qualitatively from that of many non-human animals, since, with the exception of complex language, all the other requirements are fulfilled. Regarding chatbots, the current limitations are localised to the lack of embodiment and (apparent) lack of awareness.
zh

[AI-85] Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在大型语言模型(Large Language Model, LLM)推理任务中因大规模采样提示(prompt)带来的显著计算开销问题。解决方案的关键在于通过在线轻量级预采样过滤算法GRESO(GRPO with Efficient Selective Rollout),利用奖励训练动态预测并跳过无信息提示,从而减少不必要的计算资源消耗,同时保持模型性能不下降。

链接: https://arxiv.org/abs/2506.02177
作者: Haizhong Zheng,Yang Zhou,Brian R. Bartoldson,Bhavya Kailkhura,Fan Lai,Jiawei Zhao,Beidi Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning, such as PPO and GRPO, has powered recent breakthroughs in LLM reasoning. Scaling rollout to sample more prompts enables models to selectively use higher-quality data for training, which can stabilize RL training and improve model performance. However, this comes at the cost of significant computational overhead. In this paper, we show that a substantial portion of this overhead can be avoided by skipping uninformative prompts before rollout. Our analysis of reward dynamics reveals a strong temporal consistency in prompt value: prompts that are uninformative in one epoch of training are likely to remain uninformative in future epochs. Based on these insights, we propose GRESO (GRPO with Efficient Selective Rollout), an online, lightweight pre-rollout filtering algorithm that predicts and skips uninformative prompts using reward training dynamics. By evaluating GRESO on a broad range of math reasoning benchmarks and models, such as Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B, we show that GRESO achieves up to 2.4x wall-clock time speedup in rollout and up to 2.0x speedup in total training time without accuracy degradation.
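GRESO"依据奖励动态跳过无信息提示"的核心判据可以用极简代码示意:若某提示最近若干个 epoch 的 rollout 奖励零方差(全对或全错),则其组内优势为零、不贡献梯度,下一轮直接跳过其 rollout。判据形式与超参均为本文假设,仅用于说明思路:

```python
from collections import defaultdict

class PromptFilter:
    def __init__(self, patience=2):
        self.history = defaultdict(list)   # prompt_id -> 各 epoch 是否"无信息"
        self.patience = patience

    def record(self, prompt_id, rewards):
        self.history[prompt_id].append(len(set(rewards)) <= 1)  # 奖励零方差

    def should_skip(self, prompt_id):
        h = self.history[prompt_id]   # 连续多个 epoch 无信息 -> 预测仍无信息
        return len(h) >= self.patience and all(h[-self.patience:])

f = PromptFilter()
f.record("p1", [1, 1, 1, 1]); f.record("p1", [1, 1, 1, 1])
f.record("p2", [0, 1, 0, 1])
print(f.should_skip("p1"), f.should_skip("p2"))  # True False
```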
zh

[AI-86] Reflection-Based Memory For Web navigation Agents

【速读】:该论文试图解决当前网络导航代理系统缺乏对过去经验的记忆,导致重复错误且无法从先前交互中学习的问题。解决方案的关键在于引入Reflection-Augment Planning (ReAP),通过自我反思利用成功和失败的过去经验,从而提升导航性能,整体基准结果提升了11分,此前失败任务的性能提升了29分。

链接: https://arxiv.org/abs/2506.02158
作者: Ruhana Azam,Aditya Vempaty,Ashish Jagmohan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Web navigation agents have made significant progress, yet current systems operate with no memory of past experiences – leading to repeated mistakes and an inability to learn from previous interactions. We introduce Reflection-Augment Planning (ReAP), a web navigation system to leverage both successful and failed past experiences using self-reflections. Our method improves baseline results by 11 points overall and 29 points on previously failed tasks. These findings demonstrate that reflections can transfer to different web navigation tasks.
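ReAP 的核心是把过往(成功与失败)任务的自我反思注入后续任务的提示。下面是一个最小化的记忆与提示构造示意,存储结构与检索策略均为本文假设,并非原系统实现:

```python
class ReflectionMemory:
    def __init__(self):
        self.entries = []  # (任务, 是否成功, 反思文本)

    def add(self, task, success, reflection):
        self.entries.append((task, success, reflection))

    def build_prompt(self, task, k=3):
        # 简单取最近 k 条反思;实际系统可按任务相似度检索
        lessons = "\n".join(
            f"- [{'成功' if ok else '失败'}] {t}: {r}"
            for t, ok, r in self.entries[-k:]
        )
        return f"过往经验:\n{lessons}\n\n当前任务:{task}\n请先规划再行动。"

mem = ReflectionMemory()
mem.add("提交表单", False, "应先等待页面加载完成,再点击提交按钮")
print(mem.build_prompt("在网站上注册账号"))
```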
zh

[AI-87] Z-Error Loss for Training Neural Networks

【速读】:该论文试图解决异常值(outlier)在神经网络训练过程中引入的显著挑战,这些异常值通过传播错误梯度导致模型性能和泛化能力下降。解决方案的关键在于提出一种统计学上有依据的损失函数——Z-Error Loss,该方法通过掩码处理在每个批次中被识别为分布外(out-of-distribution)的数据点的贡献,从而最小化异常值的影响。该方法利用批次级别的统计信息自动检测并排除异常样本,使模型能够专注于学习真实的数据结构。

链接: https://arxiv.org/abs/2506.02154
作者: Guillaume Godin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, A technical note

点击查看摘要

Abstract:Outliers introduce significant training challenges in neural networks by propagating erroneous gradients, which can degrade model performance and generalization. We propose the Z-Error Loss, a statistically principled approach that minimizes outlier influence during training by masking the contribution of data points identified as out-of-distribution within each batch. This method leverages batch-level statistics to automatically detect and exclude anomalous samples, allowing the model to focus its learning on the true underlying data structure. Our approach is robust, adaptive to data quality, and provides valuable diagnostics for data curation and cleaning.
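按摘要的描述,Z-Error Loss 利用批内统计量屏蔽分布外样本的梯度贡献。下面给出一个 PyTorch 极简示意,z 分数阈值与逐样本误差的定义均为本文假设,并非作者的官方实现:

```python
import torch
import torch.nn.functional as F

def z_error_loss(pred, target, z_thresh=2.5):
    per_sample = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
    z = (per_sample - per_sample.mean()) / (per_sample.std() + 1e-8)  # 批内 z 分数
    mask = (z.abs() < z_thresh).float().detach()      # 屏蔽离群样本的梯度贡献
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)

pred = torch.randn(32, 10, requires_grad=True)
target = torch.randn(32, 10)
target[0] += 100.0                                    # 人为注入一个离群样本
loss = z_error_loss(pred, target)
loss.backward()
print(loss.item(), pred.grad[0].abs().sum().item())   # 离群样本的梯度应为 0
```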
zh

[AI-88] Small Language Models are the Future of Agentic AI

【速读】:该论文试图解决当前在代理式人工智能(agentic AI)系统中过度依赖大型语言模型(Large Language Models, LLMs)所带来的效率与经济性问题,其核心观点是小型语言模型(Small Language Models, SLMs)在许多应用场景中更具优势。解决方案的关键在于论证SLMs在能力、适用性和经济性方面的优越性,并提出一种从LLMs到SLMs的代理转换算法,以实现更高效和低成本的AI系统部署。

链接: https://arxiv.org/abs/2506.02153
作者: Peter Belcak,Greg Heinrich,Shizhe Diao,Yonggan Fu,Xin Dong,Saurav Muralidharan,Yingyan Celine Lin,Pavlo Molchanov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are often praised for exhibiting near-human performance on a wide range of tasks and valued for their ability to hold a general conversation. The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialized tasks repetitively and with little variation. Here we lay out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. Our argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. We further argue that in situations where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. We discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm. Our position, formulated as a value statement, highlights the significance of the operational and economic impact even a partial shift from LLMs to SLMs is to have on the AI agent industry. We aim to stimulate the discussion on the effective use of AI resources and hope to advance the efforts to lower the costs of AI of the present day. Calling for both contributions to and critique of our position, we commit to publishing all such correspondence at this https URL.
zh

[AI-89] The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在少样本学习中表现出的普遍性与局限性的矛盾问题,即某些任务能够从少量示例中泛化,而其他任务则需要大量监督。其解决方案的关键在于提出统一认知意识理论(Unified Cognitive Consciousness Theory, UCCT),将LLMs视为无意识的基质,存储潜在的语言和概念模式,而非缺乏意识的代理。通过提示、角色和交互实现语义锚定,作为有意识的控制层,将潜在结构与任务相关意义绑定,从而实现连贯推理。UCCT的核心主张是AGI的实现并非通过抛弃LLMs,而是通过将其对齐并整合到能够共同推理、调控和适应的系统中。

链接: https://arxiv.org/abs/2506.02139
作者: Edward Y. Chang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, 1 table

点击查看摘要

Abstract:Few-shot learning in large language models (LLMs) reveals a deep paradox: Some tasks generalize from minimal examples, while others require extensive supervision. We address this through the Unified Cognitive Consciousness Theory (UCCT), which reframes LLMs not as incomplete agents, but as unconscious substrates, repositories of latent linguistic and conceptual patterns that operate without explicit semantics or goal-directed reasoning. In this view, LLMs are not broken approximations of cognition, but necessary and foundational components of general intelligence. Semantic anchoring, through prompts, roles, and interaction, acts as a conscious control layer, binding latent structure to task-relevant meaning and enabling coherent reasoning. UCCT offers a unifying account of prompting, fine-tuning, retrieval, and multi-agent coordination, all grounded in probabilistic alignment between unconscious representation and external control. To support this model, we present the Threshold-Crossing Dynamics Theorem, which formalizes semantic anchoring as a probabilistic phase transition. But the central claim remains architectural: AGI will not emerge by discarding LLMs, but by aligning and integrating them into systems that reason, regulate, and adapt together.
zh

[AI-90] Descriptive History Representations: Learning Representations by Answering Questions

【速读】:该论文旨在解决在部分可观测环境中有效决策的问题,即如何将长期交互历史压缩为具有信息量的表示。解决方案的关键在于引入描述性历史表示(Descriptive History Representations, DHRs),这些表示通过其回答过去交互和未来潜在结果相关问题的能力来定义,专注于捕捉任务相关查询所需的信息,从而为最优控制提供结构化的历史摘要。

链接: https://arxiv.org/abs/2506.02125
作者: Guy Tennenholtz,Jihwan Jeong,Chih-Wei Hsu,Yinlam Chow,Craig Boutilier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective decision making in partially observable environments requires compressing long interaction histories into informative representations. We introduce Descriptive History Representations (DHRs): sufficient statistics characterized by their capacity to answer relevant questions about past interactions and potential future outcomes. DHRs focus on capturing the information necessary to address task-relevant queries, providing a structured way to summarize a history for optimal control. We propose a multi-agent learning framework, involving representation, decision, and question-asking components, optimized using a joint objective that balances reward maximization with the representation’s ability to answer informative questions. This yields representations that capture the salient historical details and predictive structures needed for effective decision making. We validate our approach on user modeling tasks with public movie and shopping datasets, generating interpretable textual user profiles which serve as sufficient statistics for predicting preference-driven behavior of users.
zh

[AI-91] Random-key genetic algorithms

【速读】:该论文旨在解决离散和全局优化问题,其解决方案的关键在于随机键遗传算法(random-key genetic algorithm)这一进化元启发式方法。该算法将每个解编码为一个由N个随机键组成的向量,其中每个随机键是在连续区间[0,1)内随机生成的实数,从而将问题映射到单位超立方体中。解码器将这些随机键向量转换为实际优化问题的解并计算其成本,使得所有遗传操作和变换均可在单位超立方体内进行,从而提高了核心框架的生产力和可维护性。

链接: https://arxiv.org/abs/2506.02120
作者: Mariana A. Londe,Luciana S. Pessoa,Carlos E. Andrade,José F. Gonçalves,Mauricio G. C. Resende
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 21 pages, 1 figure, 1 table, 1 algorithm, forthcoming in Handbook of Heuristics, 2nd edition, SpringerNature, New York

点击查看摘要

Abstract:A random-key genetic algorithm is an evolutionary metaheuristic for discrete and global optimization. Each solution is encoded as a vector of N random keys, where a random key is a real number randomly generated in the continuous interval [0, 1). A decoder maps each vector of random keys to a solution of the optimization problem being solved and computes its cost. The benefit of this approach is that all genetic operators and transformations can be maintained within the unitary hypercube, regardless of the problem being addressed. This enhances the productivity and maintainability of the core framework. The algorithm starts with a population of P vectors of random keys. At each iteration, the vectors are partitioned into two sets: a smaller set of high-valued elite solutions and the remaining non-elite solutions. All elite elements are copied, without change, to the next population. A small number of random-key vectors (the mutants) is added to the population of the next iteration. The remaining elements of the population of the next iteration are generated by combining, with the parametrized uniform crossover of Spears and DeJong (1991), pairs of solutions. This chapter reviews random-key genetic algorithms and describes an effective variant called biased random-key genetic algorithms.
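下面给出有偏随机键遗传算法(BRKGA)主循环的极简 Python 实现示意,覆盖上文描述的各要素:随机键编码、解码器、精英原样保留、突变体注入,以及参数化均匀交叉(子代的每个键以概率 rho 继承精英父本,这正是"偏置"所在)。种群规模等超参为演示用假设:

```python
import numpy as np

def brkga(decode, n_keys, pop=50, elite=0.2, mutant=0.1, rho=0.7,
          iters=200, seed=0):
    """decode 把一个 [0,1) 随机键向量映射为解并返回其成本(此处做最小化)。"""
    rng = np.random.default_rng(seed)
    P = rng.random((pop, n_keys))
    n_e, n_m = int(pop * elite), int(pop * mutant)
    for _ in range(iters):
        order = np.argsort([decode(x) for x in P])
        E, NE = P[order[:n_e]], P[order[n_e:]]              # 精英 / 非精英
        kids = np.empty((pop - n_e - n_m, n_keys))
        for k in range(len(kids)):
            e, o = E[rng.integers(n_e)], NE[rng.integers(len(NE))]
            kids[k] = np.where(rng.random(n_keys) < rho, e, o)  # 偏向精英父本
        P = np.vstack([E, rng.random((n_m, n_keys)), kids])     # 精英保留+突变体
    costs = [decode(x) for x in P]
    return P[int(np.argmin(costs))], float(np.min(costs))

# 示例解码器:按随机键排序得到访问顺序,求玩具 TSP 的回路长度
pts = np.random.default_rng(1).random((12, 2))
def tour_cost(keys):
    perm = np.argsort(keys)
    return float(np.linalg.norm(pts[perm] - pts[np.roll(perm, 1)], axis=1).sum())

best_keys, best_cost = brkga(tour_cost, n_keys=len(pts))
print("best tour length:", round(best_cost, 3))
```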
zh

[AI-92] Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation NAACL2025

【速读】:该论文旨在解决企业级对话AI系统在处理多样化用户查询、高延迟、幻觉以及频繁更新的领域知识集成方面所面临的挑战。其解决方案的关键在于提出一种混合框架,将检索增强生成(RAG)与基于意图的预定义回复相结合,通过预定义的高置信度回复提高效率,并将复杂或模糊的查询动态路由至RAG流水线,同时利用对话上下文管理器确保多轮交互的一致性,并通过反馈循环持续优化意图识别、调整置信度阈值和扩展回复覆盖范围。

链接: https://arxiv.org/abs/2506.02097
作者: Priyaranjan Pattnayak,Amit Agarwal,Hansa Meghwani,Hitesh Laxmichand Patel,Srikant Panda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Proceedings of the 4th International Workshop on Knowledge Augmented Methods for Natural Language Processing in NAACL 2025, pages 215 to 229, Albuquerque, New Mexico, USA. Association for Computational Linguistics

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated domain-specific knowledge. This paper introduces a novel hybrid framework that integrates RAG with intent-based canned responses, leveraging predefined high-confidence responses for efficiency while dynamically routing complex or ambiguous queries to the RAG pipeline. Our framework employs a dialogue context manager to ensure coherence in multi-turn interactions and incorporates a feedback loop to refine intents, dynamically adjust confidence thresholds, and expand response coverage over time. Experimental results demonstrate that the proposed framework achieves a balance of high accuracy (95%) and low latency (180ms), outperforming RAG and intent-based systems across diverse query types, positioning it as a scalable and adaptive solution for enterprise conversational AI applications.
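该框架的动态路由骨架可以用下述 Python 片段示意:意图置信度高于阈值且命中预定义回复时走低延迟路径,否则回退到 RAG 流水线。阈值、意图分类器与 RAG 接口均为占位假设,对话上下文管理器与反馈回路在此省略:

```python
def route_query(query, classify_intent, canned, rag_pipeline, threshold=0.85):
    intent, confidence = classify_intent(query)
    if confidence >= threshold and intent in canned:
        return canned[intent], "canned"       # 高置信度意图:预定义回复,低延迟
    return rag_pipeline(query), "rag"         # 复杂/模糊查询:走 RAG 流水线

# 玩具组件,仅用于演示接口
canned = {"greeting": "您好!请问有什么可以帮您?"}
classify = lambda q: ("greeting", 0.95) if "你好" in q else ("unknown", 0.30)
rag = lambda q: f"[RAG] 基于检索到的文档回答:{q}"

print(route_query("你好", classify, canned, rag))
print(route_query("请解释退款政策的第3条", classify, canned, rag))
```

在真实系统中,反馈回路会根据用户反馈动态调整 threshold 并扩充 canned 的覆盖范围。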
zh

[AI-93] Towards Better Generalization and Interpretability in Unsupervised Concept-Based Models ECML-PKDD2025

【速读】:该论文旨在提升深度神经网络的可信度,核心问题在于增强对模型决策机制的理解。其解决方案的关键是提出一种新颖的无监督概念基础模型——可学习概念基础模型(Learnable Concept-Based Model, LCBM),该模型将概念建模为伯努利潜在空间中的随机变量,通过减少概念数量来实现性能不下降的同时提高可解释性,并通过局部线性组合保持模型的可解释性。

链接: https://arxiv.org/abs/2506.02092
作者: Francesco De Santis,Philippe Bich,Gabriele Ciravegna,Pietro Barbiero,Danilo Giordano,Tania Cerquitelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Paper accepted at ECML-PKDD 2025

点击查看摘要

Abstract:To increase the trustworthiness of deep neural networks, it is critical to improve the understanding of how they make decisions. This paper introduces a novel unsupervised concept-based model for image classification, named Learnable Concept-Based Model (LCBM) which models concepts as random variables within a Bernoulli latent space. Unlike traditional methods that either require extensive human supervision or suffer from limited scalability, our approach employs a reduced number of concepts without sacrificing performance. We demonstrate that LCBM surpasses existing unsupervised concept-based models in generalization capability and nearly matches the performance of black-box models. The proposed concept representation enhances information retention and aligns more closely with human understanding. A user study demonstrates the discovered concepts are also more intuitive for humans to interpret. Finally, despite the use of concept embeddings, we maintain model interpretability by means of a local linear combination of concepts.
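下述 PyTorch 骨架示意"概念作为 Bernoulli 潜变量 + 概念的线性组合做预测"这一结构,网络规模与训练期的松弛方式均为本文假设,并非论文官方实现:

```python
import torch
import torch.nn as nn

class LCBMSketch(nn.Module):
    def __init__(self, in_dim, n_concepts, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_concepts))
        self.head = nn.Linear(n_concepts, n_classes)  # 预测 = 概念的线性组合

    def forward(self, x, hard=False):
        probs = torch.sigmoid(self.encoder(x))        # 每个概念的 Bernoulli 参数
        # 训练时用概率作为可导松弛;推理时可二值化以便解释
        c = (probs > 0.5).float() if (hard and not self.training) else probs
        return self.head(c), probs

model = LCBMSketch(in_dim=32, n_concepts=10, n_classes=3)
logits, concepts = model(torch.randn(4, 32))
print(logits.shape, concepts.shape)  # [4, 3] 与 [4, 10](逐概念激活,可供解释)
```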
zh

[AI-94] he Impact of Software Testing with Quantum Optimization Meets Machine Learning

【速读】:该论文试图解决现代软件系统复杂性带来的测试效率挑战,特别是传统机器学习(Machine Learning, ML)在处理大规模测试用例集时的局限性。其解决方案的关键在于提出一种融合量子退火(Quantum Annealing)与ML的混合框架,通过量子优化技术提升测试用例优先级排序的效率,从而在Defects4J数据集上实现了25%的缺陷检测效率提升和30%的测试执行时间减少。该框架还具备对量子硬件限制的适应性、CI/CD集成能力以及面向2025年混合量子-经典生态系统的可扩展性。

链接: https://arxiv.org/abs/2506.02090
作者: Gopichand Bandarupalli
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages

点击查看摘要

Abstract:The complexity of modern software systems challenges efficient testing, as traditional machine learning (ML) struggles with large test suites. This research presents a hybrid framework integrating Quantum Annealing with ML to optimize test case prioritization in CI/CD pipelines. Leveraging quantum optimization, it achieves a 25 percent increase in defect detection efficiency and a 30 percent reduction in test execution time versus classical ML, validated on the Defects4J dataset. A simulated CI/CD environment demonstrates robustness across evolving codebases. Visualizations, including defect heatmaps and performance graphs, enhance interpretability. The framework addresses quantum hardware limits, CI/CD integration, and scalability for 2025’s hybrid quantum-classical ecosystems, offering a transformative approach to software quality assurance.
zh

[AI-95] SALAD: Systematic Assessment of Machine Unlearning on LLM-Aided Hardware Design

【速读】:该论文试图解决生成式 AI(Generative AI)在硬件设计自动化中应用时面临的数据安全问题,包括 Verilog 评估数据污染、知识产权(IP)设计泄露以及恶意 Verilog 生成的风险。论文提出的解决方案关键在于引入 SALAD,一种基于机器遗忘(machine unlearning)的综合评估方法,能够选择性地从预训练模型中移除受污染的基准测试数据、敏感 IP 和设计制品或恶意代码模式,而无需进行完整的模型重训练。

链接: https://arxiv.org/abs/2506.02089
作者: Zeng Wang,Minghao Shao,Rupesh Karn,Jitendra Bhandari,Likhitha Mankali,Ramesh Karri,Ozgur Sinanoglu,Muhammad Shafique,Johann Knechtel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer transformative capabilities for hardware design automation, particularly in Verilog code generation. However, they also pose significant data security challenges, including Verilog evaluation data contamination, intellectual property (IP) design leakage, and the risk of malicious Verilog generation. We introduce SALAD, a comprehensive assessment that leverages machine unlearning to mitigate these threats. Our approach enables the selective removal of contaminated benchmarks, sensitive IP and design artifacts, or malicious code patterns from pre-trained LLMs, all without requiring full retraining. Through detailed case studies, we demonstrate how machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design.
zh

[AI-96] LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention INTERSPEECH2025

【速读】:该论文试图解决多语言环境下说话人识别模型中语言信息与说话人特征嵌入纠缠的问题(speaker recognition models face challenges in multi-lingual settings due to the entanglement of linguistic information within speaker embeddings)。解决方案的关键在于提出一种新的解耦学习策略,该策略通过前缀调优的交叉注意力机制实现联合学习,从而有效分离语言信息与说话人特征,提升在多种语言环境下的识别准确率。

链接: https://arxiv.org/abs/2506.02083
作者: Aditya Srinivas Menon,Raj Prakash Gohil,Kumud Tripathi,Pankaj Wasnik
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted at Interspeech 2025, Netherlands

点击查看摘要

Abstract:Speaker recognition models face challenges in multi-lingual settings due to the entanglement of linguistic information within speaker embeddings. The overlap between vocal traits such as accent, vocal anatomy, and a language’s phonetic structure complicates separating linguistic and speaker information. Disentangling these components can significantly improve speaker recognition accuracy. To this end, we propose a novel disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention. This approach is particularly effective when speakers switch between languages. Experimental results show the model generalizes across monolingual and multi-lingual settings, including unseen languages. Notably, the proposed model improves the equal error rate across multiple datasets, highlighting its ability to separate language information from speaker embeddings and enhance recognition in diverse linguistic conditions.
zh

[AI-97] SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction

【速读】:该论文试图解决语音质量评估中客观指标无法有效选择最佳文本转语音合成(TTS)或语音转换模型,而主观指标如平均意见得分(Mean Opinion Score, MOS)虽然可靠但耗时且需要大量人工参与的问题。解决方案的关键在于提出一种新型模型——Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS),该模型具有小规模、端到端、高度泛化和可扩展性,能够基于音频样本的卷积序列提取潜在特征,并通过均方误差(MSE)、线性一致性相关系数(LCC)、斯皮尔曼等级相关系数(SRCC)和肯德尔等级相关系数(KTAU)实现对MOS分数的准确预测。

链接: https://arxiv.org/abs/2506.02082
作者: Saurabh Agrawal,Raj Gohil,Gopal Kumar Agrawal,Vikram C M,Kushal Verma
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speech quality assessment is a critical process in selecting text-to-speech synthesis (TTS) or voice conversion models. Evaluation of voice synthesis can be done using objective or subjective metrics. Although there are many objective metrics, such as the Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA) and Short-Time Objective Intelligibility (STOI), none of them is feasible for selecting the best model. On the other hand, a subjective metric like the Mean Opinion Score is highly reliable, but it requires a lot of manual effort and is time-consuming. To counter the issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small-sized, end-to-end, highly generalized and scalable model for predicting MOS scores on a scale of 5. We stack sequences of convolutions to extract latent features from the audio samples, achieving state-of-the-art results based on mean squared error (MSE), Linear Concordance Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC) and Kendall Rank Correlation Coefficient (KTAU).
zh

[AI-98] RATFM: Retrieval-augmented Time Series Foundation Model for Anomaly Detection

【速读】:该论文旨在解决时间序列基础模型在不同领域和任务中性能表现不一致的问题,尤其是在异常检测任务中,模型缺乏有效利用示例或指令的能力。其解决方案的关键在于提出一种检索增强的时间序列基础模型(RATFM),该模型通过引入测试时适应的示例,使预训练的时间序列基础模型能够更好地泛化到目标领域,从而在不进行领域依赖的微调情况下实现与领域内微调相当的性能。

链接: https://arxiv.org/abs/2506.02081
作者: Chihiro Maru,Shoetsu Sato
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by the success of large language models (LLMs) in natural language processing, recent research has explored the building of time series foundation models and applied them to tasks such as forecasting, classification, and anomaly detection. However, their performances vary between different domains and tasks. In LLM-based approaches, test-time adaptation using example-based prompting has become common, owing to the high cost of retraining. In the context of anomaly detection, which is the focus of this study, providing normal examples from the target domain can also be effective. However, time series foundation models do not naturally acquire the ability to interpret or utilize examples or instructions, because the nature of time series data used during training does not encourage such capabilities. To address this limitation, we propose a retrieval augmented time series foundation model (RATFM), which enables pretrained time series foundation models to incorporate examples of test-time adaptation. We show that RATFM achieves a performance comparable to that of in-domain fine-tuning while avoiding domain-dependent fine-tuning. Experiments on the UCR Anomaly Archive, a multi-domain dataset including nine domains, confirms the effectiveness of the proposed approach.
zh

[AI-99] Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability

【速读】:该论文试图解决现有基准测试未能涵盖基于流程图的代码生成问题,从而限制了相关研究的发展。其解决方案的关键在于提出Flow2Code,这是一个新型的基准测试集,涵盖了15种编程语言,并包含5,622个代码片段与16,866张三种类型(代码、UML和伪代码)的流程图,用于评估基于流程图的代码生成能力。

链接: https://arxiv.org/abs/2506.02073
作者: Mengliang He,Jiayi Zeng,Yankai Jiang,Wei Zhang,Zeming Liu,Xiaoming Shi,Aimin Zhou
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) show promise in code generation, existing benchmarks neglect the flowchart-based code generation. To promote further research on flowchart-based code generation, this work presents Flow2Code, a novel benchmark for flowchart-based code generation evaluation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current LLMs can not generate code based on flowcharts perfectly. Besides, experiment results show that the supervised fine-tuning technique contributes greatly to the models’ performance. We publicly release our code and datasets at this https URL.
zh

[AI-100] AI Data Development: A Scorecard for the System Card Framework

【速读】:该论文试图解决人工智能系统中数据集质量对可靠性影响的问题,特别是针对数据集的透明性、可问责性和潜在偏见的持续担忧。解决方案的关键在于提出一个评分卡工具,用于评估AI数据集的开发过程,其核心聚焦于数据字典、采集过程、组成、动机和预处理五个关键领域,通过结构化的方法结合录入表和评分标准来评估数据集的质量与完整性。

链接: https://arxiv.org/abs/2506.02071
作者: Tadesse K. Bahiru,Haileleol Tibebu,Ioannis A. Kakadiaris
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Artificial intelligence has transformed numerous industries, from healthcare to finance, enhancing decision-making through automated systems. However, the reliability of these systems is mainly dependent on the quality of the underlying datasets, raising ongoing concerns about transparency, accountability, and potential biases. This paper introduces a scorecard designed to evaluate the development of AI datasets, focusing on five key areas from the system card framework data development life cycle: data dictionary, collection process, composition, motivation, and pre-processing. The method follows a structured approach, using an intake form and scoring criteria to assess the quality and completeness of the data set. Applied to four diverse datasets, the methodology reveals strengths and improvement areas. The results are compiled using a scoring system that provides tailored recommendations to enhance the transparency and integrity of the data set. The scorecard addresses technical and ethical aspects, offering a holistic evaluation of data practices. This approach aims to improve the quality of the data set. It offers practical guidance to curators and researchers in developing responsible AI systems, ensuring fairness and accountability in decision support systems.
zh

[AI-101] Predicting Blood Type: Assessing Model Performance with ROC Analysis

【速读】:该论文试图解决指纹图案与ABO血型分类之间是否存在潜在关联的问题,以探索两者在个人识别中的可能联系。研究的关键在于通过统计分析方法(如卡方检验和皮尔逊相关性检验)评估指纹模式与血型之间的关系,并发现二者之间无显著相关性(p > 0.05),表明这些特征是独立的。研究结果强调了未来需要在更大规模和更多样化的人群中进行研究,并结合机器学习方法和多模态生物特征信号整合,以提升个人识别的准确性和可靠性。

链接: https://arxiv.org/abs/2506.02062
作者: Malik A. Altayar,Muhyeeddin Alqaraleh,Mowafaq Salem Alzboon,Wesam T. Almagharbeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Introduction: Personal identification is a critical aspect of forensic sciences, security, and healthcare. While conventional biometrics systems such as DNA profiling and iris scanning offer high accuracy, they are time-consuming and costly. Objectives: This study investigates the relationship between fingerprint patterns and ABO blood group classification to explore potential correlations between these two traits. Methods: The study analyzed 200 individuals, categorizing their fingerprints into three types: loops, whorls, and arches. Blood group classification was also recorded. Statistical analysis, including chi-square and Pearson correlation tests, was used to assess associations between fingerprint patterns and blood groups. Results: Loops were the most common fingerprint pattern, while blood group O+ was the most prevalent among the participants. Statistical analysis revealed no significant correlation between fingerprint patterns and blood groups (p > 0.05), suggesting that these traits are independent. Conclusions: Although the study showed limited correlation between fingerprint patterns and ABO blood groups, it highlights the importance of future research using larger and more diverse populations, incorporating machine learning approaches, and integrating multiple biometric signals. This study contributes to forensic science by emphasizing the need for rigorous protocols and comprehensive investigations in personal identification.
zh

[AI-102] Will Agents Replace Us? Perceptions of Autonomous Multi-Agent AI

【速读】:该论文试图解决当前专业人员对自主多智能体人工智能系统(Autonomous Multi-Agent AI Systems)的感知问题,旨在识别其在软件开发和知识工作领域的潜在采纳挑战、伦理考量及未来劳动力发展需求。研究通过分析130名参与者的调查数据,探讨了AI代理的能力、影响与治理,并识别了部署障碍及责任归属观念。其解决方案的关键在于揭示影响AI代理部署决策的复杂因素,并强调组织需应对合规性问题并建立明确的治理框架,以促进自主代理在工作流程中的有效整合。

链接: https://arxiv.org/abs/2506.02055
作者: Nikola Balic
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 15 pages, 5 figures, code available at this https URL

点击查看摘要

Abstract:Autonomous multi-agent AI systems are poised to transform various industries, particularly software development and knowledge work. Understanding current perceptions among professionals is crucial for anticipating adoption challenges, ethical considerations, and future workforce development. This study analyzes responses from 130 participants to a survey on the capabilities, impact, and governance of AI agents. We explore expected timelines for AI replacing programmers, identify perceived barriers to deployment, and examine beliefs about responsibility when agents make critical decisions. Key findings reveal three distinct clusters of respondents. While the study explored factors associated with current AI agent deployment, the initial logistic regression model did not yield statistically significant predictors, suggesting that deployment decisions are complex and may be influenced by factors not fully captured or that a larger sample is needed. These insights highlight the need for organizations to address compliance concerns (a commonly cited barrier) and establish clear governance frameworks as they integrate autonomous agents into their workflows.
zh

[AI-103] Generalization Performance of Ensemble Clustering: From Theory to Algorithm

【速读】:该论文旨在解决集成聚类(ensemble clustering)的泛化性能问题,具体包括泛化误差、超额风险和一致性分析。其解决方案的关键在于推导出泛化误差界和超额风险界的收敛速率,并证明当样本数 $ n $ 和基聚类数 $ m $ 趋于无穷大且 $ m $ 显著大于 $ \log n $ 时,集成聚类是稳定的。此外,通过为有限的基聚类分配不同权重,最小化经验平均聚类与其期望之间的误差,理论证明了提升聚类性能需减少基聚类的偏差并增加其多样性,而最大化多样性近似等价于鲁棒的(min-max)优化模型。

链接: https://arxiv.org/abs/2506.02053
作者: Xu Zhang,Haoye Qiu,Weixuan Liang,Hui Liu,Junhui Hou,Yuheng Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive a convergence rate of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$ for both the generalization error bound and the excess risk bound, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than $\log n$, i.e., $m,n\to\infty$, $m\gg\log n$, ensemble clustering is consistent. Furthermore, recognizing that $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to finite clusterings, we minimize the error between the empirical average clusterings and their expectation. From this, we theoretically demonstrate that to achieve better clustering performance, we should minimize the deviation (bias) of base clustering from its expectation and maximize the differences (diversity) among various base clusterings. Additionally, we derive that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at this https URL.
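摘要中"给基聚类赋权"的做法可以用一个小例子示意:把每个基聚类的共协(co-association)矩阵按权重加权求和,再对合成相似度做层次聚类得到最终划分。权重在论文中由最小化偏差、最大化多样性的 min-max 模型学得,这里为演示而手工给定:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassoc(labels):
    """共协矩阵:样本 i、j 在该基聚类中同簇记 1,否则 0。"""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

base = [np.array([0, 0, 1, 1, 2, 2]),
        np.array([0, 0, 0, 1, 1, 1]),
        np.array([0, 1, 0, 1, 0, 1])]        # 第三个基聚类质量明显更差
w = np.array([0.6, 0.3, 0.1])                 # 示意权重:差的基聚类权重低
C = sum(wi * coassoc(l) for wi, l in zip(w / w.sum(), base))
D = squareform(1.0 - C, checks=False)         # 加权共协相似度 -> 距离
print(fcluster(linkage(D, method="average"), t=3, criterion="maxclust"))
# 期望输出与 [1 1 2 2 3 3] 同构的三簇划分
```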
zh

[AI-104] Decoupled Hierarchical Reinforcement Learning with State Abstraction for Discrete Grids

【速读】:该论文旨在解决在复杂离散状态空间环境中,特别是在部分可观测条件下,强化学习(Reinforcement Learning, RL)中有效智能体探索的核心挑战。其解决方案的关键在于提出一种解耦的分层强化学习框架(Decoupled Hierarchical RL with State Abstraction, DcHRL-SA),该框架采用双层结构,包括基于强化学习的高层执行器和基于规则的低层策略,以促进有效的探索,并结合状态抽象方法对离散状态进行聚类,从而降低状态维度。

链接: https://arxiv.org/abs/2506.02050
作者: Qingyu Xiao,Yuanlin Chang,Youtian Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:Effective agent exploration remains a core challenge in reinforcement learning (RL) for complex discrete state-space environments, particularly under partial observability. This paper presents a decoupled hierarchical RL framework integrating state abstraction (DcHRL-SA) to address this issue. The proposed method employs a dual-level architecture, consisting of a high level RL-based actor and a low-level rule-based policy, to promote effective exploration. Additionally, state abstraction method is incorporated to cluster discrete states, effectively lowering state dimensionality. Experiments conducted in two discrete customized grid environments demonstrate that the proposed approach consistently outperforms PPO in terms of exploration efficiency, convergence speed, cumulative reward, and policy stability. These results demonstrate a practical approach for integrating decoupled hierarchical policies and state abstraction in discrete grids with large-scale exploration space. Code will be available at this https URL.
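其中的状态抽象环节可以用 scikit-learn 的 KMeans 做一个简单示意:对离散网格状态构造特征并聚类,使高层 RL actor 只需面对少量抽象簇而非全部原始状态。特征构造与簇数均为本文假设,并非论文原始设置:

```python
import numpy as np
from sklearn.cluster import KMeans

H, W = 12, 12
states = np.array([(r, c) for r in range(H) for c in range(W)], dtype=float)
goal = np.array([11.0, 11.0])
feats = np.hstack([states / [H, W],                        # 归一化坐标
                   np.linalg.norm(states - goal, axis=1, keepdims=True)])
abstract = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(feats)

def to_abstract(r, c):
    """高层策略观察到的是抽象簇编号,而非 144 个原始离散状态。"""
    return int(abstract[r * W + c])

print(to_abstract(0, 0), to_abstract(11, 10))
```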
zh

[AI-105] EvoGit: Decentralized Code Evolution via Git-Based Multi-Agent Collaboration

【速读】:该论文试图解决传统软件开发中集中式协作与自动化程度不足的问题,旨在实现一种去中心化的、自主演化的协同软件开发框架。解决方案的关键在于引入EvoGit,它通过部署一组独立的编码代理(coding agents),在无中心协调、显式消息传递或共享内存的情况下,利用基于Git的系统进化图(phylogenetic graph)进行异步协作,从而实现代码库的持续演化与版本追踪。该结构支持细粒度分支、隐式并发和可扩展的代理交互,同时保持一致的历史记录,使人类用户的参与更加战略化和轻量级。

链接: https://arxiv.org/abs/2506.02049
作者: Beichen Huang,Ran Cheng,Kay Chen Tan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We introduce EvoGit, a decentralized multi-agent framework for collaborative software development driven by autonomous code evolution. EvoGit deploys a population of independent coding agents, each proposing edits to a shared codebase without centralized coordination, explicit message passing, or shared memory. Instead, all coordination emerges through a Git-based phylogenetic graph that tracks the full version lineage and enables agents to asynchronously read from and write to the evolving code repository. This graph-based structure supports fine-grained branching, implicit concurrency, and scalable agent interaction while preserving a consistent historical record. Human involvement is minimal but strategic: users define high-level goals, periodically review the graph, and provide lightweight feedback to promote promising directions or prune unproductive ones. Experiments demonstrate EvoGit’s ability to autonomously produce functional and modular software artifacts across two real-world tasks: (1) building a web application from scratch using modern frameworks, and (2) constructing a meta-level system that evolves its own language-model-guided solver for the bin-packing optimization problem. Our results underscore EvoGit’s potential to establish a new paradigm for decentralized, automated, and continual software development. EvoGit is open-sourced at this https URL.
zh

[AI-106] Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在网络安全应用中进行结构化推理和工具辅助计算时存在的不足。其关键解决方案是引入“random-crypto”框架,用于生成加密类CTF挑战,并通过Guided Reinforcement Prompt Optimisation (GRPO) 方法对工具增强的Llama-3.1-8B模型进行微调,使代理能够迭代地编写并执行隔离环境中的Python代码。此方法显著提升了模型在未见过的“random-crypto”任务上的表现,Pass@8指标提升了53%,同时增强了模型的泛化能力。

链接: https://arxiv.org/abs/2506.02048
作者: Lajos Muzsai,David Imolai,András Lukács
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) still struggle with the structured reasoning and tool-assisted computation needed for problem solving in cybersecurity applications. In this work, we introduce “random-crypto”, a cryptographic Capture-the-Flag (CTF) challenge generator framework that we use to fine-tune a tool-augmented Llama-3.1-8B with Guided Reinforcement Prompt Optimisation (GRPO), allowing the agent to iteratively write and execute Python inside an isolated REPL. GRPO yields a +53% absolute jump in Pass@8 on unseen “random-crypto” tasks (0.35 - 0.88) and raises Majority@8 to 0.41. The fine-tuned agent also generalizes to an external dataset. On a subset of picoCTF cryptography problems, it improves Pass@8 by +13 pp. Ablations show the gains stem from more reliable tool invocation and code synthesis, rather than superficial prompt adaptation.
zh

[AI-107] Machine vs Machine: Using AI to Tackle Generative AI Threats in Assessment

【速读】:该论文试图解决生成式AI(Generative AI)在高等教育评估中带来的挑战,特别是传统评估方法面临被AI生成内容替代的威胁。其解决方案的关键在于提出一种结合静态分析与动态测试的双策略范式,以构建全面的评估脆弱性评价理论框架。静态分析部分包含八个理论依据充分的要素,旨在通过识别生成式AI的能力局限来建立区分真实人类学习与AI模拟的障碍;动态测试则通过基于仿真的脆弱性评估补充模式分析的不足,从而形成系统的评估体系。

链接: https://arxiv.org/abs/2506.02046
作者: Mohammad Saleh Torkestani,Taha Mansouri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Paper presented at the Learning, Teaching Student Experience 2025 Conference. The Chartered Association of Business Schools (CABS), Nottingham, UK

点击查看摘要

Abstract:This paper presents a theoretical framework for addressing the challenges posed by generative artificial intelligence (AI) in higher education assessment through a machine-versus-machine approach. As large language models like GPT-4, Claude, and Llama increasingly demonstrate the ability to produce sophisticated academic content, traditional assessment methods face an existential threat, with surveys indicating 74-92% of students experimenting with these tools for academic purposes. Current responses, ranging from detection software to manual assessment redesign, show significant limitations: detection tools demonstrate bias against non-native English writers and can be easily circumvented, while manual frameworks rely heavily on subjective judgment and assume static AI capabilities. This paper introduces a dual strategy paradigm combining static analysis and dynamic testing to create a comprehensive theoretical framework for assessment vulnerability evaluation. The static analysis component comprises eight theoretically justified elements: specificity and contextualization, temporal relevance, process visibility requirements, personalization elements, resource accessibility, multimodal integration, ethical reasoning requirements, and collaborative elements. Each element addresses specific limitations in generative AI capabilities, creating barriers that distinguish authentic human learning from AI-generated simulation. The dynamic testing component provides a complementary approach through simulation-based vulnerability assessment, addressing limitations in pattern-based analysis. The paper presents a theoretical framework for vulnerability scoring, including the conceptual basis for quantitative assessment, weighting frameworks, and threshold determination theory.
zh

[AI-108] Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges

【速读】:该论文试图解决MLOps(Machine Learning Operations)生态系统在快速发展中所面临的安全部署问题,特别是针对其统一性所带来的脆弱性,如恶意攻击可能导致的凭证泄露、财务损失、公众信任损害以及训练数据污染等问题。解决方案的关键在于系统性地应用MITRE ATLAS框架,对MLOps不同阶段的攻击进行分类与评估,并构建与之对应的防御策略,从而提供早期阶段的安全防护措施,增强MLOps生态系统的安全性。

链接: https://arxiv.org/abs/2506.02032
作者: Raj Patel,Himanshu Tripathi,Jasper Stone,Noorbakhsh Amiri Golilarz,Sudip Mittal,Shahram Rahimi,Vini Chaudhary
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid adoption of machine learning (ML) technologies has driven organizations across diverse sectors to seek efficient and reliable methods to accelerate model development-to-deployment. Machine Learning Operations (MLOps) has emerged as an integrative approach addressing these requirements by unifying relevant roles and streamlining ML workflows. As the MLOps market continues to grow, securing these pipelines has become increasingly critical. However, the unified nature of MLOps ecosystem introduces vulnerabilities, making them susceptible to adversarial attacks where a single misconfiguration can lead to compromised credentials, severe financial losses, damaged public trust, and the poisoning of training data. Our paper presents a systematic application of the MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework, a comprehensive and continuously updated catalog of AI-focused attacks, to systematically assess attacks across different phases of the MLOps ecosystem. We begin by examining the preparatory phases during which adversaries acquire the essential intelligence required to initiate their attacks. We then present a structured taxonomy of attack techniques explicitly mapped to corresponding phases of the MLOps ecosystem, supported by examples drawn from red-teaming exercises and real-world incidents. This is followed by a taxonomy of mitigation strategies aligned with these attack categories, offering actionable early-stage defenses to strengthen the security of MLOps ecosystem. Given the rapid evolution and adoption of MLOps, we further highlight key research gaps that require immediate attention. Our work emphasizes the importance of implementing robust security protocols from the outset, empowering practitioners to safeguard MLOps ecosystem against evolving cyber attacks.
zh

[AI-109] The End Of Universal Lifelong Identifiers: Identity Systems For The AI Era

【速读】:该论文试图解决传统统一终身标识符(Universal Lifelong Identifiers, ULIs)在人工智能时代带来的系统性隐私风险问题。其核心观点是ULIs与AI时代的隐私需求不兼容,必须逐步淘汰。解决方案的关键在于提出一种基于密码学的框架,该框架满足AI时代身份系统的核心属性,同时保持与现有标识符工作流的兼容性,从而实现对ULIs的可行迁移。

链接: https://arxiv.org/abs/2506.02027
作者: Shriphani Palakodety
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 content pages, 14 pages with reference

点击查看摘要

Abstract:Many identity systems assign a single, static identifier to an individual for life, reused across domains like healthcare, finance, and education. These Universal Lifelong Identifiers (ULIs) underpin critical workflows but now pose systemic privacy risks. We take the position that ULIs are fundamentally incompatible with the AI era and must be phased out. We articulate a threat model grounded in modern AI capabilities and show that traditional safeguards such as redaction, consent, and access controls are no longer sufficient. We define core properties for identity systems in the AI era and present a cryptographic framework that satisfies them while retaining compatibility with existing identifier workflows. Our design preserves institutional workflows, supports essential functions such as auditability and delegation, and offers a practical migration path beyond ULIs.
zh

[AI-110] Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling

【速读】:该论文旨在解决高性能计算(High-Performance Computing, HPC)任务调度中多目标平衡的问题,包括最小化完成时间(makespan)、减少等待时间、优化资源利用和确保公平性。传统方法如基于启发式的调度策略或复杂的优化技术在动态工作负载和异构HPC系统中表现出适应性不足的缺陷。论文提出的解决方案是基于大型语言模型(Large Language Model, LLM)的调度器,采用ReAct框架(Reason + Act)实现可解释的迭代决策过程,其关键在于通过自然语言反馈和scratchpad记忆机制优化调度决策,并结合约束执行模块确保调度的可行性和安全性。

链接: https://arxiv.org/abs/2506.02025
作者: Prachi Jadhav,Hongwei Jin,Ewa Deelman,Prasanna Balaprakash
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, work under review

点击查看摘要

Abstract:High-Performance Computing (HPC) job scheduling involves balancing conflicting objectives such as minimizing makespan, reducing wait times, optimizing resource use, and ensuring fairness. Traditional methods, including heuristic-based (e.g., First-Come-First-Served) or intensive optimization techniques, often lack adaptability to dynamic workloads and heterogeneous HPC systems. To address this, we propose a novel Large Language Model (LLM)-based scheduler using a ReAct-style framework (Reason + Act), enabling iterative, interpretable decision-making. The system incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback, while a constraint enforcement module ensures feasibility and safety. We evaluate our approach using OpenAI’s O4-Mini and Anthropic’s Claude 3.7 across seven real-world HPC workload scenarios, including heterogeneous mixes, bursty patterns, and adversarial cases. Comparisons against FCFS, Shortest Job First, and Google OR-Tools (on 10 to 100 jobs) reveal that LLM-based scheduling effectively balances multiple objectives while offering transparent reasoning through natural language traces. The method excels in constraint satisfaction and adapts to diverse workloads without domain-specific training. However, a trade-off between reasoning quality and computational overhead challenges real-time deployment. This work presents the first comprehensive study of reasoning-capable LLMs for HPC scheduling, demonstrating their potential to handle multiobjective optimization while highlighting limitations in computational efficiency. The findings provide insights into leveraging advanced language models for complex scheduling problems in dynamic HPC environments.
zh

[AI-111] eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems

【速读】:该论文旨在解决大规模人工智能/机器学习(AI/ML)系统中性能监控与故障诊断的挑战,特别是在多节点分布式训练场景下,如何高效、非侵入式地收集和分析系统性能数据。其解决方案的关键在于提出eACGM框架,该框架基于eBPF技术,无需代码 instrumentation 或修改即可实时采集GPU、网络通信层以及CUDA、Python、PyTorch等软件栈的性能数据,并结合libnvml获取进程级GPU资源使用信息,通过高斯混合模型(GMM)进行多维性能指标的统计建模与聚类分析,从而有效识别复杂的故障模式,实现系统瓶颈和异常行为的快速定位。

链接: https://arxiv.org/abs/2506.02007
作者: Ruilin Xu,Zongxuan Xie,Pengfei Chen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: IWQoS 2025

点击查看摘要

Abstract:We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors. To evaluate eACGM’s effectiveness and practicality, we conducted extensive empirical studies and case analyses in multi-node distributed training scenarios. The results demonstrate that eACGM, while maintaining a non-intrusive and low-overhead profile, successfully captures critical performance anomalies during model training and inference. Its stable anomaly detection performance and comprehensive monitoring capabilities validate its applicability and scalability in real-world production environments, providing strong support for performance optimization and fault diagnosis in large-scale AI/ML systems.
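eACGM 的 GMM 异常识别环节可以用下面的小例子示意:对多维性能指标拟合高斯混合模型,将对数似然最低的样本视为异常。指标含义、混合成分数与判定方式均为演示用假设:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# 三维指标示意:GPU 利用率 / 通信延迟(ms) / 吞吐
normal = rng.normal([50, 5, 0.8], [5, 1, 0.05], size=(500, 3))
anomaly = np.array([[95.0, 40.0, 0.2]])       # 一次通信异常的指标快照
X = np.vstack([normal, anomaly])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
scores = gmm.score_samples(X)                 # 每个样本的对数似然
print("最异常的样本索引:", int(np.argmin(scores)))  # 期望为 500(注入的异常)
```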
zh

[AI-112] Coded Robust Aggregation for Distributed Learning under Byzantine Attacks

【速读】:该论文旨在解决在存在拜占庭攻击(Byzantine attacks)情况下的分布式学习(Distributed Learning, DL)问题。现有方法在服务器端应用鲁棒有界聚合(Robust Bounded Aggregation, RBA)规则来缓解拜占庭攻击的影响,但当不同设备的本地梯度差异较大时,学习性能会显著下降。论文提出的解决方案关键在于采用基于编码鲁棒聚合的分布式学习方法(Coded Robust Aggregation for Distributed Learning, CRA-DL),在训练开始前将数据冗余地分配到各设备,并在每轮训练中让诚实设备发送由分配数据计算得到的编码梯度,服务器则使用RBA规则对来自诚实和拜占庭设备的信息进行聚合,从而近似恢复全局梯度以更新全局模型。CRA-DL 的优势源于诚实设备发送的编码梯度彼此更加接近,从而增强了聚合对拜占庭攻击的鲁棒性。

链接: https://arxiv.org/abs/2506.01989
作者: Chengxi Li,Ming Xiao,Mikael Skoglund
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:In this paper, we investigate the problem of distributed learning (DL) in the presence of Byzantine attacks. For this problem, various robust bounded aggregation (RBA) rules have been proposed at the central server to mitigate the impact of Byzantine attacks. However, current DL methods apply RBA rules for the local gradients from the honest devices and the disruptive information from Byzantine devices, and the learning performance degrades significantly when the local gradients of different devices vary considerably from each other. To overcome this limitation, we propose a new DL method to cope with Byzantine attacks based on coded robust aggregation (CRA-DL). Before training begins, the training data are allocated to the devices redundantly. During training, in each iteration, the honest devices transmit coded gradients to the server computed from the allocated training data, and the server then aggregates the information received from both honest and Byzantine devices using RBA rules. In this way, the global gradient can be approximately recovered at the server to update the global model. Compared with current DL methods applying RBA rules, the improvement of CRA-DL is attributed to the fact that the coded gradients sent by the honest devices are closer to each other. This closeness enhances the robustness of the aggregation against Byzantine attacks, since Byzantine messages tend to be significantly different from those of honest devices in this case. We theoretically analyze the convergence performance of CRA-DL. Finally, we present numerical results to verify the superiority of the proposed method over existing baselines, showing its enhanced learning performance under Byzantine attacks.
zh
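下面的玩具示例说明其核心直觉:冗余数据分配使诚实设备的编码梯度彼此接近,此时常见的 RBA 规则(这里以坐标中位数为代表)即可抑制拜占庭消息。该示例仅为示意,并非 CRA-DL 的具体编码方案。

```python
import numpy as np

rng = np.random.default_rng(1)
true_grad = rng.normal(size=10)

# 诚实设备:编码梯度 ≈ 全局梯度 + 小残差(得益于冗余数据分配)
honest = [true_grad + 0.01 * rng.normal(size=10) for _ in range(8)]
# 拜占庭设备:发送任意破坏性向量
byzantine = [rng.normal(scale=10.0, size=10) for _ in range(2)]

messages = np.stack(honest + byzantine)
aggregated = np.median(messages, axis=0)  # 一种常见的 RBA 规则:坐标中位数
print("recovery error:", np.linalg.norm(aggregated - true_grad))
```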

[AI-113] Surrogate Interpretable Graph for Random Decision Forests

【速读】:该论文试图解决随机森林模型在健康信息学领域中由于特征和估计器数量增加而导致的全局特征交互难以被领域专家准确解释的问题,这可能影响信任度和监管合规性。解决方案的关键是引入一种称为“代理可解释图”的方法,该方法利用图结构和混合整数线性规划来分析和可视化特征交互,从而通过决策-特征-交互表展示特征使用情况以及预测中最关键的分层决策特征交互,提升模型的全局可解释性。

链接: https://arxiv.org/abs/2506.01988
作者: Akshat Dubey,Aleksandar Anžel,Georges Hattab
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The field of health informatics has been profoundly influenced by the development of random forest models, which have led to significant advances in the interpretability of feature interactions. These models are characterized by their robustness to overfitting and parallelization, making them particularly useful in this domain. However, the increasing number of features and estimators in random forests can prevent domain experts from accurately interpreting global feature interactions, thereby compromising trust and regulatory compliance. A method called the surrogate interpretability graph has been developed to address this issue. It uses graphs and mixed-integer linear programming to analyze and visualize feature interactions. This improves their interpretability by visualizing the feature usage per decision-feature-interaction table and the most dominant hierarchical decision feature interactions for predictions. The implementation of a surrogate interpretable graph enhances global interpretability, which is critical for such a high-stakes domain.
zh

[AI-114] Equally Critical: Samples Targets and Their Mappings in Datasets

【速读】:该论文试图解决数据在训练过程中同时具备样本(sample)和目标(target)双重属性,但现有方法在优化样本方面投入较多关注,而忽视了目标的重要性这一问题。其解决方案的关键在于通过构建一种基于样本-目标交互的分类体系,提出一种统一的损失框架,以评估样本和目标对训练效率的影响,并通过大量实验分析不同类型的样本和目标在数量和质量上的变化如何影响模型训练,从而提供提升训练效果的六个关键见解。

链接: https://arxiv.org/abs/2506.01987
作者: Runkang Yang,Peng Sun,Xinyi Shang,Yi Tang,Tao Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data inherently possesses dual attributes: samples and targets. For targets, knowledge distillation has been widely employed to accelerate model convergence, primarily relying on teacher-generated soft target supervision. Conversely, recent advancements in data-efficient learning have emphasized sample optimization techniques, such as dataset distillation, while neglecting the critical role of targets. This dichotomy motivates our investigation into understanding how both samples and targets collectively influence training dynamics. To address this gap, we first establish a taxonomy of existing paradigms through the lens of sample-target interactions, categorizing them into distinct sample-to-target mapping strategies. Building upon this foundation, we then propose a novel unified loss framework to assess their impact on training efficiency. Through extensive empirical studies on our proposed strategies, we comprehensively analyze how variations in target and sample types, quantities, and qualities influence model training, providing six key insights to enhance training efficacy.
zh
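作为“样本-目标映射”中最常见的一种,下面给出硬标签与教师软目标混合的蒸馏损失示意(PyTorch);混合权重 0.5 与温度 T=2 均为演示假设,并非论文统一损失框架本身。

```python
import torch
import torch.nn.functional as F

def mixed_target_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, T=2.0):
    ce = F.cross_entropy(student_logits, hard_labels)           # 硬目标项
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T                # 教师软目标项
    return alpha * ce + (1 - alpha) * kd

student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(mixed_target_loss(student, teacher, labels))
```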

[AI-115] SpecMemo: Speculative Decoding is in Your Pocket

【速读】:该论文试图解决在内存受限设备上部署生成式AI(Generative AI)的推测解码(speculative decoding)技术所面临的挑战,特别是在保持推理速度提升的同时有效管理内存分配的问题。解决方案的关键在于提出一种设备感知的推理引擎SpecMemo,该引擎通过理论建模推测解码的内存占用,确定内存预算的下限,从而在减少冗余内存分配和维持推测性能增益之间实现精确平衡。此外,SpecMemo还通过批量推测解码技术,提升了多小服务器GPU上的大模型推理效率。

链接: https://arxiv.org/abs/2506.01986
作者: Selin Yildirim,Deming Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Recent advancements in speculative decoding have demonstrated considerable speedup across a wide array of large language model (LLM) tasks. Speculative decoding inherently relies on sacrificing extra memory allocations to generate several candidate tokens, whose acceptance rate drives the speedup. However, deploying speculative decoding on memory-constrained devices, such as mobile GPUs, remains a significant challenge in real-world scenarios. In this work, we present a device-aware inference engine named SpecMemo that can smartly control memory allocations at finer levels to enable multi-turn chatbots with speculative decoding on such limited-memory devices. Our methodology stems from theoretically modeling the memory footprint of speculative decoding to determine a lower bound on the required memory budget while retaining speedup. SpecMemo empirically strikes a careful balance between minimizing redundant memory allocations for rejected candidate tokens and maintaining competitive performance gains from speculation. Notably, with SpecMemo’s memory management, we maintain 96% of overall throughput from speculative decoding on MT-Bench, while reducing generation memory by 65% on a single Nvidia Titan RTX. Given multiple constrained GPUs, we build on top of previous speculative decoding architectures to facilitate big-model inference by distributing the Llama-2-70B-Chat model, on which we provide novel batched speculative decoding to increase the usability of multiple small server GPUs. This novel framework demonstrates a 2x speedup over distributed and batched vanilla decoding with the base model on eight AMD MI250 GPUs. Moreover, inference throughput increases remarkably, by 8x, with batch size 10. Our work contributes to democratized LLM applications in resource-constrained environments, providing a pathway for faster and cheaper deployment of real-world LLM applications with robust performance.
zh
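下面用标准的 KV cache 占用公式粗略演示 SpecMemo 所形式化的那类内存核算:候选草案令牌会额外消耗多少显存。模型形状按 Llama-2-70B 量级假设,仅作量级示意,论文中的下界由理论推导给出。

```python
def draft_kv_bytes(n_tokens, n_layers=80, n_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    # 2 倍对应 K 与 V;逐层、逐令牌累计(标准 KV cache 估算,参数为假设值)
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

for k in (4, 8, 16):  # 不同的候选草案长度
    print(f"{k} draft tokens -> {draft_kv_bytes(k) / 2**20:.1f} MiB KV cache")
```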

[AI-116] Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification

【速读】:该论文试图解决抗菌肽(Antimicrobial Peptides)预测与分类中的准确性与效率问题。其解决方案的关键在于通过结合不同视角的最佳编码方法,并利用深度神经网络对不平衡数据集进行平衡,从而提升预测性能。实验结果表明,所提出的方法在准确性和效率方面均优于现有方法,具有较高的应用价值。

链接: https://arxiv.org/abs/2506.01983
作者: Reyhaneh Keshavarzpour,Eghbal Mansoori
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Identification of antimicrobial peptides is an important and necessary task today. Antimicrobial peptides are essential as an alternative to antibiotics for biomedical and many other practical applications. These oligopeptides are useful in drug design and trigger innate immunity against microorganisms. Artificial intelligence algorithms have played a significant role in easing the identification of these peptides. This research improves on a previously proposed method for antimicrobial peptide prediction: the suggested approach combines the best encoding methods from different perspectives, followed by a deep neural network to balance the imbalanced combined datasets. The results show that the proposed method significantly improves the accuracy and efficiency of antimicrobial peptide prediction and provides the best results compared with existing methods. These developments in the prediction and classification of antimicrobial peptides are highly effective and applicable, particularly in medicine and the pharmaceutical industry.
zh

[AI-117] Music interpretation and emotion perception: A computational and neurophysiological investigation

【速读】:该论文旨在探讨音乐表演中情感表达与感知的机制,特别是不同表演情境(如曲目、调式练习和即兴演奏)以及表现力水平对表演者情感传达和听众反应的影响。其解决方案的关键在于结合计算方法与神经生理学手段,通过分析音频特征、情感标注及神经生理指标,揭示即兴演奏和富有表现力的表演在情感沟通和观众参与度方面的独特优势。

链接: https://arxiv.org/abs/2506.01982
作者: Vassilis Lyberatos,Spyridon Kantarelis,Ioanna Zioga,Christina Anagnostopoulou,Giorgos Stamou,Anastasia Georgaki
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates emotional expression and perception in music performance using computational and neurophysiological methods. The influence of different performance settings, such as repertoire, diatonic modal etudes, and improvisation, as well as levels of expressiveness, on performers’ emotional communication and listeners’ reactions is explored. Professional musicians performed various tasks, and emotional annotations were provided by both performers and the audience. Audio analysis revealed that expressive and improvisational performances exhibited unique acoustic features, while emotion analysis showed stronger emotional responses. Neurophysiological measurements indicated greater relaxation in improvisational performances. This multimodal study highlights the significance of expressivity in enhancing emotional communication and audience engagement.
zh

[AI-118] Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

【速读】:该论文旨在解决现有推测解码(Speculative Decoding, SD)方法因串行执行而导致的推理效率瓶颈问题,特别是草案模型与目标模型之间的相互等待延迟。其解决方案的关键在于引入一种名为SpecBranch的新框架,通过借鉴现代处理器中的分支预测思想,实现SD中的分支并行性。该框架通过策略性地引入并行推测分支来提前应对可能的令牌拒绝,并结合自适应草案长度与隐式草案模型置信度及显式目标模型特征复用,以提升并行度并减少回滚令牌数量。

链接: https://arxiv.org/abs/2506.01979
作者: Yuhao Shen,Junyi Shen,Quan Kong,Tianyu Liu,Yao Lu,Cong Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However, the existing SD methods still remain fundamentally constrained by their serialized execution, which causes the mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose a novel framework SpecBranch to unlock branch parallelism in SD. Specifically, we first take an in-depth analysis of the potential of branch parallelism in SD, and recognize that the key challenge lies in the trade-offs between parallelization and token rollback. Based on the analysis, we strategically introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to enhance parallelism, we jointly orchestrate adaptive draft lengths with a hybrid combination of the implicit draft model confidence and explicit reusing of target model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves 1.8x to 4.5x speedups over auto-regressive decoding and reduces rollback tokens by 50% for poorly aligned models, realizing its applicability for real-world deployments.
zh
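下面是“隐式草案模型置信度驱动自适应草案长度”这一思路的极简示意:置信度低于阈值即提前停止起草。`draft_top_prob` 为占位接口,阈值 0.7 为演示假设,并非 SpecBranch 的实际策略。

```python
def adaptive_draft(draft_top_prob, prefix, max_len=8, tau=0.7):
    """draft_top_prob(tokens) -> (下一个草案令牌, 其概率);此处为假设接口。"""
    tokens = []
    for _ in range(max_len):
        token, p = draft_top_prob(prefix + tokens)
        if p < tau:        # 置信度不足:提前收手,交给并行分支对冲拒绝风险
            break
        tokens.append(token)
    return tokens

# 玩具用法:概率随草案长度衰减的伪草案模型(仅为演示)
fake_draft = lambda toks: (len(toks), 0.95 ** (len(toks) + 1))
print(adaptive_draft(fake_draft, prefix=[]))
```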

[AI-119] owards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN

【速读】:该论文试图解决图编辑距离(Graph Edit Distance, GED)计算中对真实标签(ground-truth labels)高度依赖的问题,而真实标签在实际场景中通常获取成本较高。解决方案的关键在于提出GEDRanker,这是一个基于生成对抗网络(GAN)的无监督框架,其核心是引入一个可解释的偏好感知判别器,并结合有效的训练策略,引导基于匹配的GED求解器生成高质量的节点匹配,从而在无需真实标签的情况下实现接近最优的GED求解效果。

链接: https://arxiv.org/abs/2506.01977
作者: Wei Huang,Hanchen Wang,Dong Wen,Shaozhen Ma,Wenjie Zhang,Xuemin Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. Recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem, then leveraging a generative diffusion model to predict node matching between two graphs, from which both the GED and its corresponding edit path can be extracted using a traditional algorithm. However, such methods typically rely heavily on ground-truth supervision, where the ground-truth labels are often costly to obtain in real-world scenarios. In this paper, we propose GEDRanker, a novel unsupervised GAN-based framework for GED computation. Specifically, GEDRanker consists of a matching-based GED solver and introduces an interpretable preference-aware discriminator with an effective training strategy to guide the matching-based GED solver toward generating high-quality node matching without the need for ground-truth labels. Extensive experiments on benchmark datasets demonstrate that our GEDRanker enables the matching-based GED solver to achieve near-optimal solution quality without any ground-truth supervision.
zh

[AI-120] Crack Path Prediction with Operator Learning using Discrete Particle System data Generation

【速读】:该论文旨在解决工程材料和结构中裂纹扩展的精确建模问题,这一问题对于预测失效至关重要,因为微小裂纹可能迅速发展并导致灾难性损伤。传统方法通常依赖连续体假设,而本文提出了一种基于本构信息粒子动力学(Constitutively Informed Particle Dynamics, CPD)模拟数据的深度算子网络(DeepONet)解决方案,其关键在于通过函数空间映射而非有限维向量来学习裂纹扩展的时空演化规律。研究对比了两种DeepONet变体——标准DeepONet和融合DeepONet,并验证了融合架构在不同几何形状试样中的优越性能,尤其是在非断裂情形下表现出更高的预测准确性。

链接: https://arxiv.org/abs/2506.01976
作者: Elham Kiyani,Venkatesh Ananchaperumal,Ahmad Peyvan,Mahendaran Uchimali,Gang Li,George Em Karniadakis
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 22 pages, 14 figures

点击查看摘要

Abstract:Accurately modeling crack propagation is critical for predicting failure in engineering materials and structures, where small cracks can rapidly evolve and cause catastrophic damage. The interaction of cracks with discontinuities, such as holes, significantly affects crack deflection and arrest. Recent developments in discrete particle systems with multibody interactions based on constitutive behavior have demonstrated the ability to capture crack nucleation and evolution without relying on continuum assumptions. In this work, we use data from Constitutively Informed Particle Dynamics (CPD) simulations to train operator learning models, specifically Deep Operator Networks (DeepONets), which learn mappings between function spaces instead of finite-dimensional vectors. We explore two DeepONet variants: vanilla and Fusion DeepONet, for predicting time-evolving crack propagation in specimens with varying geometries. Three representative cases are studied: (i) varying notch height without active fracture; and (ii)-(iii) combinations of notch height and hole radius where dynamic fracture occurs on irregular discrete meshes. The models are trained on 32 to 45 samples, using geometric inputs in the branch network and spatial-temporal coordinates in the trunk network. Results show that Fusion DeepONet consistently outperforms the vanilla variant, with more accurate predictions especially in non-fracturing cases. Fracture-driven scenarios involving displacement and crack evolution remain more challenging. These findings highlight the potential of Fusion DeepONet to generalize across complex, geometry-varying, and time-dependent crack propagation phenomena.
zh
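下面给出一个 vanilla DeepONet 的 PyTorch 骨架:branch 网络编码几何参数(如缺口高度、孔半径),trunk 网络编码时空坐标,二者内积即为预测的场值。层宽等超参数均为示意性假设,并非论文中的网络配置。

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, geom_dim=2, coord_dim=3, width=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(geom_dim, width), nn.Tanh(),
                                    nn.Linear(width, width))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, width), nn.Tanh(),
                                   nn.Linear(width, width))

    def forward(self, geom, coords):
        # geom: (batch, geom_dim); coords: (batch, coord_dim) -> (batch,)
        return (self.branch(geom) * self.trunk(coords)).sum(dim=-1)

model = DeepONet()
u = model(torch.randn(16, 2), torch.randn(16, 3))  # 预测的场值
print(u.shape)  # torch.Size([16])
```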

[AI-121] An empirical study of task and feature correlations in the reuse of pre-trained models

【速读】:该论文试图解决预训练神经网络在不同任务间迁移时成功因素的归属问题,具体而言是探讨Bob在使用Alice预训练模型的部分结构进行不同任务时取得成功的原因。论文的关键解决方案是通过构建一个实验框架,在计算机模拟环境中系统地分析影响Bob任务性能的因素,从而揭示任务相关性、网络结构及优化器选择对迁移学习效果的影响。

链接: https://arxiv.org/abs/2506.01975
作者: Jama Hussein Mohamud
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained neural networks are commonly used and reused in the machine learning community. Alice trains a model for a particular task, and a part of her neural network is reused by Bob for a different task, often to great effect. To what can we ascribe Bob’s success? This paper introduces an experimental setup through which factors contributing to Bob’s empirical success could be studied in silico. As a result, we demonstrate that Bob might just be lucky: his task accuracy increases monotonically with the correlation between his task and Alice’s. Even when Bob has provably uncorrelated tasks and input features from Alice’s pre-trained network, he can achieve significantly better than random performance due to Alice’s choice of network and optimizer. When there is little correlation between tasks, only reusing lower pre-trained layers is preferable, and we hypothesize the converse: that the optimal number of retrained layers is indicative of task and feature correlation. Finally, we show in controlled real-world scenarios that Bob can effectively reuse Alice’s pre-trained network if there are semantic correlations between his and Alice’s task.
zh
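对应文中“仅复用较低的预训练层”的做法,下面给出 PyTorch 冻结前 k 层、只训练其余层的示意;模型用通用 MLP 代替,仅为演示。

```python
import torch.nn as nn

def reuse_lower_layers(pretrained: nn.Sequential, k: int) -> nn.Sequential:
    for i, layer in enumerate(pretrained):
        for p in layer.parameters():
            p.requires_grad = i >= k   # 冻结下标小于 k 的(底部)层
    return pretrained

alice = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
bob = reuse_lower_layers(alice, k=2)   # Bob 冻结 Alice 的最底层,其余重训
print([n for n, p in bob.named_parameters() if p.requires_grad])
```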

[AI-122] FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs

【速读】:该论文旨在解决在单机多GPU服务器上部署DeepSeek-R1 671B模型时,多头潜在注意力(Multi-Head Latent Attention, MLA)高效推理所面临的挑战。其解决方案的关键在于提出了一种名为ETAP(Efficient Transpose Attention Pipeline)的新方法,通过转置重新配置注意力计算,使键值(KV)上下文长度与WGMMA操作中的MM维度对齐,从而显著减少冗余计算。这一设计有效提升了推理速度,并在保持数值稳定性的同时实现了比FlashAttention-3和FlashInfer更高的性能提升。

链接: https://arxiv.org/abs/2506.01969
作者: Pencuo Zeren,Qiuming Luo,Rui Mao,Chang Kong
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, conference

点击查看摘要

Abstract:Efficient inference of Multi-Head Latent Attention (MLA) is challenged by deploying the DeepSeek-R1 671B model on a single Multi-GPU server. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the (M)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE ((1.25 \times 10^-5)) than FlashAttention-3. Furthermore, ETAP’s design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at this https URL.
zh

[AI-123] Efficient ANN-SNN Conversion with Error Compensation Learning

【速读】:该论文旨在解决人工神经网络(ANN)向脉冲神经网络(SNN)转换过程中出现的显著精度损失和推理时间增加的问题,这些问题主要由转换过程中的截断、量化和激活不均匀等误差引起。其解决方案的关键在于提出一种基于误差补偿学习的新型ANN-to-SNN转换框架,通过引入可学习的阈值截断函数、双阈值神经元以及优化的膜电位初始化策略,分别缓解截断误差、量化误差和非均匀性误差,从而实现高精度与超低延迟的转换效果。

链接: https://arxiv.org/abs/2506.01968
作者: Chang Liu,Jiangrong Shen,Xuming Ran,Mingkun Xu,Qi Xu,Yi Xu,Gang Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Artificial neural networks (ANNs) have demonstrated outstanding performance in numerous tasks, but deployment in resource-constrained environments remains a challenge due to their high computational and memory requirements. Spiking neural networks (SNNs) operate through discrete spike events and offer superior energy efficiency, providing a bio-inspired alternative. However, current ANN-to-SNN conversion often results in significant accuracy loss and increased inference time due to conversion errors such as clipping, quantization, and uneven activation. This paper proposes a novel ANN-to-SNN conversion framework based on error compensation learning. We introduce a learnable threshold clipping function, dual-threshold neurons, and an optimized membrane potential initialization strategy to mitigate the conversion error. Together, these techniques address the clipping error through adaptive thresholds, dynamically reduce the quantization error through dual-threshold neurons, and minimize the non-uniformity error by effectively managing the membrane potential. Experimental results on the CIFAR-10, CIFAR-100, and ImageNet datasets show that our method achieves high precision and ultra-low latency compared with existing conversion methods. Using only two time steps, our method significantly reduces the inference time while maintaining a competitive accuracy of 94.75% on the CIFAR-10 dataset under the ResNet-18 structure. This research promotes the practical application of SNNs on low-power hardware, making efficient real-time processing possible.
zh
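下面用一个玩具积分发放(IF)神经元示意其中两点:阈值对激活的截断作用,以及软复位(减阈值)对量化误差的缓解;数值与两步时间窗均为演示假设,并非论文的双阈值公式。

```python
import numpy as np

def if_neuron(inputs, theta=1.0, steps=2, v0=0.5):
    v = v0 * theta            # 膜电位初始化(文中对此亦作了优化)
    spikes = 0
    for _ in range(steps):
        v += inputs           # 累积(速率编码的)输入电流
        if v >= theta:
            spikes += 1
            v -= theta        # 软复位:保留超出部分,降低量化误差
    return spikes / steps     # 发放率近似被 theta 截断后的激活值

for x in (0.2, 0.6, 1.4):
    print(x, "->", if_neuron(x))
```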

[AI-124] Matrix Is All You Need

【速读】:该论文试图解决深度神经网络在视觉、序列和语言任务中采用不同架构导致其底层共性被掩盖的问题,旨在建立一个统一的框架以揭示这些模型之间的共同数学本质。解决方案的关键在于提出一种统一的矩阵阶框架,将卷积、循环和自注意力操作统一为稀疏矩阵乘法,其中卷积通过上三角权重矩阵实现一阶变换,循环通过下三角矩阵编码分步更新,注意力则自然地表现为三阶张量分解。该方法在合理假设下与标准的CNN、RNN和Transformer层具有代数同构性,并通过实验验证了其在多种任务上的有效性,同时展示了其在计算效率和硬件适配方面的优势。

链接: https://arxiv.org/abs/2506.01966
作者: Yuzhou Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep neural networks employ specialized architectures for vision, sequential and language tasks, yet this proliferation obscures their underlying commonalities. We introduce a unified matrix-order framework that casts convolutional, recurrent and self-attention operations as sparse matrix multiplications. Convolution is realized via an upper-triangular weight matrix performing first-order transformations; recurrence emerges from a lower-triangular matrix encoding stepwise updates; attention arises naturally as a third-order tensor factorization. We prove algebraic isomorphism with standard CNN, RNN and Transformer layers under mild assumptions. Empirical evaluations on image classification (MNIST, CIFAR-10/100, Tiny ImageNet), time-series forecasting (ETTh1, Electricity Load Diagrams) and language modeling/classification (AG News, WikiText-2, Penn Treebank) confirm that sparse-matrix formulations match or exceed native model performance while converging in comparable or fewer epochs. By reducing architecture design to sparse pattern selection, our matrix perspective aligns with GPU parallelism and leverages mature algebraic optimization tools. This work establishes a mathematically rigorous substrate for diverse neural architectures and opens avenues for principled, hardware-aware network design.
zh
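下面用 NumPy 数值验证“卷积即稀疏(三角 Toeplitz)矩阵乘法”这一核心命题:因果一维卷积 y[t] = Σ_k w[k]·x[t−k] 对应一个带状三角 Toeplitz 矩阵(此处按列向量约定写成下三角,文中的上三角形式相当于转置约定)。

```python
import numpy as np

x = np.random.randn(8)
w = np.array([0.5, -1.0, 2.0])   # 卷积核,k = 0..2

# 构造 Toeplitz 矩阵 M,满足 M[t, t-k] = w[k]
M = np.zeros((8, 8))
for t in range(8):
    for k, wk in enumerate(w):
        if t - k >= 0:
            M[t, t - k] = wk

y_matrix = M @ x
y_direct = np.array([sum(w[k] * x[t - k] for k in range(3) if t - k >= 0)
                     for t in range(8)])
print("convolution == sparse matmul:", np.allclose(y_matrix, y_direct))
```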

[AI-125] askVAE: Task-Specific Variational Autoencoders for Exemplar Generation in Continual Learning for Human Activity Recognition

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中模型在动态数据环境中适应新任务时面临的遗忘旧知识和内存约束的问题。其解决方案的关键在于提出TaskVAE框架,该框架通过使用任务特定的变分自编码器(Variational Autoencoders, VAEs)生成合成样本,以在不依赖先验类别数量或单一VAE的情况下,灵活地应对不断增加的任务。TaskVAE能够在有限的内存占用下保持对旧类别的高准确性,并有效提升新类别的学习性能。

链接: https://arxiv.org/abs/2506.01965
作者: Bonpagna Kann,Sandra Castellanos-Paez,Romain Rombourg,Philippe Lalanda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 3 tables

点击查看摘要

Abstract:As machine learning-based systems become more integrated into daily life, they unlock new opportunities but face the challenge of adapting to dynamic data environments. Various forms of data shift (gradual, abrupt, or cyclic) threaten model accuracy, making continual adaptation essential. Continual Learning (CL) enables models to learn from evolving data streams while minimizing forgetting of prior knowledge. Among CL strategies, replay-based methods have proven effective, but their success relies on balancing memory constraints and retaining old class accuracy while learning new classes. This paper presents TaskVAE, a framework for replay-based CL in class-incremental settings. TaskVAE employs task-specific Variational Autoencoders (VAEs) to generate synthetic exemplars from previous tasks, which are then used to train the classifier alongside new task data. In contrast to traditional methods that require prior knowledge of the total class count or rely on a single VAE for all tasks, TaskVAE adapts flexibly to increasing tasks without such constraints. We focus on Human Activity Recognition (HAR) using IMU sensor-equipped devices. Unlike previous HAR studies that combine data across all users, our approach focuses on individual user data, better reflecting real-world scenarios where a person progressively learns new activities. Extensive experiments on 5 different HAR datasets show that TaskVAE outperforms experience replay methods, particularly with limited data, and exhibits robust performance as dataset size increases. Additionally, the memory footprint of TaskVAE is minimal, equivalent to only 60 samples per task, while still being able to generate an unlimited number of synthetic samples. The contributions lie in balancing memory constraints, task-specific generation, and long-term stability, making it a reliable solution for real-world applications in domains like HAR.
zh
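下面示意“按任务采样 VAE 生成回放样本”的核心流程:从先验采样潜变量,用各旧任务的解码器重建样本,再与新任务数据混合训练分类器。`decoders` 与各维度均为演示假设。

```python
import torch

def build_replay_batch(decoders, n_per_task=60, latent_dim=16):
    exemplars = []
    for task_id, dec in decoders.items():
        z = torch.randn(n_per_task, latent_dim)   # 从 VAE 先验 N(0, I) 采样
        with torch.no_grad():
            exemplars.append((dec(z), task_id))   # 解码出该任务的合成样本
    return exemplars  # 与新任务数据拼接后用于分类器训练

# 玩具用法:用线性层充当两个旧任务的解码器(仅为演示)
decoders = {0: torch.nn.Linear(16, 32), 1: torch.nn.Linear(16, 32)}
batch = build_replay_batch(decoders)
print(batch[0][0].shape)  # torch.Size([60, 32])
```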

[AI-126] Graph-Based Adversarial Domain Generalization with Anatomical Correlation Knowledge for Cross-User Human Activity Recognition

【速读】:该论文旨在解决传感器基础的人类活动识别(HAR)系统中跨用户可泛化性不足的问题,因为传统模型难以适应不同用户的个体差异,如行为模式、传感器位置和数据分布的不同。其解决方案的关键在于提出一种名为GNN-ADG(图神经网络与对抗域泛化结合)的方法,该方法融合了图神经网络(GNN)的空间建模能力和对抗学习的域泛化能力,通过构建三种解剖单元(Interconnected Units、Analogous Units、Lateral Units)并将其整合为统一的图结构,动态捕捉空间、功能和侧向相关性,从而实现用户无关的表征学习。

链接: https://arxiv.org/abs/2506.01962
作者: Xiaozhou Ye,Kevin I-Kai Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-user variability poses a significant challenge in sensor-based Human Activity Recognition (HAR) systems, as traditional models struggle to generalize across users due to differences in behavior, sensor placement, and data distribution. To address this, we propose GNN-ADG (Graph Neural Network with Adversarial Domain Generalization), a novel method that leverages the strengths of both Graph Neural Networks (GNNs) and adversarial learning to achieve robust cross-user generalization. GNN-ADG models spatial relationships between sensors on different anatomical body parts, extracting three types of Anatomical Units: (1) Interconnected Units, capturing inter-relations between neighboring sensors; (2) Analogous Units, grouping sensors on symmetrical or functionally similar body parts; and (3) Lateral Units, connecting sensors based on their position to capture region-specific coordination. Information from these units is fused into a unified graph structure with a cyclic training strategy, dynamically integrating spatial, functional, and lateral correlations to facilitate a holistic, user-invariant representation. The information fusion mechanism of GNN-ADG operates by iteratively cycling through edge topologies during training, allowing the model to refine its understanding of inter-sensor relationships across diverse perspectives. By representing the spatial configuration of sensors as a unified graph and incorporating adversarial learning, GNN-ADG effectively learns features that generalize well to unseen users without requiring target user data during training, making it practical for real-world applications.
zh

[AI-127] Ubiquitous Symmetry at Critical Points Across Diverse Optimization Landscapes

【速读】:该论文试图解决在更广泛的数学空间中研究损失函数的临界点是否具有对称性的问题,特别是扩展之前在神经网络中的观察结果。其解决方案的关键在于引入四种新的案例(有限域上的射影情况、正八面体图情况、完美匹配情况和粒子吸引情况),并验证这些情况下所有观测到的临界点均表现出非平凡的对称性。此外,论文还提出了一种新的对称性度量,能够揭示先前度量未能捕捉到的对称结构。

链接: https://arxiv.org/abs/2506.01959
作者: Irmi Schneider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atomic Physics (physics.atom-ph)
备注:

点击查看摘要

Abstract:Symmetry plays a crucial role in understanding the properties of mathematical structures and optimization problems. Recent work has explored this phenomenon in the context of neural networks, where the loss function is invariant under column and row permutations of the network weights. It has been observed that local minima exhibit significant symmetry with respect to the network weights (invariance to row and column permutations); moreover, no critical point was found that lacked symmetry. We extend this line of inquiry by investigating symmetry phenomena in real-valued loss functions defined on a broader class of spaces. We introduce four more cases: the projective case over a finite field, the octahedral graph case, the perfect matching case, and the particle attraction case. We show that, as in the neural network case, all the critical points observed have non-trivial symmetry. Finally, we introduce a new measure of symmetry in the system and show that it reveals additional symmetry structures not captured by the previous measure.
zh

[AI-128] Modelling the Effects of Hearing Loss on Neural Coding in the Auditory Midbrain with Variational Conditioning

【速读】:该论文旨在解决听力损失对听觉神经编码影响建模的问题,传统模型无法捕捉不同个体间听力损失的广泛差异。其解决方案的关键在于提出一种新颖的变分-条件模型,该模型直接从健康和噪声暴露动物听觉中脑的神经活动记录中学习听力损失的空间,通过仅6个自由参数来表征每个动物的听力损失,并实现了对正常听力和听力受损动物神经反应的高精度预测。

链接: https://arxiv.org/abs/2506.03088
作者: Lloyd Pellatt,Fotios Drakopoulos,Shievanie Sabesan,Nicholas A. Lesica
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:The mapping from sound to neural activity that underlies hearing is highly non-linear. The first few stages of this mapping in the cochlea have been modelled successfully, with biophysical models built by hand and, more recently, with DNN models trained on datasets simulated by biophysical models. Modelling the auditory brain has been a challenge because central auditory processing is too complex for models to be built by hand, and datasets for training DNN models directly have not been available. Recent work has taken advantage of large-scale high resolution neural recordings from the auditory midbrain to build a DNN model of normal hearing with great success. But this model assumes that auditory processing is the same in all brains, and therefore it cannot capture the widely varying effects of hearing loss. We propose a novel variational-conditional model to learn to encode the space of hearing loss directly from recordings of neural activity in the auditory midbrain of healthy and noise exposed animals. With hearing loss parametrised by only 6 free parameters per animal, our model accurately predicts 62% of the explainable variance in neural responses from normal hearing animals and 68% for hearing impaired animals, within a few percentage points of state-of-the-art animal-specific models. We demonstrate that the model can be used to simulate realistic activity from out-of-sample animals by fitting only the learned conditioning parameters with Bayesian optimisation, achieving cross-entropy loss within 2% of the optimum in 15-30 iterations. Including more animals in the training data slightly improved the performance on unseen animals. This model will enable future development of parametrised hearing loss compensation models trained to directly restore normal neural coding in hearing impaired brains, which can be quickly fitted for a new user by human-in-the-loop optimisation.
zh

[AI-129] CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

【速读】:该论文试图解决风格标注文本到语音合成(CapTTS)在实际应用中面临的挑战,主要由于缺乏标准化、全面的数据集以及基于CapTTS的下游任务研究有限。解决方案的关键在于引入CapSpeech,这是一个针对一系列CapTTS相关任务的新基准,包括带有声音事件的风格标注文本到语音合成(CapTTS-SE)、口音标注TTS(AccCapTTS)、情感标注TTS(EmoCapTTS)和聊天代理文本到语音合成(AgentTTS)。CapSpeech包含超过1000万条机器标注的音频-标题对和近36万条人工标注的音频-标题对,并特别为AgentTTS和CapTTS-SE任务收集了由专业配音演员和经验丰富的音频工程师录制的新数据集。此外,研究者还使用自回归和非自回归模型在CapSpeech上进行了全面实验,验证了其在多种口语风格下生成高保真且高度可理解语音的能力。

链接: https://arxiv.org/abs/2506.02863
作者: Helin Wang,Jiarui Hai,Dading Chong,Karan Thakkar,Tiantian Feng,Dongchao Yang,Junhyeok Lee,Laureano Moro Velazquez,Jesus Villalba,Zengyi Qin,Shrikanth Narayanan,Mounya Elhiali,Najim Dehak
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
zh

[AI-130] Deep Learning Enhanced Multivariate GARCH

【速读】:该论文试图解决传统多变量GARCH(Multivariate GARCH)方法在捕捉金融收益率数据中持久性波动聚集和资产间非对称共同运动方面的局限性。解决方案的关键在于提出一种融合长短期记忆网络(LSTM)与BEKK模型的新型多变量波动率建模框架——LSTM-BEKK,通过将循环神经网络的灵活性与BEKK模型的计量经济学结构相结合,以更好地捕获非线性、动态和高维依赖结构。

链接: https://arxiv.org/abs/2506.02796
作者: Haoyuan Wang,Chen Liu,Minh-Ngoc Tran,Chao Wang
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注:

点击查看摘要

Abstract:This paper introduces a novel multivariate volatility modeling framework, named Long Short-Term Memory enhanced BEKK (LSTM-BEKK), that integrates deep learning into multivariate GARCH processes. By combining the flexibility of recurrent neural networks with the econometric structure of BEKK models, our approach is designed to better capture nonlinear, dynamic, and high-dimensional dependence structures in financial return data. The proposed model addresses key limitations of traditional multivariate GARCH-based methods, particularly in capturing persistent volatility clustering and asymmetric co-movement across assets. Leveraging the data-driven nature of LSTMs, the framework adapts effectively to time-varying market conditions, offering improved robustness and forecasting performance. Empirical results across multiple equity markets confirm that the LSTM-BEKK model achieves superior performance in terms of out-of-sample portfolio risk forecast, while maintaining the interpretability from the BEKK models. These findings highlight the potential of hybrid econometric-deep learning models in advancing financial risk management and multivariate volatility forecasting.
zh
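下面用 NumPy 写出 LSTM 所增强的 BEKK(1,1) 协方差递推 H_t = C C′ + A r_{t−1} r_{t−1}′ A′ + B H_{t−1} B′;参数取值仅为演示,在 LSTM-BEKK 中其中部分量由循环网络驱动。

```python
import numpy as np

def bekk_step(H_prev, r_prev, C, A, B):
    return C @ C.T + A @ np.outer(r_prev, r_prev) @ A.T + B @ H_prev @ B.T

n = 2
C = np.array([[0.10, 0.00],
              [0.02, 0.10]])        # 下三角,保证 C C' 正定
A, B = 0.3 * np.eye(n), 0.9 * np.eye(n)

H = 0.04 * np.eye(n)                # 初始条件协方差
rng = np.random.default_rng(0)
for _ in range(100):
    r = rng.multivariate_normal(np.zeros(n), H)  # 模拟收益率序列
    H = bekk_step(H, r, C, A, B)
print("conditional covariance:\n", H)
```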

[AI-131] Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions

【速读】:该论文试图解决现有表达性文本到语音(TTS)系统仅能建模有限的分类情绪,而人类对话中情绪多样性远超这些预定义情绪的问题,从而影响自然交互的效果。解决方案的关键在于提出一种新颖的提示未见情绪(PUE)方法,通过情感引导的提示学习生成未见过的情绪语音,该方法利用大语言模型-语音合成(LLM-TTS)架构确保情感相关提示与情感语音之间的情感一致性,并在推理阶段通过灵活调整情感比例和利用LLM上下文知识生成混合情感语音,实现对不同情感风格的量化捕捉。

链接: https://arxiv.org/abs/2506.02742
作者: Xiaoxue Gao,Huayun Zhang,Nancy F. Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Existing expressive text-to-speech (TTS) systems primarily model a limited set of categorical emotions, whereas human conversations extend far beyond these predefined emotions, making it essential to explore more diverse emotional speech generation for more natural interactions. To bridge this gap, this paper proposes a novel prompt-unseen-emotion (PUE) approach to generate unseen emotional speech via emotion-guided prompt learning. PUE is trained utilizing an LLM-TTS architecture to ensure emotional consistency between categorical emotion-relevant prompts and emotional speech, allowing the model to quantitatively capture different emotion weightings per utterance. During inference, mixed emotional speech can be generated by flexibly adjusting emotion proportions and leveraging LLM contextual knowledge, enabling the model to quantify different emotional styles. Our proposed PUE successfully facilitates expressive speech synthesis of unseen emotions in a zero-shot setting.
zh

[AI-132] Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi INTERSPEECH2025

【速读】:该论文试图解决计算机辅助发音训练(Computer-Assisted Pronunciation Training, CAPT)在印度语言中的应用不足问题,特别是在拥有15亿母语者的印度语言中,现有的发音工具严重缺乏。针对这一问题,论文提出的解决方案关键在于开发一个名为Dhvani的新型CAPT系统,该系统通过合成印地语错误发音的语音,并采用一种新的个性化反馈方法,结合印地语高度音素化的书写系统,对学习者的发音进行分析并提供针对性反馈。

链接: https://arxiv.org/abs/2506.02166
作者: Arnav Rustagi,Satvik Bajpai,Nimrat Kaur,Siddharth Siddharth
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at Interspeech 2025 to be held in Rotterdam, the Netherlands

点击查看摘要

Abstract:Computer-Assisted Pronunciation Training (CAPT) has been extensively studied for English. However, there remains a critical gap in its application to Indian languages with a base of 1.5 billion speakers. Pronunciation tools tailored to Indian languages are strikingly lacking despite the fact that millions learn them every year. With over 600 million speakers and being the fourth most-spoken language worldwide, improving Hindi pronunciation is a vital first step toward addressing this gap. This paper proposes 1) Dhvani – a novel CAPT system for Hindi, 2) synthetic speech generation for Hindi mispronunciations, and 3) a novel methodology for providing personalized feedback to learners. While the system often interacts with learners using Devanagari graphemes, its core analysis targets phonemic distinctions, leveraging Hindi’s highly phonetic orthography to analyze mispronounced speech and provide targeted feedback.
zh

[AI-133] Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge INTERSPEECH2025

【速读】:该论文旨在解决计算机辅助发音训练(Computer-Assisted Pronunciation Training, CAPT)系统中发音质量评估指标——发音良好度(Goodness of Pronunciation, GOP)在面对声学变异时,由于强制对齐(forced alignments)导致的标注和分割错误问题。其解决方案的关键在于引入一种替代感知的无对齐GOP方法,通过基于音素聚类和常见学习者错误限制音素替换,从而提高计算效率并改善评估性能。

链接: https://arxiv.org/abs/2506.02080
作者: Aditya Kamlesh Parikh,Cristian Tejedor-Garcia,Catia Cucchiarini,Helmer Strik
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2025. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants which is financed by the Dutch Research Council (NWO)

点击查看摘要

Abstract:Computer-Assisted Pronunciation Training (CAPT) systems employ automatic measures of pronunciation quality, such as the goodness of pronunciation (GOP) metric. GOP relies on forced alignments, which are prone to labeling and segmentation errors due to acoustic variability. While alignment-free methods address these challenges, they are computationally expensive and scale poorly with phoneme sequence length and inventory size. To enhance efficiency, we introduce a substitution-aware alignment-free GOP that restricts phoneme substitutions based on phoneme clusters and common learner errors. We evaluated our GOP on two L2 English speech datasets, one with child speech, My Pronunciation Coach (MPC), and SpeechOcean762, which includes child and adult speech. We compared RPS (restricted phoneme substitutions) and UPS (unrestricted phoneme substitutions) setups within alignment-free methods, which outperformed the baseline. We discuss our results and outline avenues for future research.
zh

[AI-134] Evaluating the Effectiveness of Pre-Trained Audio Embeddings for Classification of Parkinsons Disease Speech Data INTERSPEECH2025

【速读】:该论文试图解决帕金森病(Parkinson’s Disease, PD)诊断中因个体语音差异导致的深度声学特征有效性不稳定的问题。其解决方案的关键在于评估三种预训练音频嵌入模型(OpenL3、VGGish 和 Wav2Vec2.0)在PD分类任务中的表现,特别是针对特定语音任务如二联音重复(diadochokinesis, DDK)和听写重复(listen and repeat, LR)的有效性,以探索更稳健的特征提取方法。

链接: https://arxiv.org/abs/2506.02078
作者: Emmy Postma,Cristian Tejedor-Garcia
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2025. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants which is financed by the Dutch Research Council (NWO)

点击查看摘要

Abstract:Speech impairments are prevalent biomarkers for Parkinson’s Disease (PD), motivating the development of diagnostic techniques using speech data for clinical applications. Although deep acoustic features have shown promise for PD classification, their effectiveness often varies due to individual speaker differences, a factor that has not been thoroughly explored in the existing literature. This study investigates the effectiveness of three pre-trained audio embeddings (OpenL3, VGGish and Wav2Vec2.0 models) for PD classification. Using the NeuroVoz dataset, OpenL3 outperforms others in diadochokinesis (DDK) and listen and repeat (LR) tasks, capturing critical acoustic features for PD detection. Only Wav2Vec2.0 shows significant gender bias, achieving more favorable results for male speakers, in DDK tasks. The misclassified cases reveal challenges with atypical speech patterns, highlighting the need for improved feature extraction and model robustness in PD detection.
zh

[AI-135] Protap: A Benchmark for Protein Modeling on Realistic Downstream Applications

【速读】:该论文旨在解决蛋白质下游应用中模型性能提升的问题,特别是针对通用任务和工业相关但现有基准缺失的特殊任务(如酶促蛋白裂解位点预测和靶向蛋白降解)进行系统评估。其解决方案的关键在于构建一个全面的基准测试平台Protap,该平台对不同的主干架构、预训练策略以及领域特定模型在多种真实场景下的表现进行了系统比较,并验证了结构信息与领域先验知识在提升模型性能中的重要作用。

链接: https://arxiv.org/abs/2506.02052
作者: Shuo Yan,Yuliang Yan,Bin Ma,Chenao Li,Haochun Tang,Jiahua Lu,Minhua Lin,Yuyuan Feng,Hui Xiong,Enyan Dai
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain-specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce \textbfProtap , a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme-catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain-specific models and general architectures under multiple pretraining settings. Our empirical studies imply that: (i) Though large-scale pretraining encoders achieve great results, they often underperform supervised encoders trained on small downstream training sets. (ii) Incorporating structural information during downstream fine-tuning can match or even outperform protein language models pretrained on large-scale sequence corpora. (iii) Domain-specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at this https URL.
zh

[AI-136] Phenotypic Profile-Informed Generation of Drug-Like Molecules via Dual-Channel Variational Autoencoders IJCAI2025

【速读】:该论文试图解决传统方法在生成具有潜在治疗效果的类药分子时,仅依赖表达谱而忽视分子对细胞背景扰动效应的问题。解决方案的关键在于提出SmilesGEN,这是一种基于变分自编码器(VAE)架构的生成模型,通过将预训练的药物VAE(SmilesNet)与表达谱VAE(ProfileNet)相结合,在共同的潜在空间中联合建模药物扰动与转录响应之间的相互作用,从而更准确地生成具有高有效性和新颖性的分子。

链接: https://arxiv.org/abs/2506.02051
作者: Hui Liu,Shiye Tian,Xuejun Liu
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: IJCAI2025

点击查看摘要

Abstract:The de novo generation of drug-like molecules capable of inducing desirable phenotypic changes is receiving increasing attention. However, previous methods predominantly rely on expression profiles to guide molecule generation, but overlook the perturbative effect of the molecules on cellular contexts. To overcome this limitation, we propose SmilesGEN, a novel generative model based on variational autoencoder (VAE) architecture to generate molecules with potential therapeutic effects. SmilesGEN integrates a pre-trained drug VAE (SmilesNet) with an expression profile VAE (ProfileNet), jointly modeling the interplay between drug perturbations and transcriptional responses in a common latent space. Specifically, ProfileNet is imposed to reconstruct pre-treatment expression profiles when eliminating drug-induced perturbations in the latent space, while SmilesNet is informed by desired expression profiles to generate drug-like molecules. Our empirical experiments demonstrate that SmilesGEN outperforms current state-of-the-art models in generating molecules with higher degree of validity, uniqueness, novelty, as well as higher Tanimoto similarity to known ligands targeting the relevant proteins. Moreover, we evaluate SmilesGEN for scaffold-based molecule optimization and generation of therapeutic agents, and confirmed its superior performance in generating molecules with higher similarity to approved drugs. SmilesGEN establishes a robust framework that leverages gene signatures to generate drug-like molecules that hold promising potential to induce desirable cellular phenotypic changes.
zh

[AI-137] No Audiogram: Leverag ing Existing Scores for Personalized Speech Intelligibility Prediction INTERSPEECH2025

【速读】:该论文试图解决个性化语音可懂度预测的难题,传统方法主要依赖于纯音听力图(audiogram),但其在准确性上存在固有限制,因为仅能捕捉听者对纯音的听阈。论文提出的解决方案的关键在于利用个体已有的可懂度数据,而非依赖额外的听者特征,通过引入支持样本基础的可懂度预测网络(Support Sample-Based Intelligibility Prediction Network, SSIPNet),该深度学习模型借助语音基础模型构建听者语音识别能力的高维表示,从而实现对未见过音频的准确预测。

链接: https://arxiv.org/abs/2506.02039
作者: Haoshuai Zhou,Changgeng Mo,Boxuan Cao,Linkai Li,Shan Xiang Wang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Personalized speech intelligibility prediction is challenging. Previous approaches have mainly relied on audiograms, which are inherently limited in accuracy as they only capture a listener’s hearing threshold for pure tones. Rather than incorporating additional listener features, we propose a novel approach that leverages an individual’s existing intelligibility data to predict their performance on new audio. We introduce the Support Sample-Based Intelligibility Prediction Network (SSIPNet), a deep learning model that leverages speech foundation models to build a high-dimensional representation of a listener’s speech recognition ability from multiple support (audio, score) pairs, enabling accurate predictions for unseen audio. Results on the Clarity Prediction Challenge dataset show that, even with a small number of support (audio, score) pairs, our method outperforms audiogram-based predictions. Our work presents a new paradigm for personalized speech intelligibility prediction.
zh

[AI-138] Re-experiment Smart: a Novel Method to Enhance Data-driven Prediction of Mechanical Properties of Epoxy Polymers

【速读】:该论文试图解决实验测量中不可避免的异常值(outlier)对机器学习结果的严重干扰问题,这一问题会导致预测模型错误和材料设计性能不佳。解决方案的关键在于通过集成多算法异常值检测与不可靠异常值案例的有选择性重实验,高效提升数据集质量。该方法在仅需约5%数据进行额外实验的情况下,显著降低了预测误差(RMSE),并提高了多种机器学习模型(如Elastic Net、SVR、Random Forest和TPOT)在预测聚合物关键力学性能(包括玻璃化转变温度T_g、tan δ峰和交联密度v_c)时的准确性。

链接: https://arxiv.org/abs/2506.01994
作者: Wanshan Cui,Yejin Jeong,Inwook Song,Gyuri Kim,Minsang Kwon,Donghun Lee
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 27 pages, 8 figures

点击查看摘要

Abstract:Accurate prediction of polymer material properties through data-driven approaches greatly accelerates novel material development by reducing redundant experiments and trial-and-error processes. However, inevitable outliers in empirical measurements can severely skew machine learning results, leading to erroneous prediction models and suboptimal material designs. To address this limitation, we propose a novel approach to enhance dataset quality efficiently by integrating multi-algorithm outlier detection with selective re-experimentation of unreliable outlier cases. To validate the empirical effectiveness of the approach, we systematically construct a new dataset containing 701 measurements of three key mechanical properties: glass transition temperature (T_g), tan δ peak, and crosslinking density (v_c). To demonstrate its general applicability, we report the performance improvements across multiple machine learning models, including Elastic Net, SVR, Random Forest, and TPOT, to predict the three key properties. Our method reliably reduces prediction error (RMSE) and significantly improves accuracy with minimal additional experimental work, requiring only about 5% of the dataset to be re-experimented. Our findings highlight the importance of data quality enhancement in achieving reliable machine learning applications in polymer science and present a scalable strategy for improving predictive reliability in materials science.
zh
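下面给出“多算法异常检测 + 多数投票筛选复测样本”思路的示意:被三种检测器中至少两种标记的样本进入重实验队列。检测器组合与污染率均为演示假设。

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), rng.normal(6, 1, size=(8, 3))])

votes = np.zeros(len(X), dtype=int)
votes += IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1
votes += LocalOutlierFactor(contamination=0.05).fit_predict(X) == -1
votes += (np.abs(stats.zscore(X)).max(axis=1) > 3).astype(int)

re_experiment = np.where(votes >= 2)[0]   # 不可靠样本:安排重实验
print(f"{len(re_experiment)} samples selected for re-experimentation")
```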

机器学习

[LG-0] Not All Tokens Are Meant to Be Forgotten

链接: https://arxiv.org/abs/2506.03142
作者: Xiangyu Zhou,Yao Qiang,Saleh Zare Zade,Douglas Zytko,Prashant Khanduri,Dongxiao Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs), pre-trained on massive text corpora, exhibit remarkable human-level language understanding, reasoning, and decision-making abilities. However, they tend to memorize unwanted information, such as private or copyrighted content, raising significant privacy and legal concerns. Unlearning has emerged as a promising solution, but existing methods face a significant challenge of over-forgetting. This issue arises because they indiscriminately suppress the generation of all the tokens in forget samples, leading to a substantial loss of model utility. To overcome this challenge, we introduce the Targeted Information Forgetting (TIF) framework, which consists of (1) a flexible targeted information identifier designed to differentiate between unwanted words (UW) and general words (GW) in the forget samples, and (2) a novel Targeted Preference Optimization approach that leverages Logit Preference Loss to unlearn unwanted information associated with UW and Preservation Loss to retain general information in GW, effectively improving the unlearning process while mitigating utility degradation. Extensive experiments on the TOFU and MUSE benchmarks demonstrate that the proposed TIF framework enhances unlearning effectiveness while preserving model utility and achieving state-of-the-art results.
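下面用带掩码的逐 token 交叉熵示意“只遗忘不想要的词(UW)、保留一般词(GW)”的核心思想;这里的简化带符号损失仅为示意,并非论文的 Logit Preference Loss。

```python
import torch
import torch.nn.functional as F

def targeted_forget_loss(logits, labels, uw_mask, lam=1.0):
    # logits: (batch, seq, vocab); labels: (batch, seq); uw_mask: UW 处为 1
    tok_ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    # 遗忘项:最大化 UW token 的交叉熵(即压低其生成概率)
    forget = -(tok_ce * uw_mask).sum() / uw_mask.sum().clamp(min=1)
    # 保留项:最小化 GW token 的交叉熵,维持模型效用
    preserve = (tok_ce * (1 - uw_mask)).sum() / (1 - uw_mask).sum().clamp(min=1)
    return preserve + lam * forget

logits = torch.randn(2, 5, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 5))
uw_mask = (torch.rand(2, 5) < 0.3).float()
print(targeted_forget_loss(logits, labels, uw_mask))
```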

[LG-1] Zero-Shot Time Series Forecasting with Covariates via In-Context Learning

链接: https://arxiv.org/abs/2506.03128
作者: Andreas Auer,Raghul Parthipan,Pedro Mercado,Abdul Fatir Ansari,Lorenzo Stella,Bernie Wang,Michael Bohlke-Schneider,Syama Sundar Rangapuram
类目: Machine Learning (cs.LG)
*备注: The paper was written at the end of 2024

点击查看摘要

Abstract:Pretrained time series models, capable of zero-shot forecasting, have demonstrated significant potential in enhancing both the performance and accessibility of time series forecasting. However, existing pretrained models either do not support covariates or fail to incorporate them effectively. We introduce COSMIC, a zero-shot forecasting model that utilizes covariates via in-context learning. To address the challenge of data scarcity, we propose Informative Covariate Augmentation, which enables the training of COSMIC without requiring any datasets that include covariates. COSMIC achieves state-of-the-art performance in zero-shot forecasting, both with and without covariates. Our quantitative and qualitative analysis demonstrates that COSMIC effectively leverages covariates in zero-shot forecasting.

[LG-2] Rectified Flows for Fast Multiscale Fluid Flow Modeling

链接: https://arxiv.org/abs/2506.03111
作者: Victor Armegioiu,Yannick Ramic,Siddhartha Mishra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The statistical modeling of fluid flows is very challenging due to their multiscale dynamics and extreme sensitivity to initial conditions. While recently proposed conditional diffusion models achieve high fidelity, they typically require hundreds of stochastic sampling steps at inference. We introduce a rectified flow framework that learns a time-dependent velocity field, transporting input to output distributions along nearly straight trajectories. By casting sampling as solving an ordinary differential equation (ODE) along this straighter flow field, our method makes each integration step much more effective, using as few as eight steps versus (more than) 128 steps in standard score-based diffusion, without sacrificing predictive fidelity. Experiments on challenging multiscale flow benchmarks show that rectified flows recover the same posterior distributions as diffusion models, preserve fine-scale features that MSE-trained baselines miss, and deliver high-resolution samples in a fraction of inference time.
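下面是“沿学到的速度场做少步 ODE 采样”的极简示意:对 dx/dt = v(x, t) 从 t=0 到 t=1 做 8 步欧拉积分。`velocity` 为占位的速度网络,真实模型还需以输入分布为条件。

```python
import torch

def sample_rectified_flow(velocity, x0, n_steps=8):
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * velocity(x, t)   # 一步欧拉积分,沿近似直线的轨迹前进
    return x

velocity = lambda x, t: -x            # 玩具速度场,仅为演示
x1 = sample_rectified_flow(velocity, torch.randn(4, 16))
print(x1.shape)  # torch.Size([4, 16])
```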

[LG-3] On Weak-to-Strong Generalization and f-Divergence

链接: https://arxiv.org/abs/2506.03109
作者: Wei Yao,Gengze Xu,Huayi Tang,Wenkai Yang,Donglin Di,Ziqiao Wang,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Weak-to-strong generalization (W2SG) has emerged as a promising paradigm for stimulating the capabilities of strong pre-trained models by leveraging supervision from weaker supervisors. To improve the performance of the strong model, existing methods often require additional weak models or complex procedures, leading to substantial computational and memory overhead. Motivated by the effectiveness of f-divergence loss in various machine learning domains, we introduce f-divergence as an information-theoretic loss function framework in W2SG. Our theoretical analysis reveals fundamental limitations and equivalence of different f-divergence losses in W2SG, supported by sample complexity bounds and information-theoretic insights. We empirically demonstrate that f-divergence loss, which generalizes widely-used metrics like KL divergence, effectively improves generalization and noise tolerance of the strong model in practice.

[LG-4] From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

链接: https://arxiv.org/abs/2506.03093
作者: Valérie Costa,Thomas Fel,Ekdeep Singh Lubana,Bahareh Tolooshams,Demba Ba
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE – an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
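下面给出固定字典上的匹配追踪(MP)编码器示意:每步选出与残差最相关的原子、记录系数并从残差中扣除其贡献。MP-SAE 将这一残差引导的循环展开并配以可学习字典;此处字典为随机生成,仅作演示。

```python
import numpy as np

def matching_pursuit(x, D, n_steps=4):
    # D: (n_atoms, dim),各行单位范数;x: (dim,)
    residual, codes = x.copy(), np.zeros(len(D))
    for _ in range(n_steps):
        corr = D @ residual
        j = np.argmax(np.abs(corr))   # 与残差最匹配的原子
        codes[j] += corr[j]
        residual -= corr[j] * D[j]    # 残差引导的更新
    return codes, residual

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
codes, res = matching_pursuit(rng.normal(size=16), D)
print("nonzeros:", np.count_nonzero(codes),
      "residual norm:", round(float(np.linalg.norm(res)), 3))
```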

[LG-5] Non-Asymptotic Length Generalization

链接: https://arxiv.org/abs/2506.03085
作者: Thomas Chen,Tengyu Ma,Zhiyuan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of the ground-truth function under some given complexity measure. We refer to this minimum input length to length generalize as length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is 2n - 2, where n is the number of states of the ground-truth automaton. Our main results are upper bounds on length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is O(T^2) when the ground-truth function has precision T, and that the length complexity of 2-layer C-RASP functions is O(T^{O(K)}) when the ground-truth function has precision T and K heads.

[LG-6] Agnostic Learning under Targeted Poisoning: Optimal Rates and the Role of Randomness

链接: https://arxiv.org/abs/2506.03075
作者: Bogdan Chornomaz,Yonatan Koren,Shay Moran,Tom Waknine
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We study the problem of learning in the presence of an adversary that can corrupt an \eta fraction of the training examples with the goal of causing failure on a specific test point. In the realizable setting, prior work established that the optimal error under such instance-targeted poisoning attacks scales as \Theta(d\eta), where d is the VC dimension of the hypothesis class (arXiv:2210.02713). In this work, we resolve the corresponding question in the agnostic setting. We show that the optimal excess error is \tilde{\Theta}(\sqrt{d\eta}), answering one of the main open problems left by Hanneke et al. To achieve this rate, it is necessary to use randomized learners: Hanneke et al. showed that deterministic learners can be forced to suffer error close to 1, even under small amounts of poisoning. Perhaps surprisingly, our upper bound remains valid even when the learner's random bits are fully visible to the adversary. In the other direction, our lower bound is stronger than standard PAC-style bounds: instead of tailoring a hard distribution separately for each sample size, we exhibit a single fixed distribution under which the adversary can enforce an excess error of \Omega(\sqrt{d\eta}) infinitely often.

[LG-7] Provable Reinforcement Learning from Human Feedback with an Unknown Link Function

链接: https://arxiv.org/abs/2506.03066
作者: Qining Zhang,Lei Ying
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Link functions, which characterize how human preferences are generated from the value function of an RL problem, are a crucial component in designing RLHF algorithms. Almost all RLHF algorithms, including state-of-the-art ones in empirical studies such as DPO and PPO, assume the link function is known to the agent (e.g., a logistic function according to the Bradley-Terry model), which is arguably unrealistic considering the complex nature of human preferences. To avoid link function mis-specification, this paper studies general RLHF problems with unknown link functions. We propose a novel policy optimization algorithm called ZSPO based on a new zeroth-order policy optimization method, where the key is to use human preference to construct a parameter update direction that is positively correlated with the true policy gradient direction. ZSPO achieves it by estimating the sign of the value function difference instead of estimating the gradient from the value function difference, so it does not require knowing the link function. Under mild conditions, ZSPO converges to a stationary policy with a polynomial convergence rate depending on the number of policy iterations and trajectories per iteration. Numerical results also show the superiority of ZSPO under link function mismatch.
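
The core mechanism, estimating only the sign of the value difference from preferences, can be shown in a few lines. The sketch below is our illustrative rendition, not the paper's ZSPO algorithm: the preference oracle is simulated from a hidden value function, and the perturbation scale and step size are arbitrary choices.

```python
# Zeroth-order, preference-driven parameter update: no gradients, no link function.
import torch

def zeroth_order_preference_step(theta, prefer, mu=0.1, lr=0.01):
    """One update: prefer(a, b) -> +1 if parameters a are preferred, else -1."""
    u = torch.randn_like(theta)                 # random perturbation direction
    s = prefer(theta + mu * u, theta - mu * u)  # sign of value difference only
    return theta + lr * s * u

# Toy check: hidden "true value" is -||theta - target||^2; preferences noiseless.
target = torch.tensor([1.0, -2.0])
value = lambda th: -((th - target) ** 2).sum()
prefer = lambda a, b: 1.0 if value(a) > value(b) else -1.0

theta = torch.zeros(2)
for _ in range(2000):
    theta = zeroth_order_preference_step(theta, prefer)
print(theta)  # approaches the target without ever seeing value gradients
```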

[LG-8] Multi-Metric Adaptive Experimental Design under Fixed Budget with Validation

链接: https://arxiv.org/abs/2506.03062
作者: Qining Zhang,Tanner Fiez,Yi Liu,Wenyang Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Standard A/B tests in online experiments face statistical power challenges when testing multiple candidates simultaneously, while adaptive experimental designs (AED) alone fall short in inferring experiment statistics such as the average treatment effect, especially with many metrics (e.g., revenue, safety) and heterogeneous variances. This paper proposes a fixed-budget multi-metric AED framework with a two-phase structure: an adaptive exploration phase to identify the best treatment, and a validation phase with an A/B test to verify the treatment’s quality and infer statistics. We propose SHRVar, which generalizes sequential halving (SH) (Karnin et al., 2013) with a novel relative-variance-based sampling and an elimination strategy built on reward z-values. It achieves a provable error probability that decreases exponentially, where the exponent generalizes the complexity measure for SH (Karnin et al., 2013) and SHVar (Lalitha et al., 2023) with homogeneous and heterogeneous variances, respectively. Numerical experiments verify our analysis and demonstrate the superior performance of this new framework.
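
As a rough illustration of the exploration phase, here is a sequential-halving loop that allocates each round's budget in proportion to empirical arm variances. The allocation and elimination rules below are our plausible reading of the abstract, not SHRVar's exact definitions.

```python
# Illustrative sequential halving with variance-proportional budget allocation.
import numpy as np

def sh_variance(pull, K: int, budget: int):
    """pull(k) -> one noisy reward for arm k; returns the surviving arm index."""
    arms = list(range(K))
    rewards = {k: [pull(k), pull(k)] for k in arms}          # warm-up pulls
    rounds = max(1, int(np.log2(K)))
    while len(arms) > 1:
        # allocate this round's budget in proportion to each arm's variance
        var = np.array([np.var(rewards[k]) + 1e-8 for k in arms])
        alloc = np.maximum(1, (budget // rounds * var / var.sum()).astype(int))
        for k, n in zip(arms, alloc):
            rewards[k].extend(pull(k) for _ in range(n))
        # eliminate the worse half by empirical mean (the paper uses z-values)
        means = {k: np.mean(rewards[k]) for k in arms}
        arms = sorted(arms, key=means.get, reverse=True)[: max(1, len(arms) // 2)]
    return arms[0]

rng = np.random.default_rng(0)
true_means = [0.1, 0.2, 0.5, 0.3]
best = sh_variance(lambda k: rng.normal(true_means[k], 0.5), K=4, budget=2000)
print(best)  # usually 2, the best arm
```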

[LG-9] Sample complexity of Schrödinger potential estimation

链接: https://arxiv.org/abs/2506.03043
作者: Nikita Puchkin,Iurii Pustovalov,Yuri Sapronov,Denis Suchkov,Alexey Naumov,Denis Belomestny
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 60 pages

点击查看摘要

Abstract:We address the problem of Schrödinger potential estimation, which plays a crucial role in modern generative modelling approaches based on Schrödinger bridges and stochastic optimal control for SDEs. Given a simple prior diffusion process, these methods search for a path between two given distributions \rho_0 and \rho_T^* requiring minimal effort. The optimal drift in this case can be expressed through a Schrödinger potential. In the present paper, we study the generalization ability of an empirical Kullback-Leibler (KL) risk minimizer over a class of admissible log-potentials aimed at fitting the marginal distribution at time T . Under reasonable assumptions on the target distribution \rho_T^* and the prior process, we derive a non-asymptotic high-probability upper bound on the KL-divergence between \rho_T^* and the terminal density corresponding to the estimated log-potential. In particular, we show that the excess KL-risk may decrease as fast as O(\log^2 n / n) when the sample size n tends to infinity even if both \rho_0 and \rho_T^* have unbounded supports.

[LG-10] On the Need to Align Intent and Implementation in Uncertainty Quantification for Machine Learning

链接: https://arxiv.org/abs/2506.03037
作者: Shubhendu Trivedi,Brian D. Nord
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Quantifying uncertainties for machine learning (ML) models is a foundational challenge in modern data analysis. This challenge is compounded by at least two key aspects of the field: (a) inconsistent terminology surrounding uncertainty and estimation across disciplines, and (b) the varying technical requirements for establishing trustworthy uncertainties in diverse problem contexts. In this position paper, we aim to clarify the depth of these challenges by identifying these inconsistencies and articulating how different contexts impose distinct epistemic demands. We examine the current landscape of estimation targets (e.g., prediction, inference, simulation-based inference), uncertainty constructs (e.g., frequentist, Bayesian, fiducial), and the approaches used to map between them. Drawing on the literature, we highlight and explain examples of problematic mappings. To help address these issues, we advocate for standards that promote alignment between the \textit{intent} and \textit{implementation} of uncertainty quantification (UQ) approaches. We discuss several axes of trustworthiness that are necessary (if not sufficient) for reliable UQ in ML models, and show how these axes can inform the design and evaluation of uncertainty-aware ML systems. Our practical recommendations focus on scientific ML, offering illustrative cases and use scenarios, particularly in the context of simulation-based inference (SBI).

[LG-11] Protein Inverse Folding From Structure Feedback

链接: https://arxiv.org/abs/2506.03028
作者: Junde Xu,Zijun Gao,Xinyi Zhou,Jie Hu,Xingyi Cheng,Le Song,Guangyong Chen,Pheng-Ann Heng,Jiezhong Qiu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The inverse folding problem, aiming to design amino acid sequences that fold into desired three-dimensional structures, is pivotal for various biotechnological applications. Here, we introduce a novel approach leveraging Direct Preference Optimization (DPO) to fine-tune an inverse folding model using feedback from a protein folding model. Given a target protein structure, we begin by sampling candidate sequences from the inverse-folding model, then predict the three-dimensional structure of each sequence with the folding model to generate pairwise structural-preference labels. These labels are used to fine-tune the inverse-folding model under the DPO objective. Our results on the CATH 4.2 test set demonstrate that DPO fine-tuning not only improves sequence recovery of baseline models but also leads to a significant improvement in average TM-Score from 0.77 to 0.81, indicating enhanced structure similarity. Furthermore, iterative application of our DPO-based method on challenging protein structures yields substantial gains, with an average TM-Score increase of 79.5% relative to the baseline model. This work establishes a promising direction for enhancing protein sequence design ability from structure feedback by effectively utilizing preference optimization.
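
The training loop pairs sampled sequences by their predicted structural quality and applies the standard DPO objective. In the sketch below only the DPO loss itself is standard; the sequences, scores, and log-probabilities are fabricated stand-ins for the inverse-folding model, the folding model (e.g., a TM-Score-style scorer), and the frozen reference policy.

```python
# Standard DPO loss plus a simple structural-preference pairing step.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """DPO objective on (winner, loser) sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def make_pairs(seqs, scores):
    """Pair sampled sequences into (higher-score, lower-score) preferences,
    where `scores` would come from folding each sequence and comparing
    the predicted structure to the target."""
    order = sorted(range(len(seqs)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2
    return [(seqs[order[i]], seqs[order[-(i + 1)]]) for i in range(half)]

pairs = make_pairs(["AAAA", "BBBB", "CCCC", "DDDD"], [0.91, 0.55, 0.83, 0.40])
print(pairs)  # [('AAAA', 'DDDD'), ('CCCC', 'BBBB')]

# Fabricated log-probs standing in for the policy and frozen reference model:
logp_w = torch.tensor([-10.0, -12.0], requires_grad=True)
logp_l = torch.tensor([-11.0, -15.0], requires_grad=True)
loss = dpo_loss(logp_w, logp_l, torch.tensor([-10.5, -12.5]), torch.tensor([-10.8, -14.0]))
loss.backward()
print(float(loss))
```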

[LG-12] How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face

链接: https://arxiv.org/abs/2506.03013
作者: Alexandra González,Xavier Franch,David Lo,Silverio Martínez-Fernández
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Our repository mining study began with a systematically gathered database of PTMs from the HF API, considering their model card descriptions and metadata, and the abstract of the associated arXiv papers. We confirmed SE relevance through multiple filtering steps: detecting outliers, identifying near-identical PTMs, and the use of Gemini 2.0 Flash, which was validated with five pilot studies involving three human annotators. This approach uncovered 2,205 SE PTMs. We find that code generation is the most common SE task among PTMs, primarily focusing on software implementation, while requirements engineering and software design activities receive limited attention. In terms of ML tasks, text generation dominates within SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2. Our classification provides a solid foundation for future automated SE scenarios, such as the sampling and selection of suitable PTMs.

[LG-13] Implicit Regularization of the Deep Inverse Prior Trained with Inertia

链接: https://arxiv.org/abs/2506.02986
作者: Nathan Buskulic,Jalal Fadili,Yvain Quéau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving inverse problems with neural networks currently comes with very few theoretical recovery guarantees. In this work, we provide convergence and recovery guarantees for self-supervised neural networks applied to inverse problems, such as the Deep Image/Inverse Prior, trained with inertia featuring both viscous and geometric Hessian-driven damping. We study both the continuous-time case, i.e., the trajectory of a dynamical system, and the discrete case leading to an inertial algorithm with an adaptive step-size. We show in the continuous-time case that the network can be trained with an optimal accelerated exponential convergence rate compared to the rate obtained with gradient flow. We also show that training a network with our inertial algorithm enjoys similar recovery guarantees though with a less sharp linear convergence rate.

[LG-14] On the Robustness of Tabular Foundation Models: Test-Time Attacks and In-Context Defenses

链接: https://arxiv.org/abs/2506.02978
作者: Mohamed Djilani,Thibault Simonetto,Karim Tit,Florian Tambon,Paul Récamier,Salah Ghamizi,Maxime Cordy,Mike Papadakis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent tabular Foundation Models (FMs), such as TabPFN and TabICL, leverage in-context learning to achieve strong performance without gradient updates or fine-tuning. However, their robustness to adversarial manipulation remains largely unexplored. In this work, we present a comprehensive study of the adversarial vulnerabilities of tabular FMs, focusing on both their fragility to targeted test-time attacks and their potential misuse as adversarial tools. We show on three benchmarks in finance, cybersecurity and healthcare that small, structured perturbations to test inputs can significantly degrade prediction accuracy, even when the training context remains fixed. Additionally, we demonstrate that tabular FMs can be repurposed to generate evasion attacks that transfer to conventional models such as random forests and XGBoost, and, to a lesser extent, to deep tabular models. To improve tabular FMs, we formulate the robustification problem as an optimization of the weights (adversarial fine-tuning), or the context (adversarial in-context learning). We introduce an in-context adversarial training strategy that incrementally replaces the context with adversarially perturbed instances, without updating model weights. Our approach improves robustness across multiple tabular benchmarks. Together, these findings position tabular FMs as both a target and a source of adversarial threats, highlighting the urgent need for robust training and evaluation practices in this emerging paradigm.
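
The proposed in-context adversarial training can be pictured as a loop that perturbs context rows against the query loss while leaving the weights untouched. The sketch below uses a toy similarity-vote model standing in for a TabPFN-like interface `model(ctx_x, ctx_y, qry_x) -> logits`, and an FGSM-style perturbation as one simple attack choice; both are our assumptions, not the paper's components.

```python
# In-context adversarial training sketch: frozen weights, perturbed context.
import torch
import torch.nn.functional as F

class NearestContextModel(torch.nn.Module):
    """Toy stand-in for an in-context tabular classifier."""
    def forward(self, ctx_x, ctx_y, qry_x):
        attn = torch.softmax(qry_x @ ctx_x.t(), dim=-1)   # query-context similarity
        return attn @ F.one_hot(ctx_y, num_classes=3).float()

def adversarial_in_context_round(model, ctx_x, ctx_y, qry_x, qry_y,
                                 eps=0.05, frac=0.25):
    """One round: replace a fraction of context rows with FGSM-perturbed copies."""
    ctx = ctx_x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(ctx, ctx_y, qry_x), qry_y)
    (grad,) = torch.autograd.grad(loss, ctx)
    with torch.no_grad():
        idx = torch.randperm(ctx_x.shape[0])[: int(frac * ctx_x.shape[0])]
        ctx_x[idx] += eps * grad[idx].sign()              # weights stay untouched
    return ctx_x

model = NearestContextModel()
ctx_x, ctx_y = torch.randn(32, 8), torch.randint(0, 3, (32,))
qry_x, qry_y = torch.randn(16, 8), torch.randint(0, 3, (16,))
for _ in range(5):                                        # incremental replacement
    ctx_x = adversarial_in_context_round(model, ctx_x, ctx_y, qry_x, qry_y)
print(ctx_x.shape)
```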

[LG-15] Computation- and Communication-Efficient Online FL for Resource-Constrained Aerial Vehicles

链接: https://arxiv.org/abs/2506.02972
作者: Md-Ferdous Pervej,Richeng Jin,Md Moin Uddin Chowdhury,Simran Singh,İsmail Güvenç,Huaiyu Dai
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Privacy-preserving distributed machine learning (ML) and aerial connected vehicle (ACV)-assisted edge computing have drawn significant attention lately. Since the onboard sensors of ACVs can capture new data as they move along their trajectories, the continual arrival of such 'newly' sensed data leads to online learning and demands carefully crafting the trajectories. Besides, as typical ACVs are inherently resource-constrained, computation- and communication-efficient ML solutions are needed. Therefore, we propose a computation- and communication-efficient online aerial federated learning (2CEOAFL) algorithm to take the benefits of continual sensed data and limited onboard resources of the ACVs. In particular, considering independently owned ACVs act as selfish data collectors, we first model their trajectories according to their respective time-varying data distributions. We then propose a 2CEOAFL algorithm that allows the flying ACVs to (a) prune the received dense ML model to make it shallow, (b) train the pruned model, and (c) probabilistically quantize and offload their trained accumulated gradients to the central server (CS). Our extensive simulation results show that the proposed 2CEOAFL algorithm delivers performance comparable to its non-pruned and non-quantized, hence computation- and communication-inefficient, counterparts.

[LG-16] Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs

链接: https://arxiv.org/abs/2506.02965
作者: Ze Yu Zhang,Bolin Ding,Bryan Kian Hsiang Low
类目: Machine Learning (cs.LG)
*备注: 20 pages, 4 figures,

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has been gaining popularity due to its successful adaptation to large language models (LLMs). In this work, we introduce Privacy-preserving Collaborative Mixture-of-Experts (PC-MoE), which leverages the sparsity of the MoE architecture for memory-efficient decentralized collaborative LLM training, enabling multiple parties with limited GPU-memory and data resources to collectively train more capable LLMs than they could achieve individually. At the same time, this approach protects the training data privacy of each participant by keeping the training data, as well as parts of the forward-pass signal and gradients, local to each party. By design, PC-MoE synergistically combines the strengths of distributed computation with strong confidentiality assurances. Unlike most privacy-preserving schemes, which pay for confidentiality with lower task accuracy, our framework breaks that trade-off: across seven popular LLM benchmarks, it almost matches (and sometimes exceeds) the performance and convergence rate of a fully centralized model, and enjoys a nearly 70% reduction in peak GPU RAM, while being fully robust against reconstruction attacks.

[LG-17] Abstract Counterfactuals for Language Model Agents

链接: https://arxiv.org/abs/2506.02946
作者: Edoardo Pona,Milad Kazemi,Yali Du,David Watson,Nicola Paoletti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactual inference is a powerful tool for analysing and evaluating autonomous agents, but its application to language model (LM) agents remains challenging. Existing work on counterfactuals in LMs has primarily focused on token-level counterfactuals, which are often inadequate for LM agents due to their open-ended action spaces. Unlike traditional agents with fixed, clearly defined action spaces, the actions of LM agents are often implicit in the strings they output, making their action spaces difficult to define and interpret. Furthermore, the meanings of individual tokens can shift depending on the context, adding complexity to token-level reasoning and sometimes leading to biased or meaningless counterfactuals. We introduce \emph{Abstract Counterfactuals}, a framework that emphasises high-level characteristics of actions and interactions within an environment, enabling counterfactual reasoning tailored to user-relevant features. Our experiments demonstrate that the approach produces consistent and meaningful counterfactuals while minimising the undesired side effects of token-level methods. We conduct experiments on text-based games and counterfactual text generation, while considering both token-level and latent-space interventions.

[LG-18] QKV Projections Require a Fraction of Their Memory

链接: https://arxiv.org/abs/2506.02939
作者: Malik Khalf,Yara Shamshoum,Nitzan Hodos,Yuval Sieradzki,Assaf Schuster
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the Q, K, and V tensors from the input x is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that reduces memory consumption of the Q, K, V projections in attention layers by a factor of up to 512\times, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.

[LG-19] MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver

链接: https://arxiv.org/abs/2506.02935
作者: Yuepeng Zheng,Fu Luo,Zhenkun Wang,Yaoxin Wu,Yu Zhou
类目: Machine Learning (cs.LG)
*备注: 24 pages,5 figures, 8 tables

点击查看摘要

Abstract:Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model’s generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
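
The label-free distillation step can be illustrated with a temperature-scaled KL loss averaged over tasks, where each task's RL-trained single-task model acts as a teacher for the shared heavy-decoder student. The shapes and the plain soft-target loss below are our illustrative assumptions, not the paper's exact objective.

```python
# Multi-task knowledge distillation: student matches several task teachers.
import torch
import torch.nn.functional as F

def mtl_kd_loss(student_logits_per_task, teacher_logits_per_task, T: float = 2.0):
    """Average soft-target distillation loss over tasks (temperature T)."""
    losses = []
    for s, t in zip(student_logits_per_task, teacher_logits_per_task):
        p_teacher = F.softmax(t / T, dim=-1)
        log_p_student = F.log_softmax(s / T, dim=-1)
        losses.append(F.kl_div(log_p_student, p_teacher,
                               reduction="batchmean") * T * T)
    return torch.stack(losses).mean()

# Toy usage: 3 VRP variants, batch of 8 states, 50 candidate next nodes each.
student = [torch.randn(8, 50, requires_grad=True) for _ in range(3)]
teachers = [torch.randn(8, 50) for _ in range(3)]   # frozen single-task models
loss = mtl_kd_loss(student, teachers)
loss.backward()
print(float(loss))
```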

[LG-20] From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation

链接: https://arxiv.org/abs/2506.02933
作者: Junyi Fang,Yuxun Chen,Yuxin Chen,Chen Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 5 figures, 4 tables, submitted to Applied Intelligence, code available at this https URL

点击查看摘要

Abstract:The Multi-Armed Bandit (MAB) problem is challenging in non-stationary environments where reward distributions evolve dynamically. We introduce RAVEN-UCB, a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation. It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order K\sigma_{\max}^2 \log T / \Delta and gap-independent regret of order \sqrt{KT \log T} . RAVEN-UCB incorporates three innovations: (1) variance-driven exploration using \sqrt{\hat\sigma_k^2 / (N_k + 1)} in confidence bounds, (2) adaptive control via \alpha_t = \alpha_0 / \log(t + \epsilon) , and (3) constant-time recursive updates for efficiency. Experiments across non-stationary patterns - distributional changes, periodic shifts, and temporary fluctuations - in synthetic and logistics scenarios demonstrate its superiority over state-of-the-art baselines, confirming theoretical and practical robustness.
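
The two quoted formulas are concrete enough to run. The sketch below combines them into a UCB-style index; how the variance term and the adaptive coefficient enter the final index is our assumption, not the paper's exact rule.

```python
# Variance-aware UCB index built from the abstract's two formulas.
import math
import numpy as np

def raven_ucb_index(mean, var, n_k, t, alpha0: float = 1.0, eps: float = 1e-8):
    """Empirical mean + UCB1 bonus + adaptive variance-aware bonus."""
    ucb1 = math.sqrt(2.0 * math.log(max(t, 2)) / n_k)
    alpha_t = alpha0 / math.log(t + eps)                  # adaptive control
    return mean + ucb1 + alpha_t * math.sqrt(var / (n_k + 1.0))

rng = np.random.default_rng(1)
true_means, K, T = [0.2, 0.5, 0.4], 3, 5000
n = np.ones(K)
s = np.array([rng.normal(m, 0.3) for m in true_means])   # reward sums
s2 = s ** 2                                               # squared-reward sums
for t in range(K + 1, T):
    means = s / n
    variances = np.maximum(s2 / n - means ** 2, 0.0)
    k = int(np.argmax([raven_ucb_index(means[i], variances[i], n[i], t)
                       for i in range(K)]))
    r = rng.normal(true_means[k], 0.3)
    n[k] += 1; s[k] += r; s2[k] += r * r                  # constant-time updates
print(n)  # the best arm (index 1) receives the bulk of the pulls
```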

[LG-21] Sociodynamics-inspired Adaptive Coalition and Client Selection in Federated Learning

链接: https://arxiv.org/abs/2506.02897
作者: Alessandro Licciardi,Roberta Raineri,Anton Proskurnikov,Lamberto Rondoni,Lorenzo Zino
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables privacy-preserving collaborative model training, yet its practical strength is often undermined by client data heterogeneity, which severely degrades model performance. This paper proposes that data heterogeneity across clients' distributions can be effectively addressed by adopting an approach inspired by opinion dynamics over temporal social networks. We introduce Federated Coalition Variance Reduction with Boltzmann Exploration, a variance-reducing selection algorithm in which (1) clients dynamically organize into non-overlapping clusters based on asymptotic agreements, and (2) from each cluster, one client is selected to minimize the expected variance of its model update. Our experiments show that in heterogeneous scenarios our algorithm outperforms existing FL algorithms, yielding more accurate results and faster convergence, validating the efficacy of our approach.

[LG-22] Overcoming Challenges of Partial Client Participation in Federated Learning: A Comprehensive Review ICIP

链接: https://arxiv.org/abs/2506.02887
作者: Mrinmay Sen,Shruti Aparna,Rohit Agarwal,Chalavadi Krishna Mohan
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 15 pages, 6 tables, comprehensive survey of federated learning with partial client participation

点击查看摘要

Abstract:Federated Learning (FL) is a learning mechanism that falls under the distributed training umbrella, which collaboratively trains a shared global model without disclosing the raw data from different clients. This paper presents an extensive survey on the impact of partial client participation in federated learning. While much of the existing research focuses on addressing issues such as generalization, robustness, and fairness caused by data heterogeneity under the assumption of full client participation, limited attention has been given to the practical and theoretical challenges arising from partial client participation, which is common in real-world scenarios. This survey provides an in-depth review of existing FL methods designed to cope with partial client participation. We offer a comprehensive analysis supported by theoretical insights and empirical findings, along with a structured categorization of these methods, highlighting their respective advantages and disadvantages.

[LG-23] A Continual Offline Reinforcement Learning Benchmark for Navigation Tasks

链接: https://arxiv.org/abs/2506.02883
作者: Anthony Kobanda,Odalric-Ambrym Maillard,Rémy Portelas
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2412.14865

点击查看摘要

Abstract:Autonomous agents operating in domains such as robotics or video game simulations must adapt to changing tasks without forgetting the previous ones. This process, called Continual Reinforcement Learning, poses non-trivial difficulties, from preventing catastrophic forgetting to ensuring the scalability of the approaches considered. Building on recent advances, we introduce a benchmark providing a suite of video-game navigation scenarios, thus filling a gap in the literature and capturing key challenges: catastrophic forgetting, task adaptation, and memory efficiency. We define a set of various tasks and datasets, evaluation protocols, and metrics to assess the performance of algorithms, including state-of-the-art baselines. Our benchmark is designed not only to foster reproducible research and to accelerate progress in continual reinforcement learning for gaming, but also to provide a reproducible framework for production pipelines – helping practitioners identify and apply effective approaches.

[LG-24] Learned Controllers for Agile Quadrotors in Pursuit-Evasion Games

链接: https://arxiv.org/abs/2506.02849
作者: Alejandro Sanchez Roncero,Olov Andersson,Petter Ogren
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing proliferation of small UAVs in civilian and military airspace has raised critical safety and security concerns, especially when unauthorized or malicious drones enter restricted zones. In this work, we present a reinforcement learning (RL) framework for agile 1v1 quadrotor pursuit-evasion. We train neural network policies to command body rates and collective thrust, enabling high-speed pursuit and evasive maneuvers that fully exploit the quadrotor’s nonlinear dynamics. To mitigate nonstationarity and catastrophic forgetting during adversarial co-training, we introduce an Asynchronous Multi-Stage Population-Based (AMSPB) algorithm where, at each stage, either the pursuer or evader learns against a sampled opponent drawn from a growing population of past and current policies. This continual learning setup ensures monotonic performance improvement and retention of earlier strategies. Our results show that (i) rate-based policies achieve significantly higher capture rates and peak speeds than velocity-level baselines, and (ii) AMSPB yields stable, monotonic gains against a suite of benchmark opponents.
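
The AMSPB training structure can be sketched as a population bookkeeping loop: each stage trains one side against an opponent sampled from the growing pool of the other side's past and current policies. Everything below (string "policies", uniform sampling, the training callback) is placeholder scaffolding, not the paper's implementation.

```python
# Population-based co-training skeleton in the spirit of AMSPB.
import random

class PopulationStage:
    def __init__(self):
        self.pursuers, self.evaders = [], []

    def run_stage(self, learner_role, train_fn, init_policy):
        """Train one side vs. an opponent sampled from the other side's population."""
        pool = self.evaders if learner_role == "pursuer" else self.pursuers
        opponent = random.choice(pool) if pool else init_policy   # bootstrap stage
        new_policy = train_fn(opponent)                           # RL training stub
        (self.pursuers if learner_role == "pursuer"
         else self.evaders).append(new_policy)                    # grow the pool
        return new_policy

stages = PopulationStage()
for i in range(4):                       # alternate which side learns each stage
    role = "pursuer" if i % 2 == 0 else "evader"
    stages.run_stage(role, train_fn=lambda opp: f"{role}-v{i}-vs-{opp}",
                     init_policy="random-policy")
print(stages.pursuers, stages.evaders)
```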

[LG-25] Ensemble-MIX: Enhancing Sample Efficiency in Multi-Agent RL Using Ensemble Methods

链接: https://arxiv.org/abs/2506.02841
作者: Tom Danino,Nahum Shimkin
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) methods have achieved state-of-the-art results on a range of multi-agent tasks. Yet, MARL algorithms typically require significantly more environment interactions than their single-agent counterparts to converge, a problem exacerbated by the difficulty in exploring over a large joint action space and the high variance intrinsic to MARL environments. To tackle these issues, we propose a novel algorithm that combines a decomposed centralized critic with decentralized ensemble learning, incorporating several key contributions. The main component in our scheme is a selective exploration method that leverages ensemble kurtosis. We extend the global decomposed critic with a diversity-regularized ensemble of individual critics and utilize its excess kurtosis to guide exploration toward high-uncertainty states and actions. To improve sample efficiency, we train the centralized critic with a novel truncated variation of the TD(\lambda) algorithm, enabling efficient off-policy learning with reduced variance. On the actor side, our suggested algorithm adapts the mixed samples approach to MARL, mixing on-policy and off-policy loss functions for training the actors. This approach balances between stability and efficiency and outperforms purely off-policy learning. The evaluation shows our method outperforms state-of-the-art baselines on standard MARL benchmarks, including a variety of SMAC II maps.
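
The exploration signal, excess kurtosis across ensemble critics, is easy to compute. Below is an illustrative version with fabricated Q-estimates; how the bonus is folded into action selection in the full algorithm is not shown here.

```python
# Excess kurtosis of ensemble Q-estimates as an uncertainty signal.
import numpy as np
from scipy.stats import kurtosis

def kurtosis_exploration_bonus(q_ensemble):
    """q_ensemble: (n_critics, n_candidates) Q-estimates -> per-candidate bonus."""
    # fisher=True returns *excess* kurtosis (a normal distribution scores 0)
    return kurtosis(q_ensemble, axis=0, fisher=True, bias=False)

rng = np.random.default_rng(0)
qs = rng.normal(0.0, 1.0, size=(10, 5))
qs[:, 2] = rng.standard_t(df=3, size=10)   # heavy-tailed disagreement on action 2
bonus = kurtosis_exploration_bonus(qs)
print(bonus.argmax())                       # often 2: explore the uncertain action
```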

[LG-26] CART-based Synthetic Tabular Data Generation for Imbalanced Regression

链接: https://arxiv.org/abs/2506.02811
作者: António Pedro Pinheiro,Rita P. Ribeiro
类目: Machine Learning (cs.LG)
*备注: 15 pages, 2 figures, 5 tables, 1 algorithm

点击查看摘要

Abstract:Handling imbalanced target distributions in regression tasks remains a significant challenge in tabular data settings where underrepresented regions can hinder model performance. Among data-level solutions, approaches such as random sampling and SMOTE-based methods adapt classification techniques to regression tasks. However, these methods typically rely on crisp, artificial thresholds over the target variable, a limitation inherited from classification settings that can introduce arbitrariness, often leading to non-intuitive and potentially misleading problem formulations. While recent generative models, such as GANs and VAEs, provide flexible sample synthesis, they come with high computational costs and limited interpretability. In this study, we propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression. The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space and employs a threshold-free, feature-driven generation process. Our experimental study focuses on the prediction of extreme target values across benchmark datasets. The results indicate that the proposed method is competitive with other resampling and generative strategies in terms of performance, while offering faster execution and greater transparency. These results highlight the method's potential as a transparent, scalable data-level strategy for improving regression models in imbalanced domains.

[LG-27] A Learned Cost Model-based Cross-engine Optimizer for SQL Workloads

链接: https://arxiv.org/abs/2506.02802
作者: András Strausz,Niels Pardon,Ioana Giurgiu
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Lakehouse systems enable the same data to be queried with multiple execution engines. However, selecting the engine best suited to run a SQL query still requires a priori knowledge of the query's computational requirements and of engine capabilities, a complex and manual task that only becomes more difficult with the emergence of new engines and workloads. In this paper, we address this limitation by proposing a cross-engine optimizer that can automate engine selection for diverse SQL queries through a learned cost model. Optimized with hints, a query plan is used for query cost prediction and routing. Cost prediction is formulated as a multi-task learning problem, and multiple predictor heads, corresponding to different engines and provisionings, are used in the model architecture. This eliminates the need to train engine-specific models and allows the flexible addition of new engines at a minimal fine-tuning cost. Results on various databases and engines show that using a query-optimized logical plan for cost estimation decreases the average Q-error by as much as 12.6% compared with using unoptimized plans as input. Moreover, the proposed cross-engine optimizer reduces the total workload runtime by up to 25.2% in a zero-shot setting and 30.4% in a few-shot setting when compared to random routing.
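
The multi-head architecture described above can be sketched as a shared plan encoder with one regression head per engine, so adding an engine costs only a new head. The plan featurization and engine names below are placeholders of our own choosing.

```python
# Shared encoder + per-engine predictor heads for cross-engine cost prediction.
import torch

class CrossEngineCostModel(torch.nn.Module):
    def __init__(self, plan_dim: int, engines: list[str], hidden: int = 64):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(plan_dim, hidden), torch.nn.ReLU())
        self.heads = torch.nn.ModuleDict(
            {e: torch.nn.Linear(hidden, 1) for e in engines})

    def forward(self, plan_features: torch.Tensor) -> dict[str, torch.Tensor]:
        z = self.encoder(plan_features)              # shared plan representation
        return {e: head(z).squeeze(-1) for e, head in self.heads.items()}

model = CrossEngineCostModel(plan_dim=32, engines=["engine-small", "engine-large"])
costs = model(torch.randn(4, 32))                    # predicted cost per engine
best = min(costs, key=lambda e: float(costs[e].mean()))
print(best)                                          # route to the cheapest engine
```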

[LG-28] Accelerating Model-Based Reinforcement Learning using Non-Linear Trajectory Optimization

链接: https://arxiv.org/abs/2506.02767
作者: Marco Calì,Giulio Giacomuzzo,Ruggero Carli,Alberto Dalla Libera
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper addresses the slow policy optimization convergence of Monte Carlo Probabilistic Inference for Learning Control (MC-PILCO), a state-of-the-art model-based reinforcement learning (MBRL) algorithm, by integrating it with iterative Linear Quadratic Regulator (iLQR), a fast trajectory optimization method suitable for nonlinear systems. The proposed method, Exploration-Boosted MC-PILCO (EB-MC-PILCO), leverages iLQR to generate informative, exploratory trajectories and initialize the policy, significantly reducing the number of required optimization steps. Experiments on the cart-pole task demonstrate that EB-MC-PILCO accelerates convergence compared to standard MC-PILCO, achieving up to 45.9% reduction in execution time when both methods solve the task in four trials. EB-MC-PILCO also maintains a 100% success rate across trials while solving the task faster, even in cases where MC-PILCO converges in fewer iterations.

[LG-29] WeightLoRA: Keep Only Necessary Adapters

链接: https://arxiv.org/abs/2506.02724
作者: Andrey Veprikov,Vladimir Solodkin,Alexander Zyl,Andrey Savchenko,Aleksandr Beznosikov
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 13 pages, 9 tables

点击查看摘要

Abstract:The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation (\texttt{LoRA}), which adds trainable adapters to selected layers. Although \texttt{LoRA} may obtain accurate solutions, it requires significant memory to train large models, as well as intuition about which layers to add adapters to. In this paper, we propose a novel method, \texttt{WeightLoRA}, which overcomes this issue by adaptively selecting the most critical \texttt{LoRA} heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of \texttt{WeightLoRA} and the superior performance of \texttt{WeightLoRA+} in almost all cases.
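
One way to picture the adaptive head selection is to attach a trainable scalar weight to every LoRA head and periodically keep only the top-k by magnitude. The sketch below is our hedged reading of the abstract; the toy LoRA layer and the magnitude-based pruning rule are illustrative assumptions, not the paper's algorithm.

```python
# Toy LoRA heads with trainable importance weights and top-k selection.
import torch

class WeightedLoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)           # frozen pretrained weight
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.w = torch.nn.Parameter(torch.ones(()))      # head importance weight

    def forward(self, x):
        return self.base(x) + self.w * (x @ self.A.t() @ self.B.t())

def keep_top_k_heads(adapters, k: int):
    """Disable all but the k heads with the largest |w| (zero and freeze the rest)."""
    ranked = sorted(adapters, key=lambda m: m.w.abs().item(), reverse=True)
    for m in ranked[k:]:
        with torch.no_grad():
            m.w.zero_()
        m.w.requires_grad_(False)
        m.A.requires_grad_(False)
        m.B.requires_grad_(False)

adapters = [WeightedLoRALinear(torch.nn.Linear(16, 16)) for _ in range(6)]
with torch.no_grad():                                    # pretend training set these
    for i, m in enumerate(adapters):
        m.w.fill_(float(i) / 10)
keep_top_k_heads(adapters, k=2)
print([round(m.w.item(), 2) for m in adapters])          # only two heads stay active
```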

[LG-30] Theoretical Performance Guarantees for Partial Domain Adaptation via Partial Optimal Transport ICML2025

链接: https://arxiv.org/abs/2506.02712
作者: Jayadev Naram,Fredrik Hellström,Ziming Wang,Rebecka Jörnsten,Giuseppe Durisi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025

点击查看摘要

Abstract:In many scenarios of practical interest, labeled data from a target distribution are scarce while labeled data from a related source distribution are abundant. One particular setting of interest arises when the target label space is a subset of the source label space, leading to the framework of partial domain adaptation (PDA). Typical approaches to PDA involve minimizing a domain alignment term and a weighted empirical loss on the source data, with the aim of transferring knowledge between domains. However, a theoretical basis for this procedure is lacking, and in particular, most existing weighting schemes are heuristic. In this work, we derive generalization bounds for the PDA problem based on partial optimal transport. These bounds corroborate the use of the partial Wasserstein distance as a domain alignment term, and lead to theoretically motivated explicit expressions for the empirical source loss weights. Inspired by these bounds, we devise a practical algorithm for PDA, termed WARMPOT. Through extensive numerical experiments, we show that WARMPOT is competitive with recent approaches, and that our proposed weights improve on existing schemes.

[LG-31] Beyond Invisibility: Learning Robust Visible Watermarks for Stronger Copyright Protection UAI2025

链接: https://arxiv.org/abs/2506.02665
作者: Tianci Liu,Tong Yang,Quan Zhang,Qi Lei
类目: Machine Learning (cs.LG)
*备注: UAI 2025

点击查看摘要

Abstract:As AI advances, copyrighted content faces growing risk of unauthorized use, whether through model training or direct misuse. Building upon invisible adversarial perturbation, recent works developed copyright protections against the misuse of specific AI techniques, such as unauthorized personalization through DreamBooth. However, these methods offer only short-term security, as they require retraining whenever the underlying model architectures change. To establish long-term protection aiming at better robustness, we go beyond invisible perturbation, and propose a universal approach that embeds \textit{visible} watermarks that are \textit{hard to remove} into images. Grounded in a new probabilistic and inverse problem-based formulation, our framework maximizes the discrepancy between the \textit{optimal reconstruction} and the original content. We develop an effective and efficient approximation algorithm to circumvent an intractable bi-level optimization. Experimental results demonstrate the superiority of our approach across diverse scenarios.

[LG-32] Maximizing the Promptness of Metaverse Systems using Edge Computing by Deep Reinforcement Learning ATC2024

链接: https://arxiv.org/abs/2506.02657
作者: Tam Ninh Thi-Thanh,Trinh Van Chien,Hung Tran,Nguyen Hoai Son,Van Nhan Vo
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, and 2 tables. Published by IEEE at ATC2024

点击查看摘要

Abstract:Metaverse and Digital Twin (DT) have attracted much academic and industrial attention as approaches to the future digital world. This paper introduces the advantages of deep reinforcement learning (DRL) in assisting Metaverse system-based Digital Twin. In this system, we assume that it includes several Metaverse User devices collecting data from the real world to transfer it into the virtual world, a Metaverse Virtual Access Point (MVAP) undertaking the processing of data, and an edge computing server that receives the offloading data from the MVAP. The proposed model works under a dynamic environment with various parameters changing over time. The experimental results show that our proposed DRL algorithm is suitable for offloading tasks to ensure the promptness of DT in a dynamic environment.

[LG-33] HAM: A Hyperbolic Step to Regulate Implicit Bias

链接: https://arxiv.org/abs/2506.02630
作者: Tom Jacobs,Advait Gadhikar,Celia Rubio-Madrigal,Rebekka Burkholz
类目: Machine Learning (cs.LG)
*备注: 26 pages, 7 figures

点击查看摘要

Abstract:Understanding the implicit bias of optimization algorithms has become central to explaining the generalization behavior of deep learning models. For instance, the hyperbolic implicit bias induced by the overparameterization m \odot w (though effective in promoting sparsity) can result in a small effective learning rate, which slows down convergence. To overcome this obstacle, we propose HAM (Hyperbolic Aware Minimization), which alternates between an optimizer step and a new hyperbolic mirror step. We derive the Riemannian gradient flow for its combination with gradient descent, leading to improved convergence and a similar beneficial hyperbolic geometry as m \odot w for feature learning. We provide an interpretation of the algorithm by relating it to natural gradient descent, and an exact characterization of its implicit bias for underdetermined linear regression. HAM's implicit bias consistently boosts performance, even for dense training, as we demonstrate in experiments across diverse tasks, including vision, graph and node classification, and large language model fine-tuning. HAM is especially effective in combination with different sparsification methods, improving upon the state of the art. The hyperbolic step requires minimal computational and memory overhead, it succeeds even with small batch sizes, and its implementation integrates smoothly with existing optimizers.

[LG-34] Compositional Learning for Modular Multi-Agent Self-Organizing Networks

链接: https://arxiv.org/abs/2506.02616
作者: Qi Liao,Parijat Bhattacharjee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-organizing networks face challenges from complex parameter interdependencies and conflicting objectives. This study introduces two compositional learning approaches, Compositional Deep Reinforcement Learning (CDRL) and Compositional Predictive Decision-Making (CPDM), and evaluates their performance under training time and safety constraints in multi-agent systems. We propose a modular, two-tier framework with cell-level and cell-pair-level agents to manage heterogeneous agent granularities while reducing model complexity. Numerical simulations reveal a significant reduction in handover failures, along with improved throughput and latency, outperforming conventional multi-agent deep reinforcement learning approaches. The approach also demonstrates superior scalability, faster convergence, higher sample efficiency, and safer training in large-scale self-organizing networks.

[LG-35] Assessing the Completeness of Traffic Scenario Categories for Automated Highway Driving Functions via Cluster-based Analysis

链接: https://arxiv.org/abs/2506.02599
作者: Niklas Roßberg,Marion Neumeier,Sinan Hasirlioglu,Mohamed Essayed Bouzouraa,Michael Botsch
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The ability to operate safely in increasingly complex traffic scenarios is a fundamental requirement for Automated Driving Systems (ADS). Ensuring the safe release of ADS functions necessitates a precise understanding of the occurring traffic scenarios. To support this objective, this work introduces a pipeline for traffic scenario clustering and the analysis of scenario category completeness. The Clustering Vector Quantized - Variational Autoencoder (CVQ-VAE) is employed for the clustering of highway traffic scenarios and utilized to create various catalogs with differing numbers of traffic scenario categories. Subsequently, the impact of the number of categories on the completeness considerations of the traffic scenario categories is analyzed. The results show clustering performance that outperforms previous work. The trade-off between cluster quality and the amount of required data to maintain completeness is discussed based on the publicly available highD dataset.

[LG-36] Reachability Weighted Offline Goal-conditioned Resampling

链接: https://arxiv.org/abs/2506.02577
作者: Wenyan Yang,Joni Pajarinen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning (RL) relies on fixed datasets where many potential goals share the same state and action spaces. However, these potential goals are not explicitly represented in the collected trajectories. To learn a generalizable goal-conditioned policy, it is common to sample goals and state-action pairs uniformly using dynamic programming methods such as Q-learning. Uniform sampling, however, requires an intractably large dataset to cover all possible combinations and creates many unreachable state-goal-action pairs that degrade policy performance. Our key insight is that sampling should favor transitions that enable goal achievement. To this end, we propose Reachability Weighted Sampling (RWS). RWS uses a reachability classifier trained via positive-unlabeled (PU) learning on goal-conditioned state-action values. The classifier maps these values to a reachability score, which is then used as a sampling priority. RWS is a plug-and-play module that integrates seamlessly with standard offline RL algorithms. Experiments on six complex simulated robotic manipulation tasks, including those with a robot arm and a dexterous hand, show that RWS significantly improves performance. In one notable case, performance on the HandBlock-Z task improved by nearly 50 percent relative to the baseline. These results indicate the effectiveness of reachability-weighted sampling.
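
Once reachability scores exist, the sampling step itself is simple: scores become softmax priorities over the offline transitions. The sketch below fakes the scores with random numbers and omits the PU-learning classifier that RWS actually trains; the temperature parameter is our own addition for illustration.

```python
# Reachability-weighted batch sampling over an offline dataset.
import numpy as np

def reachability_weighted_batch(scores, batch_size, rng, temp: float = 1.0):
    """Sample transition indices with probability proportional to exp(score/temp)."""
    logits = np.asarray(scores, dtype=float) / temp
    p = np.exp(logits - logits.max())          # stable softmax over all transitions
    p /= p.sum()
    return rng.choice(len(scores), size=batch_size, p=p)

rng = np.random.default_rng(0)
scores = rng.uniform(-2, 2, size=10_000)       # stand-in reachability scores
idx = reachability_weighted_batch(scores, batch_size=256, rng=rng)
print(scores[idx].mean(), ">", scores.mean())  # batch is biased toward reachable pairs
```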

[LG-37] Privacy-Preserving Federated Convex Optimization: Balancing Partial-Participation and Efficiency via Noise Cancellation

链接: https://arxiv.org/abs/2506.02563
作者: Roie Reshef,Kfir Yehuda Levy
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2407.12396

点击查看摘要

Abstract:This paper tackles the challenge of achieving Differential Privacy (DP) in Federated Learning (FL) under partial-participation, where only a subset of the machines participate in each time-step. While previous work achieved optimal performance in full-participation settings, these methods struggled to extend to partial-participation scenarios. Our approach fills this gap by introducing a novel noise-cancellation mechanism that preserves privacy without sacrificing convergence rates or computational efficiency. We analyze our method within the Stochastic Convex Optimization (SCO) framework and show that it delivers optimal performance for both homogeneous and heterogeneous data distributions. This work expands the applicability of DP in FL, offering an efficient and practical solution for privacy-preserving learning in distributed systems with partial participation.

[LG-38] VerificAgent : Integrating Expert Knowledge and Fact-Checked Memory for Robust Domain-Specific Task Planning

链接: https://arxiv.org/abs/2506.02539
作者: Thong Q. Nguyen,Shubhang Desai,Yash Jain,Tanvir Aumi,Vishal Chowdhary
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual memory augmentation allows computer-use agents (CUAs) to learn from past interactions and refine their task-solving strategies over time. However, unchecked memory accumulation can introduce spurious or hallucinated “learnings” that degrade agent performance, particularly in domain-specific workflows such as productivity software. We present a novel framework, VerificAgent, that effectively manages memory for CUAs through (1) an expert-curated seed of domain knowledge, (2) iterative, trajectory-based memory refinement during training, and (3) a post-hoc fact-checking pass by human experts to sanitize accumulated memory before deployment. On OSWorld productivity tasks, VerificAgent achieves a 111.1% relative improvement in success rate over baseline CUA without any additional fine-tuning.

[LG-39] Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization

链接: https://arxiv.org/abs/2506.02504
作者: Xingyu Chen,Bokun Wang,Ming Yang,Quanqi Hu,Qihang Lin,Tianbao Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art results face two key limitations: (1) a high iteration complexity of O(1/\epsilon^6) under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are twofold: (i) We propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) We establish a new state-of-the-art iteration complexity of O(1/\epsilon^5) . Moreover, we apply our algorithms to multiple inequality constrained non-convex optimization problems involving smooth or weakly convex functional inequality constraints. By optimizing a smoothed hinge penalty based formulation, we achieve a new state-of-the-art complexity of O(1/\epsilon^5) for finding an (nearly) \epsilon -level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.

[LG-40] A Novel Deep Reinforcement Learning Method for Computation Offloading in Multi-User Mobile Edge Computing with Decentralization ATC2024

链接: https://arxiv.org/abs/2506.02458
作者: Nguyen Chi Long,Trinh Van Chien,Ta Hai Tung,Van Son Nguyen,Trong-Minh Hoang,Nguyen Ngoc Hai Dang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, and 1 table. Published by IEEE at ATC2024

点击查看摘要

Abstract:Mobile edge computing (MEC) allows devices with limited computational capabilities to offload computation-intensive workloads to neighboring MEC servers. This paper studies how deep reinforcement learning (DRL) algorithms are used in an MEC system to find feasible decentralized dynamic computation offloading strategies, which leads to the construction of an extensible MEC system that operates effectively with finite feedback. Even though the Deep Deterministic Policy Gradient (DDPG) algorithm, given knowledge of the MEC system, can be used to allocate power for both computation offloading and local execution, learning a computation offloading policy for each user independently, this solution still has some inherent weaknesses. Hence, we introduce a new approach for this problem based on the Twin Delayed DDPG algorithm, which enables us to overcome these weaknesses and investigate cases where users are mobile. Numerical results show that individual users can autonomously learn adequate policies through the proposed approach, and the performance of the suggested solution exceeds that of the conventional DDPG-based power control strategy.

[LG-41] Weak Supervision for Real World Graphs

链接: https://arxiv.org/abs/2506.02451
作者: Pratheeksha Nair,Reihaneh Rabbany
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Node classification in real-world graphs often suffers from label scarcity and noise, especially in high-stakes domains like human trafficking detection and misinformation monitoring. While direct supervision is limited, such graphs frequently contain weak signals, noisy or indirect cues, that can still inform learning. We propose WSNET, a novel weakly supervised graph contrastive learning framework that leverages these weak signals to guide robust representation learning. WSNET integrates graph structure, node features, and multiple noisy supervision sources through a contrastive objective tailored for weakly labeled data. Across three real-world datasets and synthetic benchmarks with controlled noise, WSNET consistently outperforms state-of-the-art contrastive and noisy-label learning methods by up to 15% in F1 score. Our results highlight the effectiveness of contrastive learning under weak supervision and the promise of exploiting imperfect labels in graph-based settings.

[LG-42] Enhancing Convergence Privacy and Fairness for Wireless Personalized Federated Learning: Quantization-Assisted Min-Max Fair Scheduling

链接: https://arxiv.org/abs/2506.02422
作者: Xiyu Zhao,Qimei Cui,Ziqiang Du,Weicai Li,Xi Yu,Wei Ni,Ji Zhang,Xiaofeng Tao,Ping Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized federated learning (PFL) offers a solution to balancing personalization and generalization by conducting federated learning (FL) to guide personalized learning (PL). Little attention has been given to wireless PFL (WPFL), where privacy concerns arise. Performance fairness of PL models is another challenge resulting from communication bottlenecks in WPFL. This paper exploits quantization errors to enhance the privacy of WPFL and proposes a novel quantization-assisted Gaussian differential privacy (DP) mechanism. We analyze the convergence upper bounds of individual PL models by considering the impact of the mechanism (i.e., quantization errors and Gaussian DP noises) and imperfect communication channels on the FL of WPFL. By minimizing the maximum of the bounds, we design an optimal transmission scheduling strategy that yields min-max fairness for WPFL with OFDMA interfaces. This is achieved by revealing the nested structure of this problem to decouple it into subproblems solved sequentially for the client selection, channel allocation, and power control, and for the learning rates and PL-FL weighting coefficients. Experiments validate our analysis and demonstrate that our approach substantially outperforms alternative scheduling strategies by 87.08%, 16.21%, and 38.37% in accuracy, the maximum test loss of participating clients, and fairness (Jain’s index), respectively.

[LG-43] Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning

链接: https://arxiv.org/abs/2506.02392
作者: Yuanyao Chen,Rongsheng Chen,Fu Luo,Zhenkun Wang
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2505.24627

点击查看摘要

Abstract:Neural Combinatorial Optimization (NCO) has emerged as a promising learning-based paradigm for addressing Vehicle Routing Problems (VRPs) by minimizing the need for extensive manual engineering. While existing NCO methods, trained on small-scale instances (e.g., 100 nodes), have demonstrated considerable success on problems of similar scale, their performance significantly degrades when applied to large-scale scenarios. This degradation arises from the distributional shift between training and testing data, rendering policies learned on small instances ineffective for larger problems. To overcome this limitation, we introduce a novel learning framework driven by Large Language Models (LLMs). This framework learns a projection between the training and testing distributions, which is then deployed to enhance the scalability of the NCO model. Notably, unlike prevailing techniques that necessitate joint training with the neural network, our approach operates exclusively during the inference phase, obviating the need for model retraining. Extensive experiments demonstrate that our method enables a backbone model (trained on 100-node instances) to achieve superior performance on large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) of up to 100K nodes from diverse distributions.

[LG-44] GAdaBoost: An Efficient and Robust AdaBoost Algorithm Based on Granular-Ball Structure

链接: https://arxiv.org/abs/2506.02390
作者: Qin Xie,Qinghua Zhang,Shuyin Xia,Xinran Zhou,Guoyin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive Boosting (AdaBoost) faces significant challenges posed by label noise, especially in multiclass classification tasks. Existing methods either lack mechanisms to handle label noise effectively or suffer from high computational costs due to redundant data usage. Inspired by granular computing, this paper proposes granular adaptive boosting (GAdaBoost), a novel two-stage framework comprising a data granulation stage and an adaptive boosting stage, to enhance efficiency and robustness under noisy conditions. To validate its feasibility, a granular-ball extension of SAMME is proposed. Specifically, first, a granular-ball generation method is designed to compress data while preserving diversity and mitigating label noise. Second, the granular ball-based SAMME algorithm focuses on granular balls rather than individual samples, improving efficiency and reducing sensitivity to noise. Experimental results on several noisy datasets show that the proposed approach achieves superior robustness and efficiency compared with existing methods, demonstrating that this work effectively extends AdaBoost and SAMME.

[LG-45] Multi-agent Markov Entanglement

链接: https://arxiv.org/abs/2506.02385
作者: Shuze Chen,Tianyi Peng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Value decomposition has long been a fundamental technique in multi-agent dynamic programming and reinforcement learning (RL). Specifically, the value function of a global state (s_1,s_2,\ldots,s_N) is often approximated as the sum of local functions: V(s_1,s_2,\ldots,s_N)\approx\sum_{i=1}^{N} V_i(s_i) . This approach traces back to the index policy in restless multi-armed bandit problems and has found various applications in modern RL systems. However, the theoretical justification for why this decomposition works so effectively remains underexplored. In this paper, we uncover the underlying mathematical structure that enables value decomposition. We demonstrate that a multi-agent Markov decision process (MDP) permits value decomposition if and only if its transition matrix is not “entangled” – a concept analogous to quantum entanglement in quantum physics. Drawing inspiration from how physicists measure quantum entanglement, we introduce a way to measure the “Markov entanglement” of multi-agent MDPs and show that this measure can be used to bound the decomposition error in general multi-agent MDPs. Using the concept of Markov entanglement, we prove that a widely-used class of index policies is weakly entangled and enjoys a sublinear \mathcal{O}(\sqrt{N}) scale of decomposition error for N-agent systems. Finally, we show how Markov entanglement can be efficiently estimated in practice, providing practitioners with an empirical proxy for the quality of value decomposition.
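
The "unentangled" case is easy to verify numerically: if two agents evolve independently, the joint transition matrix is the Kronecker product of the local ones, and with additive rewards the joint value function decomposes exactly. A small numpy check under these assumptions (fixed policy, discounted evaluation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(n):
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

n1, n2, gamma = 3, 4, 0.9
P1, P2 = random_stochastic(n1), random_stochastic(n2)
r1, r2 = rng.random(n1), rng.random(n2)

# "Unentangled" joint dynamics: agents evolve independently, so the joint
# transition matrix is a tensor (Kronecker) product of the local ones.
P = np.kron(P1, P2)
r = (r1[:, None] + r2[None, :]).ravel()        # additive rewards

V = np.linalg.solve(np.eye(n1 * n2) - gamma * P, r)   # exact joint value
V1 = np.linalg.solve(np.eye(n1) - gamma * P1, r1)     # local values
V2 = np.linalg.solve(np.eye(n2) - gamma * P2, r2)
V_decomposed = (V1[:, None] + V2[None, :]).ravel()

print(np.max(np.abs(V - V_decomposed)))   # ~1e-15: decomposition is exact
```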

[LG-46] Olfactory Inertial Odometry: Methodology for Effective Robot Navigation by Scent

链接: https://arxiv.org/abs/2506.02373
作者: Kordel K. France,Ovidiu Daescu
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Instrumentation and Detectors (physics.ins-det)
*备注:

点击查看摘要

Abstract:Olfactory navigation is one of the most primitive mechanisms of exploration used by organisms. Navigation by machine olfaction (artificial smell) is a very difficult task to both simulate and solve. With this work, we define olfactory inertial odometry (OIO), a framework that uses inertial kinematics and fast-sampling olfaction sensors to enable navigation by scent, analogous to visual inertial odometry (VIO). We establish how principles from SLAM and VIO can be extrapolated to olfaction to enable real-world robotic tasks. We demonstrate OIO with three different odour localization algorithms on a real 5-DoF robot arm in an odour-tracking scenario that resembles real applications in agriculture and food quality control. Our results indicate success in establishing a baseline framework for OIO from which other research in olfactory navigation can build, and we note performance enhancements that can be made to address more complex tasks in the future.

[LG-47] SFBD Flow: A Continuous-Optimization Framework for Training Diffusion Models with Noisy Samples

链接: https://arxiv.org/abs/2506.02371
作者: Haoye Lu,Darren Lo,Yaoliang Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models achieve strong generative performance but often rely on large datasets that may include sensitive content. This challenge is compounded by the models’ tendency to memorize training data, raising privacy concerns. SFBD (Lu et al., 2025) addresses this by training on corrupted data and using limited clean samples to capture local structure and improve convergence. However, its iterative denoising and fine-tuning loop requires manual coordination, making it burdensome to implement. We reinterpret SFBD as an alternating projection algorithm and introduce a continuous variant, SFBD flow, that removes the need for alternating steps. We further show its connection to consistency constraint-based methods, and demonstrate that its practical instantiation, Online SFBD, consistently outperforms strong baselines across benchmarks.

[LG-48] Reconciling Hessian-Informed Acceleration and Scalar-Only Communication for Efficient Federated Zeroth-Order Fine-Tuning

链接: https://arxiv.org/abs/2506.02370
作者: Zhe Li,Bicheng Ying,Zidong Liu,Chaosheng Dong,Haibo Yang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Under review

点击查看摘要

Abstract:Recent dimension-free communication frameworks in Federated Learning (FL), such as DeComFL, significantly reduce per-round communication by transmitting only scalars via zeroth-order stochastic gradient descent (ZO-SGD). This method is particularly advantageous for federated fine-tuning of Large Language Models (LLMs). Yet, the high variance in ZO gradient estimation typically leads to slow convergence. Although leveraging Hessian information is known to enhance optimization speed, integrating this into FL presents significant challenges. These include clients’ restrictions on local data and the critical need to maintain the dimension-free communication property. To overcome this limitation, we first introduce a generalized scalar-only communication FL framework that decouples dimension-free communication from standard ZO-SGD, enabling the integration of more advanced optimization strategies. Building on this framework, we propose HiSo, a fast federated fine-tuning method via Hessian-informed zeroth-order optimization and Scalar-only communication. Specifically, it leverages global curvature information to accelerate convergence while preserving the same minimal communication cost per round. Theoretically, we establish convergence guarantees that are independent of the global Lipschitz constant, and further show that HiSo achieves faster rates when the global Hessian exhibits a low effective rank – a common phenomenon in LLMs. Extensive experiments on benchmark datasets and LLM fine-tuning tasks confirm that HiSo significantly outperforms existing ZO-based FL methods in both convergence speed and communication efficiency.
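
For intuition, here is a toy numpy sketch of the scalar-only communication pattern that DeComFL-style methods rely on and that the paper generalizes: the client returns a single scalar per round, and the server reconstructs the full update direction from a shared random seed. HiSo's Hessian-informed preconditioning is omitted, and all names are illustrative.

```python
import numpy as np

def zo_scalar_round(x, loss_fn, seed, mu=1e-3):
    """One client's contribution in a scalar-only round: only the scalar
    directional-derivative estimate crosses the network; the perturbation
    direction is reproduced on both sides from a shared random seed."""
    u = np.random.default_rng(seed).standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2 * mu)  # one float

def server_apply(x, g_scalar, seed, lr=0.1):
    u = np.random.default_rng(seed).standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return x - lr * g_scalar * u   # server reconstructs the full direction

x = np.random.randn(50)
loss = lambda w: 0.5 * np.sum(w ** 2)
for t in range(2000):
    x = server_apply(x, zo_scalar_round(x, loss, seed=t), seed=t)
print(loss(x))   # converges despite only scalars being "communicated"
```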

[LG-49] Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

链接: https://arxiv.org/abs/2506.02355
作者: Andre He,Daniel Fried,Sean Welleck
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning has emerged as an effective framework for training large language models on structured language-conditioned tasks. We identify a critical flaw of Group Relative Policy Optimization (GRPO), a widely used RL algorithm in this setting. For tasks that require multi-sample performance, such as formal theorem proving, GRPO is biased toward reinforcing already probable solutions while neglecting rare but correct proofs. This implicit bias impairs performance on pass@N metrics at large sample sizes, limiting its practicality for training theorem provers. To address this, we introduce the unlikeliness reward, a straightforward method that explicitly encourages reinforcing rare correct solutions. Additionally, we find that increasing the number of PPO epochs further mitigates this bias. Our experiments confirm that incorporating the unlikeliness reward significantly improves pass@N across a large range of N, outperforming standard GRPO and substantially increasing sample diversity. Applying our revised recipe to Lean, we achieve competitive performance with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation, providing a simple yet effective recipe for training formal theorem provers with RL.
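
Below is a minimal sketch of how an unlikeliness-style bonus could reshape group-relative advantages, assuming access to each sampled solution's correctness and sequence log-probability; the rank-based bonus form here is our illustrative stand-in, not the paper's exact reward.

```python
import numpy as np

def grpo_advantages_with_unlikeliness(rewards, logprobs, beta=0.5):
    """Toy group-relative advantages with an 'unlikeliness' bonus.

    rewards:  binary correctness of each sampled solution in the group
    logprobs: sequence log-probabilities under the current policy
    The bonus up-weights correct solutions the policy currently finds
    improbable, counteracting GRPO's tendency to sharpen onto already
    likely ones."""
    rewards = np.asarray(rewards, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)
    # rank 0 = most likely sample, rank 1 = least likely (normalized)
    rank = np.argsort(np.argsort(-logprobs)) / max(len(logprobs) - 1, 1)
    shaped = rewards * (1.0 + beta * rank)
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)

# group of 4 sampled proofs: two correct; the rarer one gets a larger advantage
print(grpo_advantages_with_unlikeliness([1, 1, 0, 0], [-5.0, -40.0, -6.0, -40.0]))
```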

[LG-50] Discovery of Probabilistic Dirichlet-to-Neumann Maps on Graphs

链接: https://arxiv.org/abs/2506.02337
作者: Adrienne M. Propp,Jonas A. Actor,Elise Walker,Houman Owhadi,Nathaniel Trask,Daniel M. Tartakovsky
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Dirichlet-to-Neumann maps enable the coupling of multiphysics simulations across computational subdomains by ensuring continuity of state variables and fluxes at artificial interfaces. We present a novel method for learning Dirichlet-to-Neumann maps on graphs using Gaussian processes, specifically for problems where the data obey a conservation constraint from an underlying partial differential equation. Our approach combines discrete exterior calculus and nonlinear optimal recovery to infer relationships between vertex and edge values. This framework yields data-driven predictions with uncertainty quantification across the entire graph, even when observations are limited to a subset of vertices and edges. By optimizing over the reproducing kernel Hilbert space norm while applying a maximum likelihood estimation penalty on kernel complexity, our method ensures that the resulting surrogate strictly enforces conservation laws without overfitting. We demonstrate our method on two representative applications: subsurface fracture networks and arterial blood flow. Our results show that the method maintains high accuracy and well-calibrated uncertainty estimates even under severe data scarcity, highlighting its potential for scientific applications where limited data and reliable uncertainty quantification are critical.

[LG-51] Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models

链接: https://arxiv.org/abs/2506.02318
作者: Yuchen Liang,Renxiang Huang,Lifeng Lai,Ness Shroff,Yingbin Liang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the \tau-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need for early stopping.

[LG-52] CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

链接: https://arxiv.org/abs/2506.02306
作者: Aditya Gorla,Ryan Wang,Zhengtong Liu,Ulzee An,Sriram Sankararaman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features - captured by column names and text descriptions - to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods - an average R^2 gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) - across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
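
A small numpy sketch of the copy-masking idea, where training-time masks are borrowed from the dataset's own empirical missingness patterns; the median-truncation step and CACTI's contextual (column-text) encoder are omitted, and all names are illustrative.

```python
import numpy as np

def copy_mask_batch(x_batch, observed_mask, mask_bank, rng):
    """Apply 'copy masking': hide observed entries using missingness
    patterns copied from other rows of the dataset, so the imputer trains
    on realistic (structured) missingness rather than uniform dropout.

    x_batch:       (B, D) feature matrix with NaNs already at missing spots
    observed_mask: (B, D) boolean, True where x_batch is observed
    mask_bank:     (N, D) boolean pool of empirical patterns, True = missing
    """
    idx = rng.integers(0, len(mask_bank), size=len(x_batch))
    copied = mask_bank[idx]                  # patterns borrowed from the data
    train_mask = observed_mask & ~copied     # entries kept as model inputs
    target_mask = observed_mask & copied     # entries hidden, to be predicted
    x_in = np.where(train_mask, x_batch, np.nan)
    return x_in, target_mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
obs = rng.random((4, 6)) > 0.2
bank = rng.random((100, 6)) > 0.7            # empirical missingness pool
x_in, tgt = copy_mask_batch(np.where(obs, x, np.nan), obs, bank, rng)
print(x_in, tgt, sep="\n")
```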

[LG-53] Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation

链接: https://arxiv.org/abs/2506.02300
作者: Farzaneh Mahdisoltani,Saeed Mahdisoltani,Roger B. Grosse,David J. Fleet
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the internal representations and decision mechanisms of deep neural networks remains a critical open challenge. While existing interpretability methods often identify influential input regions, they may not elucidate how a model distinguishes between classes or what specific changes would transition an input from one category to another. To address these limitations, we propose a novel framework that visualizes the implicit path between classes by treating the network gradient as a form of infinitesimal motion. Drawing inspiration from phase-based motion magnification, we first decompose images using invertible transforms (specifically the Complex Steerable Pyramid), then compute class-conditional gradients in the transformed space. Rather than iteratively integrating the gradient to trace a full path, we amplify the one-step gradient to the input and perform a linear extrapolation to expose how the model moves from source to target class. By operating in the steerable pyramid domain, these amplified gradients produce semantically meaningful, spatially coherent morphs that highlight the classifier’s most sensitive directions, giving insight into the geometry of its decision boundaries. Experiments on both synthetic and real-world datasets demonstrate that our phase-focused extrapolation yields perceptually aligned, semantically meaningful transformations, offering a novel, interpretable lens into neural classifiers’ internal representations.

[LG-54] On Universality Classes of Equivariant Networks

链接: https://arxiv.org/abs/2506.02293
作者: Marco Pacini,Gabriele Santin,Bruno Lepri,Shubhendu Trivedi
类目: Machine Learning (cs.LG)
*备注: Preprint. Under review. 22 pages

点击查看摘要

Abstract:Equivariant neural networks provide a principled framework for incorporating symmetry into learning architectures and have been extensively analyzed through the lens of their separation power, that is, the ability to distinguish inputs modulo symmetry. This notion plays a central role in settings such as graph learning, where it is often formalized via the Weisfeiler-Leman hierarchy. In contrast, the universality of equivariant models, i.e., their capacity to approximate target functions, remains comparatively underexplored. In this work, we investigate the approximation power of equivariant neural networks beyond separation constraints. We show that separation power does not fully capture expressivity: models with identical separation power may differ in their approximation ability. To demonstrate this, we characterize the universality classes of shallow invariant networks, providing a general framework for understanding which functions these architectures can approximate. Since equivariant models reduce to invariant ones under projection, this analysis yields sufficient conditions under which shallow equivariant networks fail to be universal. Conversely, we identify settings where shallow models do achieve separation-constrained universality. These positive results, however, depend critically on structural properties of the symmetry group, such as the existence of adequate normal subgroups, which may not hold in important cases like permutation symmetry.

[LG-55] Learning Optimal Posted Prices for a Unit-Demand Buyer

链接: https://arxiv.org/abs/2506.02284
作者: Yifeng Teng,Yifan Wang
类目: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of learning the optimal item pricing for a unit-demand buyer with independent item values, where the learner has query access to the buyer’s value distributions. We consider two common query models in the literature: the sample access model, where the learner can obtain a sample of each item value, and the pricing query model, where the learner can set a price for an item and obtain a binary signal on whether the sampled value of the item is greater than the proposed price. In this work, we give nearly tight bounds on the sample complexity and pricing query complexity of the unit-demand pricing problem.

[LG-56] Latent Stochastic Interpolants

链接: https://arxiv.org/abs/2506.02276
作者: Saurabh Singh,Dmitry Lagun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under Review

点击查看摘要

Abstract:Stochastic Interpolants (SI) are a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, their use in jointly optimized latent variable models remains unexplored as they require direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI) enabling joint learning in a latent space with end-to-end optimized encoder, decoder and latent SI models. We achieve this by developing a principled Evidence Lower Bound (ELBO) objective derived directly in continuous time. The joint optimization allows LSI to learn effective latent representations along with a generative process that transforms an arbitrary prior distribution into the encoder-defined aggregated posterior. LSI sidesteps the simple priors of the normal diffusion models and mitigates the computational demands of applying SI directly in high-dimensional observation spaces, while preserving the generative flexibility of the SI framework. We demonstrate the efficacy of LSI through comprehensive experiments on the standard large scale ImageNet generation benchmark.

[LG-57] Towards Human-like Preference Profiling in Sequential Recommendation

链接: https://arxiv.org/abs/2506.02261
作者: Zhongyu Ouyang,Qianlong Wen,Chunhui Zhang,Yanfang Ye,Soroush Vosoughi
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential recommendation systems aspire to profile users by interpreting their interaction histories, echoing how humans make decisions by weighing experience, relative preference strength, and situational relevance. Yet, existing large language model (LLM)-based recommenders often fall short of mimicking the flexible, context-aware decision strategies humans exhibit, neglecting the structured, dynamic, and context-aware mechanisms fundamental to human behaviors. To bridge this gap, we propose RecPO, a preference optimization framework that models structured feedback and contextual delay to emulate human-like prioritization in sequential recommendation. RecPO exploits adaptive reward margins based on inferred preference hierarchies and temporal signals, enabling the model to favor immediately relevant items and to distinguish between varying degrees of preference and aversion. Extensive experiments across five real-world datasets demonstrate that RecPO not only yields performance gains over state-of-the-art baselines, but also mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
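
A hedged sketch of the general idea as a DPO-style pairwise objective: the required preference margin adapts to inferred preference strength and temporal recency. The exact functional form in RecPO may differ; everything below is illustrative.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_pref_loss(logp_chosen, logp_rejected, pref_gap, recency):
    """Pairwise preference loss with an adaptive margin (illustrative form):
    the required log-probability gap grows with the inferred strength of
    the preference and shrinks for stale (less recent) interactions.

    logp_chosen / logp_rejected: policy log-probs of the two items
    pref_gap: inferred preference-strength signal in [0, 1]
    recency:  temporal weight in [0, 1], 1 = most recent
    """
    margin = pref_gap * recency            # adaptive reward margin
    logits = (logp_chosen - logp_rejected) - margin
    return -F.logsigmoid(logits).mean()

lp_chosen = torch.tensor([-2.0, -1.5])
lp_rejected = torch.tensor([-2.5, -3.0])
print(adaptive_margin_pref_loss(lp_chosen, lp_rejected,
                                torch.tensor([0.8, 0.3]),
                                torch.tensor([1.0, 0.5])))
```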

[LG-58] SafeOR-Gym: A Benchmark Suite for Safe Reinforcement Learning Algorithms on Practical Operations Research Problems

链接: https://arxiv.org/abs/2506.02255
作者: Asha Ramanujam(1),Adam Elyoumi(1),Hao Chen(1),Sai Madhukiran Kompalli(1),Akshdeep Singh Ahluwalia(1),Shraman Pal(1),Dimitri J. Papageorgiou(2),Can Li(1) ((1) Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN (2) Energy Sciences, ExxonMobil Technology and Engineering Company, Annandale, NJ)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most existing safe reinforcement learning (RL) benchmarks focus on robotics and control tasks, offering limited relevance to high-stakes domains that involve structured constraints, mixed-integer decisions, and industrial complexity. This gap hinders the advancement and deployment of safe RL in critical areas such as energy systems, manufacturing, and supply chains. To address this limitation, we present SafeOR-Gym, a benchmark suite of nine operations research (OR) environments tailored for safe RL under complex constraints. Each environment captures a realistic planning, scheduling, or control problem characterized by cost-based constraint violations, planning horizons, and hybrid discrete-continuous action spaces. The suite integrates seamlessly with the Constrained Markov Decision Process (CMDP) interface provided by OmniSafe. We evaluate several state-of-the-art safe RL algorithms across these environments, revealing a wide range of performance: while some tasks are tractable, others expose fundamental limitations in current approaches. SafeOR-Gym provides a challenging and practical testbed that aims to catalyze future research in safe RL for real-world decision-making problems. The SafeOR-Gym framework and all accompanying code are available at: this https URL.

[LG-59] From Features to Structure: Task-Aware Graph Construction for Relational and Tabular Learning with GNNs

链接: https://arxiv.org/abs/2506.02243
作者: Tamara Cucumides,Floris Geerts
类目: Machine Learning (cs.LG)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Tabular and relational data remain the most ubiquitous formats in real-world machine learning applications, spanning domains from finance to healthcare. Although both formats offer structured representations, they pose distinct challenges for modern deep learning methods, which typically assume flat, feature-aligned inputs. Graph Neural Networks (GNNs) have emerged as a promising solution by capturing structural dependencies within and between tables. However, existing GNN-based approaches often rely on rigid, schema-derived graphs – such as those based on primary-foreign key links – thereby underutilizing rich, predictive signals in non-key attributes. In this work, we introduce auGraph, a unified framework for task-aware graph augmentation that applies to both tabular and relational data. auGraph enhances base graph structures by selectively promoting attributes into nodes, guided by scoring functions that quantify their relevance to the downstream prediction task. This augmentation preserves the original data schema while injecting task-relevant structural signal. Empirically, auGraph outperforms schema-based and heuristic graph construction methods by producing graphs that better support learning for relational and tabular prediction tasks.
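
A toy networkx sketch of attribute promotion under an assumed scoring function: a sufficiently predictive column value becomes a first-class node connected to the rows that carry it. auGraph's actual scoring functions are task-specific; the threshold and helper names below are illustrative.

```python
import networkx as nx

def promote_attribute(G, attr, score_fn, threshold=0.5):
    """Task-aware graph augmentation sketch: promote a node attribute to
    first-class nodes when the scoring function deems it predictive enough.
    Each distinct attribute value becomes a new node linked to every row
    node carrying that value; the original schema edges are kept."""
    if score_fn(G, attr) < threshold:
        return G                          # attribute not worth promoting
    H = G.copy()
    for n, data in G.nodes(data=True):
        if attr in data:
            v_node = f"{attr}={data[attr]}"   # one node per attribute value
            H.add_node(v_node, kind="attribute")
            H.add_edge(n, v_node)
    return H

G = nx.Graph()
G.add_node("row1", city="NY"); G.add_node("row2", city="NY")
G.add_node("row3", city="SF")
H = promote_attribute(G, "city", score_fn=lambda g, a: 0.9)
print(sorted(H.nodes()))   # row nodes plus 'city=NY', 'city=SF'
```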

[LG-60] From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models

链接: https://arxiv.org/abs/2506.02242
作者: Yihong Tang,Ao Qu,Xujing Yu,Weipeng Deng,Jun Ma,Jinhua Zhao,Lijun Sun
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Urban and transportation research has long sought to uncover statistically meaningful relationships between key variables and societal outcomes such as road safety, to generate actionable insights that guide the planning, development, and renewal of urban and transportation systems. However, traditional workflows face several key challenges: (1) reliance on human experts to propose hypotheses, which is time-consuming and prone to confirmation bias; (2) limited interpretability, particularly in deep learning approaches; and (3) underutilization of unstructured data that can encode critical urban context. Given these limitations, we propose UrbanX, a Multimodal Large Language Model (MLLM)-based approach for interpretable hypothesis inference, enabling the automated generation, evaluation, and refinement of hypotheses concerning urban context and road safety outcomes. Our method leverages MLLMs to craft safety-relevant questions for street view images (SVIs), extract interpretable embeddings from their responses, and apply them in regression-based statistical models. UrbanX supports iterative hypothesis testing and refinement, guided by statistical evidence such as coefficient significance, thereby enabling rigorous scientific discovery of previously overlooked correlations between urban design and safety. Experimental evaluations on Manhattan street segments demonstrate that our approach outperforms pretrained deep learning models while offering full interpretability. Beyond road safety, UrbanX can serve as a general-purpose framework for urban scientific discovery, extracting structured insights from unstructured urban data across diverse socioeconomic and environmental outcomes. This approach enhances model trustworthiness for policy applications and establishes a scalable, statistically grounded pathway for interpretable knowledge discovery in urban and transportation studies.

[LG-61] Second-order AAA algorithms for structured data-driven modeling

链接: https://arxiv.org/abs/2506.02241
作者: Michael S. Ackermann,Ion Victor Gosea,Serkan Gugercin,Steffen W. R. Werner
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: 37 pages, 6 figures, 3 tables

点击查看摘要

Abstract:The data-driven modeling of dynamical systems has become an essential tool for the construction of accurate computational models from real-world data. In this process, the inherent differential structures underlying the considered physical phenomena are often neglected making the reinterpretation of the learned models in a physically meaningful sense very challenging. In this work, we present three data-driven modeling approaches for the construction of dynamical systems with second-order differential structure directly from frequency domain data. Based on the second-order structured barycentric form, we extend the well-known Adaptive Antoulas-Anderson algorithm to the case of second-order systems. Depending on the available computational resources, we propose variations of the proposed method that prioritize either higher computation speed or greater modeling accuracy, and we present a theoretical analysis for the expected accuracy and performance of the proposed methods. Three numerical examples demonstrate the effectiveness of our new structured approaches in comparison to classical unstructured data-driven modeling.

[LG-62] Quantum Ensembling Methods for Healthcare and Life Science

链接: https://arxiv.org/abs/2506.02213
作者: Kahn Rhrissorrakrai,Kathleen E. Hamilton,Prerana Bangalore Parthsarathy,Aldo Guzman-Saenz,Tyler Alban,Filippo Utro,Laxmi Parida
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Learning on small data is a challenge frequently encountered in many real-world applications. In this work we study how effective quantum ensemble models are when trained on small data problems in healthcare and life sciences. We constructed multiple types of quantum ensembles for binary classification using up to 26 qubits in simulation and 56 qubits on quantum hardware. Our ensemble designs use minimal trainable parameters but require long-range connections between qubits. We tested these quantum ensembles on synthetic datasets and gene expression data from renal cell carcinoma patients with the task of predicting patient response to immunotherapy. From the performance observed in simulation and initial hardware experiments, we demonstrate how quantum embedding structure affects performance and discuss how to extract informative features and build models that can learn and generalize effectively. We present these exploratory results in order to assist other researchers in the design of effective learning on small data using ensembles. Incorporating quantum computing in these data constrained problems offers hope for a wide range of studies in healthcare and life sciences where biological samples are relatively scarce given the feature space to be explored.

[LG-63] Learning Treatment Representations for Downstream Instrumental Variable Regression

链接: https://arxiv.org/abs/2506.02200
作者: Shiangyi Lin,Hui Lan,Vasilis Syrgkanis
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Traditional instrumental variable (IV) estimators face a fundamental constraint: they can only accommodate as many endogenous treatment variables as available instruments. This limitation becomes particularly challenging in settings where the treatment is presented in a high-dimensional and unstructured manner (e.g. descriptions of patient treatment pathways in a hospital). In such settings, researchers typically resort to applying unsupervised dimension reduction techniques to learn a low-dimensional treatment representation prior to implementing IV regression analysis. We show that such methods can suffer from substantial omitted variable bias due to implicit regularization in the representation learning step. We propose a novel approach to construct treatment representations by explicitly incorporating instrumental variables during the representation learning process. Our approach provides a framework for handling high-dimensional endogenous variables with limited instruments. We demonstrate both theoretically and empirically that fitting IV models on these instrument-informed representations ensures identification of directions that optimize outcome prediction. Our experiments show that our proposed methodology improves upon the conventional two-stage approaches that perform dimension reduction without incorporating instrument information.

[LG-64] An Approximation Theory Perspective on Machine Learning

链接: https://arxiv.org/abs/2506.02168
作者: Hrushikesh N. Mhaskar,Efstratios Tsoukanis,Ameya D. Jagtap
类目: Machine Learning (cs.LG)
*备注: 56 pages

点击查看摘要

Abstract:A central problem in machine learning is often formulated as follows: Given a dataset \{(x_j, y_j)\}_{j=1}^{M}, which is a sample drawn from an unknown probability distribution, the goal is to construct a functional model f such that f(x) \approx y for any (x, y) drawn from the same distribution. Neural networks and kernel-based methods are commonly employed for this task due to their capacity for fast and parallel computation. The approximation capabilities, or expressive power, of these methods have been extensively studied over the past 35 years. In this paper, we will present examples of key ideas in this area found in the literature. We will discuss emerging trends in machine learning including the role of shallow/deep networks, approximation on manifolds, physics-informed neural surrogates, neural operators, and transformer architectures. Despite function approximation being a fundamental problem in machine learning, approximation theory does not play a central role in the theoretical foundations of the field. One unfortunate consequence of this disconnect is that it is often unclear how well trained models will generalize to unseen or unlabeled data. In this review, we examine some of the shortcomings of the current machine learning framework and explore the reasons for the gap between approximation theory and machine learning practice. We will then introduce our novel research to achieve function approximation on unknown manifolds without the need to learn specific manifold features, such as the eigen-decomposition of the Laplace-Beltrami operator or atlas construction. In many machine learning problems, particularly classification tasks, the labels y_j are drawn from a finite set of values.

[LG-65] Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

链接: https://arxiv.org/abs/2506.02138
作者: Yarden Bakish,Itamar Zimerman,Hila Chefer,Lior Wolf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values based on predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the Transformer architecture: its positional encoding (PE), resulting in violation of the conservation property, and the loss of an important and unique type of relevance, which is also associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This allows us to propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks. Our code is publicly available.

[LG-66] ReconXF: Graph Reconstruction Attack via Public Feature Explanations on Privatized Node Features and Labels

链接: https://arxiv.org/abs/2506.02134
作者: Rishi Raj Sahoo,Rucha Bhalchandra Joshi,Subhankar Mishra
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Graph Neural Networks (GNNs) achieve high performance across many applications but function as black-box models, limiting their use in critical domains like healthcare and criminal justice. Explainability methods address this by providing feature-level explanations that identify important node attributes for predictions. These explanations create privacy risks. Combined with auxiliary information, feature explanations can enable adversaries to reconstruct graph structure, exposing sensitive relationships. Existing graph reconstruction attacks assume access to original auxiliary data, but practical systems use differential privacy to protect node features and labels while providing explanations for transparency. We study a threat model where adversaries access public feature explanations along with privatized node features and labels. We show that existing explanation-based attacks like GSEF perform poorly with privatized data due to noise from differential privacy mechanisms. We propose ReconXF, a graph reconstruction attack for scenarios with public explanations and privatized auxiliary data. Our method adapts explanation-based frameworks by incorporating denoising mechanisms that handle differential privacy noise while exploiting structural signals in explanations. Experiments across multiple datasets show ReconXF outperforms SoTA methods in privatized settings, with improvements in AUC and average precision. Results indicate that public explanations combined with denoising enable graph structure recovery even under the privacy protection of auxiliary data. Code is available at (link to be made public after acceptance).

[LG-67] LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

链接: https://arxiv.org/abs/2506.02098
作者: Miran Özdogan,Gilad Landau,Gereon Elvers,Dulhan Jayalath,Pratik Somaiya,Francesco Mantegna,Mark Woolrich,Oiwi Parker Jones
类目: Machine Learning (cs.LG)
*备注: 37 pages, 14 figures, 13 tables. Under review

点击查看摘要

Abstract:LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings – 5\times larger than the next comparable dataset and 50\times larger than most. This unprecedented “depth” of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.

[LG-68] Comparison of spectrogram scaling in multi-label Music Genre Recognition

链接: https://arxiv.org/abs/2506.02091
作者: Bartosz Karpiński,Cyryl Leszczyński
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 14 pages, 10 figures

点击查看摘要

Abstract:As the accessibility and ease of use of digital audio workstations increase, so does the quantity of music available to the average listener; additionally, differences between genres are not always well defined and can be abstract, with widely varying combinations of genres across individual records. In this article, multiple preprocessing methods and approaches to model training are described and compared, accounting for the eclectic nature of today’s albums. A custom, manually labeled dataset of more than 18000 entries has been used to perform the experiments.
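
As an illustration of the kind of preprocessing choices being compared, the snippet below computes linear-, mel-, and dB-scaled spectrograms with librosa on a synthetic tone; the paper's exact scaling variants and parameters may differ.

```python
import numpy as np
import librosa

sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr * 2) / sr)  # 2 s test tone

# Linear-frequency magnitude spectrogram
S_lin = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Mel-scaled power spectrogram (perceptual frequency axis)
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# Log-amplitude (dB) versions, a common input to genre classifiers
S_lin_db = librosa.amplitude_to_db(S_lin, ref=np.max)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

for name, S in [("linear", S_lin), ("mel", S_mel),
                ("linear-dB", S_lin_db), ("mel-dB", S_mel_db)]:
    print(f"{name:10s} shape={S.shape} range=({S.min():.1f}, {S.max():.1f})")
```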

[LG-69] Temporal Causal-based Simulation for Realistic Time-series Generation

链接: https://arxiv.org/abs/2506.02084
作者: Nikolaos Gkorgkolis,Nikolaos Kougioulis,MingXue Wang,Bora Caglayan,Andrea Tonon,Dario Simionato,Ioannis Tsamardinos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, 3 figures

点击查看摘要

Abstract:Causal discovery (CD) plays a pivotal role in revealing relationships among observed variables, particularly in the temporal setup. While the majority of CD methods rely on synthetic data for evaluation, and recently for training, these fall short of accurately mirroring real-world scenarios, an effect even more evident in temporal data. Generation techniques that depend on simplified assumptions about causal structure, effects, and time limit the quality and diversity of the simulated data. In this work, we introduce Temporal Causal-based Simulation (TCS), a robust framework for generating realistic time-series data and their associated temporal causal graphs. The approach is structured in three phases: estimating the true lagged causal structure of the data, approximating the functional dependencies between variables, and learning the noise distribution of the corresponding causal model, each part of which can be explicitly tailored based on data assumptions and characteristics. Through an extensive evaluation process, we show that no single detection method is adequate for discriminating generated from real data, underscoring that realistic generation is a multifaceted challenge; to address it, we detail a min-max optimization phase that draws on AutoML techniques. Our contributions include a flexible, model-agnostic pipeline for generating realistic temporal causal data, a thorough evaluation setup that enhances the validity of the generated datasets, and insights into the challenges posed by realistic data generation. Through experiments involving not only real but also semi-synthetic and purely synthetic datasets, we demonstrate that while sampling realistic causal data remains a complex task, our method enriches the domain of generating sensible causal-based temporal data.

[LG-70] An Introduction to Flow Matching and Diffusion Models

链接: https://arxiv.org/abs/2506.02070
作者: Peter Holderrieth,Ezra Erives
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion and flow-based models have become the state of the art for generative AI across a wide range of data modalities, including images, videos, shapes, molecules, music, and more! These notes are originally from this https URL, as taught at MIT over the 2025 IAP (winter) term, and are intended to accompany other course content, including lectures and labs. Overall, they function as a self-contained introduction to both flow matching and diffusion models, starting with ordinary and stochastic differential equations, and culminating in flow matching, score matching, classifier-free guidance, and the inner workings of modern, state-of-the-art models for image and video. These notes, and the accompanying course, are ideal for students and practitioners alike who want to develop a principled understanding of the theory and practice of generative AI.
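
As a taste of the material, here is the standard conditional flow matching training loop for a toy 2-D target, using the linear interpolation path x_t = (1 - t) x_0 + t x_1, whose conditional velocity target is x_1 - x_0 (a minimal sketch, not from the notes themselves):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny velocity field v_theta(x, t) for 2-D toy data."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(model, x1):
    """Regress the model onto the conditional velocity x1 - x0 along the
    linear path x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, I)."""
    x0 = torch.randn_like(x1)
    t = torch.rand(len(x1), 1)
    xt = (1 - t) * x0 + t * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1 = torch.randn(256, 2) * 0.3 + torch.tensor([2.0, 0.0])   # toy data
for _ in range(200):
    opt.zero_grad()
    loss = flow_matching_loss(model, x1)
    loss.backward()
    opt.step()
print(loss.item())
```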

[LG-71] Blockchain Powered Edge Intelligence for U-Healthcare in Privacy Critical and Time Sensitive Environment

链接: https://arxiv.org/abs/2506.02038
作者: Anum Nawaz,Hafiz Humza Mahmood Ramzan,Xianjia Yu,Zhuo Zou,Tomi Westerlund
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Edge Intelligence (EI) serves as a critical enabler for privacy-preserving systems by providing AI-empowered computation and distributed caching services at the edge, thereby minimizing latency and enhancing data privacy. The integration of blockchain technology further augments EI frameworks by ensuring transactional transparency, auditability, and system-wide reliability through a decentralized network model. However, the operational architecture of such systems introduces inherent vulnerabilities, particularly due to the extensive data interactions between edge gateways (EGs) and the distributed nature of information storage during service provisioning. To address these challenges, we propose an autonomous computing model along with its interaction topologies tailored for privacy-critical and time-sensitive health applications. The system supports continuous monitoring, real-time alert notifications, disease detection, and robust data processing and aggregation. It also includes a data transaction handler and mechanisms for ensuring privacy at the EGs. Moreover, a resource-efficient one-dimensional convolutional neural network (1D-CNN) is proposed for the multiclass classification of arrhythmia, enabling accurate and real-time analysis of constrained EGs. Furthermore, a secure access scheme is defined to manage both off-chain and on-chain data sharing and storage. To validate the proposed model, comprehensive security, performance, and cost analyses are conducted, demonstrating the efficiency and reliability of the fine-grained access control scheme.
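
A minimal PyTorch sketch of a resource-light 1D-CNN of the kind described for multiclass arrhythmia classification on a constrained edge gateway; the layer sizes, window length, and class count below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Arrhythmia1DCNN(nn.Module):
    """Small 1D-CNN over raw ECG windows, sized for constrained hardware."""
    def __init__(self, n_classes=5, in_ch=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # length-independent head
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                   # x: (batch, 1, n_samples)
        return self.classifier(self.features(x).squeeze(-1))

model = Arrhythmia1DCNN()
beats = torch.randn(8, 1, 360)              # e.g. 1 s windows at 360 Hz
print(model(beats).shape)                   # (8, 5) class logits
```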

[LG-72] DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials

链接: https://arxiv.org/abs/2506.02023
作者: Kevin Han,Bowen Deng,Amir Barati Farimani,Gerbrand Ceder
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Large-scale atomistic simulations are essential to bridge computational materials and chemistry to realistic materials and drug discovery applications. In the past few years, rapid developments of machine learning interatomic potentials (MLIPs) have offered a solution to scale up quantum mechanical calculations. Parallelizing these interatomic potentials across multiple devices is challenging but promising, offering a path to further extend simulation scales to real-world applications. In this work, we present DistMLIP, an efficient distributed inference platform for MLIPs based on zero-redundancy, graph-level parallelization. In contrast to conventional space-partitioning parallelization, DistMLIP enables efficient MLIP parallelization through graph partitioning, allowing multi-device inference on flexible MLIP model architectures like multi-layer graph neural networks. DistMLIP presents an easy-to-use, flexible, plug-in interface that enables distributed inference of pre-existing MLIPs. We demonstrate DistMLIP on four widely used and state-of-the-art MLIPs: CHGNet, MACE, TensorNet, and eSEN. We show that existing foundational potentials can perform near-million-atom calculations at the scale of a few seconds on 8 GPUs with DistMLIP.

[LG-73] Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing

链接: https://arxiv.org/abs/2506.02006
作者: Zhaoyuan Su,Tingfeng Lan,Zirui Wang,Juncheng Yang,Yue Cheng
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimum runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45 percent and improves the P95 TTFT latency by 2.2x-3.9x compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.
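
To illustrate the layer-swapping mechanism (not MorphServe's actual implementation), the sketch below replaces a named nn.Linear sub-layer with an int8-quantized stand-in at runtime while leaving the rest of the model's state intact; the quantizer and load signal are simplified assumptions.

```python
import torch
import torch.nn as nn

def quantize_linear_int8(layer: nn.Linear) -> nn.Module:
    """Swap-in replacement for a Linear layer using symmetric int8 weights.
    (Dequantizing at forward time is for illustration only; a real serving
    system would use a fused low-precision kernel.)"""
    w = layer.weight.data
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

    class QuantLinear(nn.Module):
        def forward(self, x):
            bias = layer.bias if layer.bias is not None else 0
            return x @ (q.float() * scale).t() + bias
    return QuantLinear()

def swap_layer(model, name, high_load):
    """Replace a named sub-layer at runtime depending on load pressure."""
    parent = model
    *path, leaf = name.split(".")
    for p in path:
        parent = getattr(parent, p)
    if high_load:
        setattr(parent, leaf, quantize_linear_int8(getattr(parent, leaf)))

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
swap_layer(model, "2", high_load=True)      # quantize the last layer
print(model(torch.randn(1, 64)).shape)
```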

[LG-74] Machine Learning for Consistency Violation Faults Analysis

链接: https://arxiv.org/abs/2506.02002
作者: Kamal Giri,Amit Garu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:Distributed systems frequently encounter consistency violation faults (CVFs), where nodes operate on outdated or inaccurate data, adversely affecting convergence and overall system performance. This study presents a machine learning-based approach for analyzing the impact of CVFs, using Dijkstra’s Token Ring problem as a case study. By computing program transition ranks and their corresponding effects, the proposed method quantifies the influence of CVFs on system behavior. To address the state space explosion encountered in larger graphs, two models are implemented: a Feedforward Neural Network (FNN) and a distributed neural network leveraging TensorFlow’s distributed-training API. These models are trained on datasets generated from smaller graphs (3 to 10 nodes) to predict parameters essential for determining rank effects. Experimental results demonstrate promising performance, with a test loss of 4.39 and a mean absolute error of 1.5. Although distributed training on a CPU did not yield significant speed improvements over a single-device setup, the findings suggest that scalability could be enhanced through the use of advanced hardware accelerators such as GPUs or TPUs.

[LG-75] Traffic and Mobility Optimization Using AI: Comparative Study between Dubai and Riyadh

链接: https://arxiv.org/abs/2506.01974
作者: Kanwal Aalijah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Urban planning plays a very important role in the development of modern cities. It affects economic growth, quality of life, and environmental sustainability. Modern cities face challenges in managing traffic congestion, challenges that arise due to rapid urbanization. In this study we explore how AI can be used to understand traffic- and mobility-related issues and their effects on resident sentiment. The approach combines real-time traffic data with geo-located sentiment analysis, offering a comprehensive and dynamic approach to urban mobility planning. AI models and exploratory data analysis were used to predict traffic congestion patterns, analyze commuter behaviors, and identify congestion hotspots and dissatisfaction zones. The findings offer actionable recommendations for optimizing traffic flow, enhancing commuter experiences, and addressing city-specific mobility challenges in the Middle East and beyond.

[LG-76] A Data-Driven Approach to Enhancing Gravity Models for Trip Demand Prediction

链接: https://arxiv.org/abs/2506.01964
作者: Kamal Acharya,Mehul Lad,Liang Sun,Houbing Song
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, IEEE CAI-2025

点击查看摘要

Abstract:Accurate prediction of trips between zones is critical for transportation planning, as it supports resource allocation and infrastructure development across various modes of transport. Although the gravity model has been widely used due to its simplicity, it often inadequately represents the complex factors influencing modern travel behavior. This study introduces a data-driven approach to enhance the gravity model by integrating geographical, economic, social, and travel data from the counties in Tennessee and New York state. Using machine learning techniques, we extend the capabilities of the traditional model to handle more complex interactions between variables. Our experiments demonstrate that machine learning-enhanced models significantly outperform the traditional model. Our results show a 51.48% improvement in R-squared, indicating a substantial enhancement in the model’s explanatory power. Also, a 63.59% reduction in Mean Absolute Error (MAE) reflects a significant increase in prediction accuracy. Furthermore, a 44.32% increase in Common Part of Commuters (CPC) demonstrates improved prediction reliability. These findings highlight the substantial benefits of integrating diverse datasets and advanced algorithms into transportation models. They provide urban planners and policymakers with more reliable forecasting and decision-making tools.
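
A compact sketch of the comparison on synthetic data: a classical log-linear gravity fit versus a gradient-boosted model that additionally sees contextual features. The data-generating process and feature set below are invented for illustration and are not the paper's Tennessee/New York pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, n_train = 2000, 1500

# Synthetic county pairs: populations, distance, plus contextual features
pop_o = rng.uniform(1e4, 1e6, n)
pop_d = rng.uniform(1e4, 1e6, n)
dist = rng.uniform(5, 500, n)
income_gap = rng.normal(0, 1, n)
transit = rng.integers(0, 2, n).astype(float)
trips = (pop_o * pop_d) / dist**2 * 1e-8 \
        * np.exp(0.3 * income_gap + 0.5 * transit) \
        * rng.lognormal(0, 0.2, n)

log_t = np.log(trips)
X_grav = np.column_stack([np.log(pop_o * pop_d), np.log(dist)])

# Classical gravity model: log T = c + a*log(Po*Pd) - b*log(d)
A = np.column_stack([np.ones(n_train), X_grav[:n_train]])
coef, *_ = np.linalg.lstsq(A, log_t[:n_train], rcond=None)
pred_grav = np.column_stack([np.ones(n - n_train), X_grav[n_train:]]) @ coef

# ML-enhanced model: same core terms plus the contextual features
X_ml = np.column_stack([X_grav, income_gap, transit])
gbm = GradientBoostingRegressor().fit(X_ml[:n_train], log_t[:n_train])

print("gravity MAE:", np.abs(pred_grav - log_t[n_train:]).mean())
print("GBM MAE:    ", np.abs(gbm.predict(X_ml[n_train:]) - log_t[n_train:]).mean())
```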

[LG-77] Validating remotely sensed biomass estimates with forest inventory data in the western US

链接: https://arxiv.org/abs/2506.03120
作者: Xiuyu Cao,Joseph O. Sexton,Panshi Wang,Dimitrios Gounaridis,Neil H. Carter,Kai Zhu
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 32 pages, 5 figures

点击查看摘要

Abstract:Monitoring aboveground biomass (AGB) and its density (AGBD) at high resolution is essential for carbon accounting and ecosystem management. While NASA’s spaceborne Global Ecosystem Dynamics Investigation (GEDI) LiDAR mission provides globally distributed reference measurements for AGBD estimation, the majority of commercial remote sensing products based on GEDI remain without rigorous or independent validation. Here, we present an independent regional validation of an AGBD dataset offered by terraPulse, Inc., based on independent reference data from the US Forest Service Forest Inventory and Analysis (FIA) program. Aggregated to 64,000-hectare hexagons and US counties across the US states of Utah, Nevada, and Washington, we found very strong agreement between terraPulse and FIA estimates. At the hexagon scale, we report R2 = 0.88, RMSE = 26.68 Mg/ha, and a correlation coefficient (r) of 0.94. At the county scale, agreement improves to R2 = 0.90, RMSE = 32.62 Mg/ha, slope = 1.07, and r = 0.95. Spatial and statistical analyses indicated that terraPulse AGBD values tended to exceed FIA estimates in non-forest areas, likely due to FIA’s limited sampling of non-forest vegetation. The terraPulse AGBD estimates also exhibited lower values in high-biomass forests, likely due to saturation effects in its optical remote-sensing covariates. This study advances operational carbon monitoring by delivering a scalable framework for comprehensive AGBD validation using independent FIA data, as well as a benchmark validation of a new commercial dataset for global biomass monitoring.

[LG-78] GL-LowPopArt: A Nearly Instance-Wise Minimax Estimator for Generalized Low-Rank Trace Regression ICML2025

链接: https://arxiv.org/abs/2506.03074
作者: Junghyun Lee,Kyoungseok Jang,Kwang-Sung Jun,Milan Vojnović,Se-Young Yun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 53 pages, 2 figures, 3 tables; Accepted as a Spotlight Poster to the 42nd International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:We present GL-LowPopArt, a novel Catoni-style estimator for generalized low-rank trace regression. Building on LowPopArt (Jang et al., 2024), it employs a two-stage approach: nuclear norm regularization followed by matrix Catoni estimation. We establish state-of-the-art estimation error bounds, surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and reveal a novel experimental design objective, \mathrm{GL}(\pi) . The key technical challenge is controlling bias from the nonlinear inverse link function, which we address by our two-stage approach. We prove a local minimax lower bound, showing that our GL-LowPopArt enjoys instance-wise optimality up to the condition number of the ground-truth Hessian. Applications include generalized linear matrix completion, where GL-LowPopArt achieves a state-of-the-art Frobenius error guarantee, and bilinear dueling bandits, a novel setting inspired by general preference learning (Zhang et al., 2024). Our analysis of a GL-LowPopArt-based explore-then-commit algorithm reveals a new, potentially interesting problem-dependent quantity, along with improved Borda regret bound than vectorization (Wu et al., 2024).

[LG-79] Causal Explainability of Machine Learning in Heart Failure Prediction from Electronic Health Records

链接: https://arxiv.org/abs/2506.03068
作者: Yina Hou,Shourav B. Rabbani,Liang Hong,Norou Diawara,Manar D. Samad
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 4 figures

点击查看摘要

Abstract:The importance of clinical variables in the prognosis of the disease is explained using statistical correlation or machine learning (ML). However, the predictive importance of these variables may not represent their causal relationships with diseases. This paper uses clinical variables from a heart failure (HF) patient cohort to investigate the causal explainability of important variables obtained in statistical and ML contexts. Due to inherent regression modeling, popular causal discovery methods strictly assume that the cause and effect variables are numerical and continuous. This paper proposes a new computational framework to enable causal structure discovery (CSD) and score the causal strength of mixed-type (categorical, numerical, binary) clinical variables for binary disease outcomes. In HF classification, we investigate the association between the importance rank order of three feature types: correlated features, features important for ML predictions, and causal features. Our results demonstrate that CSD modeling for nonlinear causal relationships is more meaningful than its linear counterparts. Feature importance obtained from nonlinear classifiers (e.g., gradient-boosting trees) strongly correlates with the causal strength of variables without differentiating cause and effect variables. Correlated variables can be causal for HF, but they are rarely identified as effect variables. These results can be used to add the causal explanation of variables important for ML-based prediction modeling.

[LG-80] orsion in Persistent Homology and Neural Networks

链接: https://arxiv.org/abs/2506.03049
作者: Maria Walch
类目: Algebraic Topology (math.AT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We explore the role of torsion in hybrid deep learning models that incorporate topological data analysis, focusing on autoencoders. While most TDA tools use field coefficients, this conceals torsional features present in integer homology. We show that torsion can be lost during encoding, altered in the latent space, and in many cases, not reconstructed by standard decoders. Using both synthetic and high-dimensional data, we evaluate torsion sensitivity to perturbations and assess its recoverability across several autoencoder architectures. Our findings reveal key limitations of field-based approaches and underline the need for architectures or loss terms that preserve torsional information for robust data representation.

[LG-81] On the Benefits of Accelerated Optimization in Robust and Private Estimation

链接: https://arxiv.org/abs/2506.03044
作者: Laurentiu Andrei Marchis,Po-Ling Loh
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 91 pages, 8 figures

点击查看摘要

Abstract:We study the advantages of accelerated gradient methods, specifically based on the Frank-Wolfe method and projected gradient descent, for privacy and heavy-tailed robustness. Our approaches are as follows: For the Frank-Wolfe method, our technique is based on a tailored learning rate and a uniform lower bound on the gradient of the \ell_2 -norm over the constraint set. For accelerating projected gradient descent, we use the popular variant based on Nesterov’s momentum, and we optimize our objective over \mathbbR^p . These accelerations reduce iteration complexity, translating into stronger statistical guarantees for empirical and population risk minimization. Our analysis covers three settings: non-random data, random model-free data, and parametric models (linear regression and generalized linear models). Methodologically, we approach both privacy and robustness based on noisy gradients. We ensure differential privacy via the Gaussian mechanism and advanced composition, and we achieve heavy-tailed robustness using a geometric median-of-means estimator, which also sharpens the dependency on the dimension of the covariates. Finally, we compare our rates to existing bounds and identify scenarios where our methods attain optimal convergence.
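
The geometric median-of-means idea mentioned in the abstract is easy to illustrate. Below is a minimal sketch, assuming per-sample gradients are already available as rows of an array; the block count, the Weiszfeld iteration budget, and the heavy-tailed toy data are illustrative choices of mine, not the paper's configuration.

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm for the geometric median of row vectors."""
    z = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - z, axis=1), eps)  # avoid /0
        w = 1.0 / d
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            return z_new
        z = z_new
    return z

def mom_gradient(per_sample_grads, n_blocks=10):
    """Average gradients within blocks, then take the geometric median
    of the block means -- robust to heavy-tailed per-sample gradients."""
    blocks = np.array_split(per_sample_grads, n_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return geometric_median(block_means)

rng = np.random.default_rng(0)
grads = rng.standard_t(df=1.5, size=(1000, 5))  # heavy-tailed toy gradients
print(mom_gradient(grads, n_blocks=20))          # close to the zero vector
```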

[LG-82] Non-stationary Bandit Convex Optimization: A Comprehensive Study

链接: https://arxiv.org/abs/2506.02980
作者: Xiaoqi Liu,Dorian Baudry,Julian Zimmert,Patrick Rebeschini,Arya Akhavan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 32 pages, 1 figure

点击查看摘要

Abstract:Bandit Convex Optimization is a fundamental class of sequential decision-making problems, where the learner selects actions from a continuous domain and observes a loss (but not its gradient) at only one point per round. We study this problem in non-stationary environments, and aim to minimize the regret under three standard measures of non-stationarity: the number of switches S in the comparator sequence, the total variation \Delta of the loss functions, and the path-length P of the comparator sequence. We propose a polynomial-time algorithm, Tilted Exponentially Weighted Average with Sleeping Experts (TEWA-SE), which adapts the sleeping experts framework from online convex optimization to the bandit setting. For strongly convex losses, we prove that TEWA-SE is minimax-optimal with respect to known S and \Delta by establishing matching upper and lower bounds. By equipping TEWA-SE with the Bandit-over-Bandit framework, we extend our analysis to environments with unknown non-stationarity measures. For general convex losses, we introduce a second algorithm, clipped Exploration by Optimization (cExO), based on exponential weights over a discretized action space. While not polynomial-time computable, this method achieves minimax-optimal regret with respect to known S and \Delta , and improves on the best existing bounds with respect to P .

[LG-83] Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency INTERSPEECH2025

链接: https://arxiv.org/abs/2506.02908
作者: Bunlong Lay,Rostilav Makarov,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, Accepted to Interspeech 2025

点击查看摘要

Abstract:Diffusion models are a class of generative models that have been recently used for speech enhancement with remarkable success but are computationally expensive at inference time. Therefore, these models are impractical for processing streaming data in real-time. In this work, we adapt a sliding window diffusion framework to the speech enhancement task. Our approach progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This approach outputs denoised frames with a delay proportional to the chosen buffer size, enabling a trade-off between performance and latency. Empirical results demonstrate that our method outperforms standard diffusion models and runs efficiently on a GPU, achieving an input-output latency in the order of 0.3 to 1 seconds. This marks the first practical diffusion-based solution for online speech enhancement.

[LG-84] Simulation-Based Inference for Adaptive Experiments

链接: https://arxiv.org/abs/2506.02881
作者: Brian M Cho,Aurélien Bibaut,Nathan Kallus
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multi-arm bandit experimental designs are increasingly being adopted over standard randomized trials due to their potential to improve outcomes for study participants, enable faster identification of the best-performing options, and/or enhance the precision of estimating key parameters. Current approaches for inference after adaptive sampling either rely on asymptotic normality under restricted experiment designs or on underpowered martingale concentration inequalities that lead to weak power in practice. To bypass these limitations, we propose a simulation-based approach for conducting hypothesis tests and constructing confidence intervals for arm-specific means and their differences. Our simulation-based approach uses positively biased nuisances to generate additional trajectories of the experiment, which we call \textit{simulation with optimism}. Using these simulations, we characterize the distribution of the potentially non-normal sample mean test statistic to conduct inference. We provide guarantees for (i) asymptotic type I error control, (ii) convergence of our confidence intervals, and (iii) asymptotic strong consistency of our estimator over a wide variety of common bandit designs. Our empirical results show that our approach achieves the desired coverage while reducing confidence interval widths by up to 50%, with drastic improvements for arms not targeted by the design.
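
To make the "simulation with optimism" idea concrete, here is a heavily simplified sketch for a two-arm epsilon-greedy experiment: re-run the adaptive experiment many times under optimistically inflated plug-in arm means, then compare the observed statistic with the re-simulated spread. The sampling rule, the optimism bonus, and all constants are my own assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_bandit(means, n_rounds=500, eps=0.1):
    """Epsilon-greedy two-arm experiment; returns arm means and counts."""
    counts = np.zeros(2)
    sums = np.zeros(2)
    for _ in range(n_rounds):
        if counts.min() == 0 or rng.random() < eps:
            a = rng.integers(2)                  # explore
        else:
            a = int(np.argmax(sums / counts))    # exploit
        sums[a] += rng.normal(means[a], 1.0)
        counts[a] += 1
    return sums / counts, counts

# "Observed" experiment.
obs_means, obs_counts = run_bandit([0.0, 0.2])
obs_stat = obs_means[1] - obs_means[0]

# Re-simulate the experiment under optimistically inflated arm means
# and read off the spread of the re-simulated test statistic.
optimism = 1.0 / np.sqrt(obs_counts)             # simple optimism bonus
diffs = [run_bandit(obs_means + optimism)[0] for _ in range(1000)]
diffs = [m[1] - m[0] for m in diffs]
lo, hi = np.quantile(diffs, [0.025, 0.975])
print(f"observed diff={obs_stat:.3f}, simulated 95% band=({lo:.3f}, {hi:.3f})")
```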

[LG-85] Asymptotically perfect seeded graph matching without edge correlation (and applications to inference)

链接: https://arxiv.org/abs/2506.02825
作者: Tong Qi,Vera Andersson,Peter Viechnicki,Vince Lyzinski
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 figures, 35 pages

点击查看摘要

Abstract:We present the OmniMatch algorithm for seeded multiple graph matching. In the setting of d -dimensional Random Dot Product Graphs (RDPG), we prove that under mild assumptions, OmniMatch with s seeds asymptotically and efficiently perfectly aligns O(s^\alpha) unseeded vertices – for \alpha < 2 \wedge d/4 – across multiple networks even in the presence of no edge correlation. We demonstrate the effectiveness of our algorithm across numerous simulations and in the context of shuffled graph hypothesis testing. In the shuffled testing setting, testing power is lost due to the misalignment/shuffling of vertices across graphs, and we demonstrate the capacity of OmniMatch to correct for misaligned vertices prior to testing and hence recover the lost testing power. We further demonstrate the algorithm on a pair of data examples from connectomics and machine translation.

[LG-86] Doubly-Robust Estimation of Counterfactual Policy Mean Embeddings

链接: https://arxiv.org/abs/2506.02793
作者: Houssam Zenati,Bariscan Bozkurt,Arthur Gretton
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating the distribution of outcomes under counterfactual policies is critical for decision-making in domains such as recommendation, advertising, and healthcare. We analyze a novel framework, Counterfactual Policy Mean Embedding (CPME), that represents the entire counterfactual outcome distribution in a reproducing kernel Hilbert space (RKHS), enabling flexible and nonparametric distributional off-policy evaluation. We introduce both a plug-in estimator and a doubly robust estimator; the latter enjoys improved uniform convergence rates by correcting for bias in both the outcome embedding and propensity models. Building on this, we develop a doubly robust kernel test statistic for hypothesis testing, which achieves asymptotic normality and thus enables computationally efficient testing and straightforward construction of confidence intervals. Our framework also supports sampling from the counterfactual distribution. Numerical simulations illustrate the practical benefits of CPME over existing methods.
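
As a point of reference, the classical doubly robust off-policy estimator in the scalar-mean special case looks as follows; CPME lifts this construction to kernel mean embeddings of the full outcome distribution. The synthetic data, the logging/target policies, and the linear outcome model below are illustrative choices, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
logging_p = 1 / (1 + np.exp(-x))          # logging policy P(a=1|x)
a = (rng.random(n) < logging_p).astype(float)
y = 1.0 + 0.5 * x + 2.0 * a + rng.normal(size=n)

target_p = np.full(n, 0.8)                # target policy P(a=1|x)

# Outcome model mu(x, a), fit here by least squares on (1, x, a).
X = np.column_stack([np.ones(n), x, a])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
mu = lambda a_val: np.column_stack([np.ones(n), x, np.full(n, a_val)]) @ beta

# Direct-method term: E_{a ~ target}[mu(x, a)].
dm = target_p * mu(1.0) + (1 - target_p) * mu(0.0)
# Importance-weighted correction term.
prop = np.where(a == 1, logging_p, 1 - logging_p)
w = np.where(a == 1, target_p, 1 - target_p) / prop
mu_obs = np.where(a == 1, mu(1.0), mu(0.0))
dr_value = np.mean(dm + w * (y - mu_obs))
print(f"doubly robust estimate of target-policy mean outcome: {dr_value:.3f}")
# True value here is 1 + 2 * 0.8 = 2.6.
```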

[LG-87] Safely Learning Controlled Stochastic Dynamics NEURIPS2025

链接: https://arxiv.org/abs/2506.02754
作者: Luc Brogat-Motte,Alessandro Rudi,Riccardo Bonalli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under review at NeurIPS 2025

点击查看摘要

Abstract:We address the problem of safely learning controlled stochastic dynamics from discrete-time trajectory observations, ensuring system trajectories remain within predefined safe regions during both training and deployment. Safety-critical constraints of this kind are crucial in applications such as autonomous robotics, finance, and biomedicine. We introduce a method that ensures safe exploration and efficient estimation of system dynamics by iteratively expanding an initial known safe control set using kernel-based confidence bounds. After training, the learned model enables predictions of the system’s dynamics and permits safety verification of any given control. Our approach requires only mild smoothness assumptions and access to an initial safe control set, enabling broad applicability to complex real-world systems. We provide theoretical guarantees for safety and derive adaptive learning rates that improve with increasing Sobolev regularity of the true dynamics. Experimental evaluations demonstrate the practical effectiveness of our method in terms of safety, estimation accuracy, and computational efficiency.

[LG-88] Online Bayesian system identification in multivariate autoregressive models via message passing

链接: https://arxiv.org/abs/2506.02710
作者: T. N. Nisslbeck,Wouter M. Kouw
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 6 pages, 1 figure, conference: ECC2025

点击查看摘要

Abstract:We propose a recursive Bayesian estimation procedure for multivariate autoregressive models with exogenous inputs based on message passing in a factor graph. Unlike recursive least-squares, our method produces full posterior distributions for both the autoregressive coefficients and noise precision. The uncertainties regarding these estimates propagate into the uncertainties on predictions for future system outputs, and support online model evidence calculations. We demonstrate convergence empirically on a synthetic autoregressive system and competitive performance on a double mass-spring-damper system.
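
A minimal sketch of the conjugate recursive update for a scalar AR(2) model with known noise precision is shown below; the paper's message-passing scheme additionally infers the noise precision and handles the multivariate, exogenous-input case. The prior and the data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate a scalar AR(2): y_t = 0.6 y_{t-1} - 0.2 y_{t-2} + noise.
true_theta = np.array([0.6, -0.2])
y = np.zeros(300)
for t in range(2, len(y)):
    y[t] = true_theta @ np.array([y[t-1], y[t-2]]) + 0.1 * rng.normal()

tau = 1.0 / 0.1**2            # known noise precision (assumed here)
Lam = np.eye(2) / 10.0        # prior precision (prior covariance 10*I)
b = np.zeros(2)               # precision-weighted prior mean

for t in range(2, len(y)):    # one conjugate Bayesian update per sample
    phi = np.array([y[t-1], y[t-2]])
    Lam += tau * np.outer(phi, phi)
    b += tau * y[t] * phi

post_mean = np.linalg.solve(Lam, b)
post_cov = np.linalg.inv(Lam)
print("posterior mean:", post_mean)                 # close to [0.6, -0.2]
print("posterior std: ", np.sqrt(np.diag(post_cov)))
```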

[LG-89] Symmetry-Aware GFlowNets ICML2025

链接: https://arxiv.org/abs/2506.02685
作者: Hohyun Kim,Seunggeun Lee,Min-hwan Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages; Accepted at ICML 2025

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) offer a powerful framework for sampling graphs in proportion to their rewards. However, existing approaches suffer from systematic biases due to inaccuracies in state transition probability computations. These biases, rooted in the inherent symmetries of graphs, impact both atom-based and fragment-based generation schemes. To address this challenge, we introduce Symmetry-Aware GFlowNets (SA-GFN), a method that incorporates symmetry corrections into the learning process through reward scaling. By integrating bias correction directly into the reward structure, SA-GFN eliminates the need for explicit state transition computations. Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution.
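
One way to picture the symmetry correction is to scale a graph's reward by the size of its automorphism group; treating |Aut(G)| as the correction factor is my reading of the reward-scaling idea, not SA-GFN's exact rule. The sketch below counts automorphisms with networkx.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def automorphism_count(G):
    """Count self-isomorphisms of G (its automorphism group size)."""
    gm = isomorphism.GraphMatcher(G, G)
    return sum(1 for _ in gm.isomorphisms_iter())

def symmetry_corrected_reward(G, raw_reward):
    # Dividing by |Aut(G)| removes the multiplicity with which a
    # symmetric state can be reached (one common convention).
    return raw_reward / automorphism_count(G)

triangle = nx.cycle_graph(3)   # complete symmetry: |Aut| = 6
path = nx.path_graph(3)        # only the reflection: |Aut| = 2
for g, name in [(triangle, "triangle"), (path, "path")]:
    print(name, automorphism_count(g), symmetry_corrected_reward(g, 1.0))
```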

[LG-90] Computational Thresholds in Multi-Modal Learning via the Spiked Matrix-Tensor Model

链接: https://arxiv.org/abs/2506.02664
作者: Hugo Tabanelli,Pierre Mergny,Lenka Zdeborova,Florent Krzakala
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the recovery of multiple high-dimensional signals from two noisy, correlated modalities: a spiked matrix and a spiked tensor sharing a common low-rank structure. This setting generalizes classical spiked matrix and tensor models, unveiling intricate interactions between inference channels and surprising algorithmic behaviors. Notably, while the spiked tensor model is typically intractable at low signal-to-noise ratios, its correlation with the matrix enables efficient recovery via Bayesian Approximate Message Passing, inducing staircase-like phase transitions reminiscent of neural network phenomena. In contrast, empirical risk minimization for joint learning fails: the tensor component obstructs effective matrix recovery, and joint optimization significantly degrades performance, highlighting the limitations of naive multi-modal learning. We show that a simple Sequential Curriculum Learning strategy, first recovering the matrix and then leveraging it to guide tensor recovery, resolves this bottleneck and achieves optimal weak recovery thresholds. This strategy, implementable with spectral methods, emphasizes the critical role of structural correlation and learning order in multi-modal high-dimensional inference.

[LG-91] Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks

链接: https://arxiv.org/abs/2506.02651
作者: Luca Arnaboldi,Bruno Loureiro,Ludovic Stephan,Florent Krzakala,Lenka Zdeborova
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.

[LG-92] Tensor State Space-based Dynamic Multilayer Network Modeling

链接: https://arxiv.org/abs/2506.02413
作者: Tian Lan,Jie Guo,Chen Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the complex interactions within dynamic multilayer networks is critical for advancements in various scientific domains. Existing models often fail to capture such networks’ temporal and cross-layer dynamics. This paper introduces a novel Tensor State Space Model for Dynamic Multilayer Networks (TSSDMN), utilizing a latent space model framework. TSSDMN employs a symmetric Tucker decomposition to represent latent node features, their interaction patterns, and layer transitions. Then by fixing the latent features and allowing the interaction patterns to evolve over time, TSSDMN uniquely captures both the temporal dynamics within layers and across different layers. The model identifiability conditions are discussed. By treating latent features as variables whose posterior distributions are approximated using a mean-field variational inference approach, a variational Expectation Maximization algorithm is developed for efficient model inference. Numerical simulations and case studies demonstrate the efficacy of TSSDMN for understanding dynamic multilayer networks.
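
For intuition, here is a Tucker decomposition of a toy node x node x layer tensor, using tensorly's generic (non-symmetric) routine as a stand-in for TSSDMN's symmetric Tucker structure; the ranks and the random tensor are arbitrary illustrative choices.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

rng = np.random.default_rng(7)
# Toy "multilayer adjacency" tensor: 20 nodes x 20 nodes x 3 layers.
T = tl.tensor(rng.random((20, 20, 3)))

# Rank-(4, 4, 2) Tucker factorization: a small core plus one factor
# matrix per mode (node features, node features, layer features).
core, factors = tucker(T, rank=[4, 4, 2])
print(core.shape, [f.shape for f in factors])
# -> (4, 4, 2) [(20, 4), (20, 4), (3, 2)]
```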

[LG-93] Joint Modeling for Learning Decision-Making Dynamics in Behavioral Experiments

链接: https://arxiv.org/abs/2506.02394
作者: Yuan Bian,Xingche Guo,Yuanjia Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Major depressive disorder (MDD), a leading cause of disability and mortality, is associated with reward-processing abnormalities and concentration issues. Motivated by the probabilistic reward task from the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC) study, we propose a novel framework that integrates the reinforcement learning (RL) model and drift-diffusion model (DDM) to jointly analyze reward-based decision-making with response times. To account for emerging evidence suggesting that decision-making may alternate between multiple interleaved strategies, we model latent state switching using a hidden Markov model (HMM). In the ‘‘engaged’’ state, decisions follow an RL-DDM, simultaneously capturing reward processing, decision dynamics, and temporal structure. In contrast, in the ‘‘lapsed’’ state, decision-making is modeled using a simplified DDM, where specific parameters are fixed to approximate random guessing with equal probability. The proposed method is implemented using a computationally efficient generalized expectation-maximization algorithm with forward-backward procedures. Through extensive numerical studies, we demonstrate that our proposed method outperforms competing approaches under various reward-generating distributions, both with and without strategy switching. When applied to the EMBARC study, our framework reveals that MDD patients exhibit lower overall engagement than healthy controls and experience longer decision times when they do engage. Additionally, we show that neuroimaging measures of brain activities are associated with decision-making characteristics in the ‘‘engaged’’ state but not in the ‘‘lapsed’’ state, providing evidence of brain-behavioral association specific to the ‘‘engaged’’ state.

[LG-94] Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression

链接: https://arxiv.org/abs/2506.02336
作者: Jingfeng Wu,Pierre Marion,Peter Bartlett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study gradient descent (GD) with a constant stepsize for \ell_2 -regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in \widetilde{\mathcal{O}}(\kappa) steps with \kappa being the condition number. Surprisingly, we show that this can be accelerated to \widetilde{\mathcal{O}}(\sqrt{\kappa}) by simply using a large stepsize – for which the objective evolves nonmonotonically. The acceleration brought by large stepsizes extends to minimizing the population risk for separable distributions, improving on the best-known upper bounds on the number of steps to reach a near-optimum. Finally, we characterize the largest stepsize for the local convergence of GD, which also determines the global convergence in special scenarios. Our results extend the analysis of Wu et al. (2024) from convex settings with minimizers at infinity to strongly convex cases with finite minimizers.
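
The phenomenon is easy to reproduce numerically. The sketch below runs GD on l2-regularized logistic regression with separable synthetic data at a small and a large constant stepsize; the data, stepsizes, and iteration counts are illustrative, and the point is only that the large-stepsize run can decrease the loss nonmonotonically yet still reach a low loss quickly, as the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 200, 5, 1e-3
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                          # linearly separable labels

def loss(w):
    # log(1 + exp(-y * Xw)) computed overflow-safely, plus l2 penalty.
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * lam * w @ w

def grad(w):
    m = y * (X @ w)
    s = -y * 0.5 * (1.0 - np.tanh(m / 2.0))      # = -y * sigmoid(-m)
    return X.T @ s / n + lam * w

for eta in (0.5, 50.0):                          # small vs large stepsize
    w = np.zeros(d)
    losses = [loss(w)]
    for _ in range(300):
        w = w - eta * grad(w)
        losses.append(loss(w))
    monotone = all(a >= b for a, b in zip(losses, losses[1:]))
    print(f"eta={eta:>4}: final loss={losses[-1]:.4f}, monotone descent={monotone}")
```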

[LG-95] MoCA: Multi-modal Cross-masked Autoencoder for Digital Health Measurements

链接: https://arxiv.org/abs/2506.02260
作者: Howon Ryu,Yuliang Chen,Yacun Wang,Andrea Z. LaCroix,Chongzhi Di,Loki Natarajan,Yu Wang,Jingjing Zou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The growing prevalence of digital health technologies has led to the generation of complex multi-modal data, such as physical activity measurements simultaneously collected from various sensors of mobile and wearable devices. These data hold immense potential for advancing health studies, but current methods predominantly rely on supervised learning, requiring extensive labeled datasets that are often expensive or impractical to obtain, especially in clinical studies. To address this limitation, we propose a self-supervised learning framework called Multi-modal Cross-masked Autoencoder (MoCA) that leverages cross-modality masking and the Transformer autoencoder architecture to utilize both temporal correlations within modalities and cross-modal correlations between data streams. We also provide theoretical guarantees to support the effectiveness of the cross-modality masking scheme in MoCA. Comprehensive experiments and ablation studies demonstrate that our method outperforms existing approaches in both reconstruction and downstream tasks. We release open-source code for data processing, pre-training, and downstream tasks in the supplementary materials. This work highlights the transformative potential of self-supervised learning in digital health and multi-modal data.
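
A minimal sketch of cross-modality masking is given below: mask random time steps in one stream and (here) the complementary steps in the other, then train a model to reconstruct the masked parts from the unmasked ones. The complementary-masking rule, the shapes, and the mask ratio are my own simplifications; MoCA's actual masking scheme and Transformer autoencoder are not reproduced.

```python
import torch

def cross_mask(x_a, x_b, ratio=0.5):
    """x_a, x_b: (batch, time, feat). Mask a random set of time steps
    in x_a and the complementary set in x_b."""
    B, T, _ = x_a.shape
    mask_a = torch.rand(B, T) < ratio              # True = masked
    mask_b = ~mask_a                                # complementary masking
    xa = x_a.masked_fill(mask_a.unsqueeze(-1), 0.0)
    xb = x_b.masked_fill(mask_b.unsqueeze(-1), 0.0)
    return xa, xb, mask_a, mask_b

acc = torch.randn(4, 100, 3)   # e.g., tri-axial accelerometer stream
hr = torch.randn(4, 100, 1)    # e.g., heart-rate stream
xa, xb, ma, mb = cross_mask(acc, hr)
print(xa.shape, xb.shape, ma.float().mean().item())  # ~0.5 of steps masked
```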

[LG-96] Assumption-free stability for ranking problems

链接: https://arxiv.org/abs/2506.02257
作者: Ruiting Liang,Jake A. Soloff,Rina Foygel Barber,Rebecca Willett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-k items among a larger list of candidates or obtaining the full ranking of all items in the set. These problems are often unstable, in the sense that estimating a ranking from noisy data can exhibit high sensitivity to small perturbations. Concretely, if we use data to provide a score for each item (say, by aggregating preference data over a sample of users), then for two items with similar scores, small fluctuations in the data can alter the relative ranking of those items. Many existing theoretical results for ranking problems assume a separation condition to avoid this challenge, but real-world data often contains items whose scores are approximately tied, limiting the applicability of existing theory. To address this gap, we develop a new algorithmic stability framework for ranking problems, and propose two novel ranking operators for achieving stable ranking: the \emph{inflated top-k} for the top-k selection problem and the \emph{inflated full ranking} for ranking the full list. To enable stability, each method allows for expressing some uncertainty in the output. For both of these two problems, our proposed methods provide guaranteed stability, with no assumptions on data distributions and no dependence on the total number of candidates to be ranked. Experiments on real-world data confirm that the proposed methods offer stability without compromising the informativeness of the output.
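
A minimal sketch of an inflated top-k style selection rule: return every item whose score is within a tolerance of the k-th largest score, so near-ties cannot flip membership under small perturbations. The fixed tolerance eps is my own simplification; the paper's operators come with explicit stability guarantees rather than a hand-picked threshold.

```python
import numpy as np

def inflated_top_k(scores, k, eps):
    """Return indices of all items scoring within eps of the k-th best."""
    order = np.argsort(scores)[::-1]
    kth = scores[order[k - 1]]
    return [int(i) for i in order if scores[i] >= kth - eps]

scores = np.array([0.90, 0.71, 0.70, 0.40])
print(inflated_top_k(scores, k=2, eps=0.05))
# -> [0, 1, 2]: item 2 is near-tied with item 1, so both are kept.
```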

[LG-97] Enabling Probabilistic Learning on Manifolds through Double Diffusion Maps

链接: https://arxiv.org/abs/2506.02254
作者: Dimitris G Giovanis,Nikolaos Evangelou,Ioannis G Kevrekidis,Roger G Ghanem
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We present a generative learning framework for probabilistic sampling based on an extension of the Probabilistic Learning on Manifolds (PLoM) approach, which is designed to generate statistically consistent realizations of a random vector in a finite-dimensional Euclidean space, informed by a limited (yet representative) set of observations. In its original form, PLoM constructs a reduced-order probabilistic model by combining three main components: (a) kernel density estimation to approximate the underlying probability measure, (b) Diffusion Maps to uncover the intrinsic low-dimensional manifold structure, and (c) a reduced-order Ito Stochastic Differential Equation (ISDE) to sample from the learned distribution. A key challenge arises, however, when the number of available data points N is small and the dimensionality of the diffusion-map basis approaches N, resulting in overfitting and loss of generalization. To overcome this limitation, we propose an enabling extension that implements a synthesis of Double Diffusion Maps – a technique capable of capturing multiscale geometric features of the data – with Geometric Harmonics (GH), a nonparametric reconstruction method that allows smooth nonlinear interpolation in high-dimensional ambient spaces. This approach enables us to solve a full-order ISDE directly in the latent space, preserving the full dynamical complexity of the system, while leveraging its reduced geometric representation. The effectiveness and robustness of the proposed method are illustrated through two numerical studies: one based on data generated from two-dimensional Hermite polynomial functions and another based on high-fidelity simulations of a detonation wave in a reactive flow.
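
For context, plain Diffusion Maps, one ingredient of the PLoM pipeline, can be sketched in a few lines; the bandwidth and embedding dimension below are arbitrary, and Double Diffusion Maps / Geometric Harmonics are not reproduced here.

```python
import numpy as np

def diffusion_maps(X, n_components=2, epsilon=1.0):
    """Basic Diffusion Maps embedding of rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    K = np.exp(-d2 / epsilon)                             # Gaussian kernel
    P = K / K.sum(axis=1, keepdims=True)                  # Markov matrix
    vals, vecs = np.linalg.eig(P)
    idx = np.argsort(-vals.real)[1:n_components + 1]      # skip trivial eigvec
    return vecs.real[:, idx] * vals.real[idx]             # diffusion coords

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
circle += 0.01 * rng.normal(size=circle.shape)            # noisy 1D manifold
print(diffusion_maps(circle).shape)                       # (200, 2)
```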

[LG-98] Tomographic Foundation Model – FORCE: Flow-Oriented Reconstruction Conditioning Engine

链接: https://arxiv.org/abs/2506.02149
作者: Wenjun Xia,Chuang Niu,Ge Wang
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Computed tomography (CT) is a major medical imaging modality. Clinical CT scenarios, such as low-dose screening, sparse-view scanning, and metal implants, often lead to severe noise and artifacts in reconstructed images, requiring improved reconstruction techniques. The introduction of deep learning has significantly advanced CT image reconstruction. However, obtaining paired training data remains rather challenging due to patient motion and other constraints. Although deep learning methods can still perform well with approximately paired data, they inherently carry the risk of hallucination due to data inconsistencies and model instability. In this paper, we integrate data fidelity with a state-of-the-art generative AI model, the Poisson flow generative model (PFGM) and its generalized version PFGM++, and propose a novel CT framework: Flow-Oriented Reconstruction Conditioning Engine (FORCE). In our experiments, the proposed method shows superior performance in various CT imaging tasks, outperforming existing unsupervised reconstruction approaches.

[LG-99] A meaningful prediction of functional decline in amyotrophic lateral sclerosis based on multi-event survival analysis

链接: https://arxiv.org/abs/2506.02076
作者: Christian Marius Lillelund,Sanjay Kalra,Russell Greiner
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Amyotrophic lateral sclerosis (ALS) is a degenerative disorder of motor neurons that causes progressive paralysis in patients. Current treatment options aim to prolong survival and improve quality of life; however, due to the heterogeneity of the disease, it is often difficult to determine the optimal time for potential therapies or medical interventions. In this study, we propose a novel method to predict the time until a patient with ALS experiences significant functional impairment (ALSFRS-R \leq 2) with respect to five common functions: speaking, swallowing, handwriting, walking and breathing. We formulate this task as a multi-event survival problem and validate our approach in the PRO-ACT dataset by training five covariate-based survival models to estimate the probability of an event over a 500-day period after a baseline visit. We then predict five event-specific individual survival distributions (ISDs) for each patient, each providing an interpretable and meaningful estimate of when that event will likely take place in the future. The results show that covariate-based models are superior to the Kaplan-Meier estimator at predicting time-to-event outcomes. Additionally, our method enables practitioners to make individual counterfactual predictions, where certain features (covariates) can be changed to see their effect on the predicted outcome. In this regard, we find that Riluzole has little to no impact on predicted functional decline. However, for patients with bulbar-onset ALS, our method predicts considerably shorter counterfactual time-to-event estimates for tasks related to speech and swallowing compared to limb-onset ALS. The proposed method can be applied to current clinical examination data to assess the risk of functional decline and thus allow more personalized treatment planning.

[LG-100] Stop Chasing the C-index: This Is How We Should Evaluate Our Survival Models

链接: https://arxiv.org/abs/2506.02075
作者: Christian Marius Lillelund,Shi-ang Qi,Russell Greiner,Christian Fischer Pedersen
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We argue that many survival analysis and time-to-event models are incorrectly evaluated. First, we survey many examples of evaluation approaches in the literature and find that most rely on concordance (C-index). However, the C-index only measures a model’s discriminative ability and does not assess other important aspects, such as the accuracy of the time-to-event predictions or the calibration of the model’s probabilistic estimates. Next, we present a set of key desiderata for choosing the right evaluation metric and discuss their pros and cons. These are tailored to the challenges in survival analysis, such as sensitivity to miscalibration and various censoring assumptions. We hypothesize that the current development of survival metrics conforms to a double-helix ladder, and that model validity and metric validity must stand on the same rung of the assumption ladder. Finally, we discuss the appropriate methods for evaluating a survival model in practice and summarize various viewpoints opposing our analysis.
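
The core complaint is easy to demonstrate: a model whose predicted times are in the right order but on the wrong scale earns a perfect C-index while its time-to-event error is terrible. A minimal sketch using lifelines' concordance_index, with toy numbers of my own:

```python
import numpy as np
from lifelines.utils import concordance_index

true_times = np.array([5.0, 10.0, 15.0, 20.0])
events = np.array([1, 1, 1, 1])                       # all events observed
pred_times = np.array([50.0, 100.0, 150.0, 200.0])    # right order, 10x scale

cindex = concordance_index(true_times, pred_times, events)
mae = np.mean(np.abs(true_times - pred_times))        # error on uncensored
print(f"C-index={cindex:.2f} (perfect), MAE={mae:.1f} (terrible)")
```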

[LG-101] Enhancing Interpretability of Quantum-Assisted Blockchain Clustering via AI Agent-Based Qualitative Analysis

链接: https://arxiv.org/abs/2506.02068
作者: Yun-Cheng Tsai,Yen-Ku Liu,Samuel Yen-Chi Chen
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Blockchain transaction data is inherently high-dimensional, noisy, and entangled, posing substantial challenges for traditional clustering algorithms. While quantum-enhanced clustering models have demonstrated promising performance gains, their interpretability remains limited, restricting their application in sensitive domains such as financial fraud detection and blockchain governance. To address this gap, we propose a two-stage analysis framework that synergistically combines quantitative clustering evaluation with AI-Agent-assisted qualitative interpretation. In the first stage, we employ classical clustering methods and evaluation metrics including the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index to determine the optimal cluster count and baseline partition quality. In the second stage, we integrate an AI Agent to generate human-readable, semantic explanations of clustering results, identifying intra-cluster characteristics and inter-cluster relationships. Our experiments reveal that while fully trained Quantum Neural Networks (QNN) outperform random Quantum Features (QF) in quantitative metrics, the AI Agent further uncovers nuanced differences between these methods, notably exposing the singleton-cluster phenomenon in QNN-driven models. The consolidated insights from both stages consistently endorse the three-cluster configuration, demonstrating the practical value of our hybrid approach. This work advances the interpretability frontier in quantum-assisted blockchain analytics and lays the groundwork for future autonomous, AI-orchestrated clustering frameworks.
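
The first, quantitative stage maps directly onto standard tooling: all three named indices are available in scikit-learn. A minimal sketch on synthetic data (KMeans and the blob data are illustrative stand-ins for the blockchain transaction features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),          # higher is better
          round(davies_bouldin_score(X, labels), 3),      # lower is better
          round(calinski_harabasz_score(X, labels), 1))   # higher is better
```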

[LG-102] A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning for Any Atlas and Disorder

链接: https://arxiv.org/abs/2506.02044
作者: Xinxu Wei,Kanhao Zhao,Yong Jiao,Lifang He,Yu Zhang
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 34 pages

点击查看摘要

Abstract:As large language models (LLMs) continue to revolutionize AI research, there is a growing interest in building large-scale brain foundation models to advance neuroscience. While most existing brain foundation models are pre-trained on time-series signals or region-of-interest (ROI) features, we propose a novel graph-based pre-training paradigm for constructing a brain graph foundation model. In this paper, we introduce the Brain Graph Foundation Model, termed BrainGFM, a unified framework that leverages graph contrastive learning and graph masked autoencoders for large-scale fMRI-based pre-training. BrainGFM is pre-trained on a diverse mixture of brain atlases with varying parcellations, significantly expanding the pre-training corpus and enhancing the model’s ability to generalize across heterogeneous fMRI-derived brain representations. To support efficient and versatile downstream transfer, we integrate both graph prompts and language prompts into the model design, enabling BrainGFM to flexibly adapt to a wide range of atlases, neurological and psychiatric disorders, and task settings. Furthermore, we employ meta-learning to optimize the graph prompts, facilitating strong generalization to previously unseen disorders under both few-shot and zero-shot learning conditions via language-guided prompting. BrainGFM is pre-trained on 27 neuroimaging datasets spanning 25 common neurological and psychiatric disorders, encompassing 2 types of brain atlases (functional and anatomical) across 8 widely-used parcellations, and covering over 25,000 subjects, 60,000 fMRI scans, and a total of 400,000 graph samples aggregated across all atlases and parcellations. The code is available at: this https URL

信息检索

[IR-0] MMM4Rec: A Transfer-Efficient Framework for Multi-modal Sequential Recommendation

链接: https://arxiv.org/abs/2506.02916
作者: Hao Fan,Yanrong Hu,Kai Fang,Qingyang Liu,Hongjiu Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sequential Recommendation (SR) systems model user preferences by analyzing interaction histories. Although transferable multi-modal SR architectures demonstrate superior performance compared to traditional ID-based approaches, current methods incur substantial fine-tuning costs when adapting to new domains due to complex optimization requirements and negative transfer effects - a significant deployment bottleneck that hinders engineers from efficiently repurposing pre-trained models for novel application scenarios with minimal tuning overhead. We propose MMM4Rec (Multi-Modal Mamba for Sequential Recommendation), a novel multi-modal SR framework that incorporates a dedicated algebraic constraint mechanism for efficient transfer learning. By combining State Space Duality (SSD)'s temporal decay properties with a time-aware modeling design, our model dynamically prioritizes key modality information, overcoming limitations of Transformer-based approaches. The framework implements a constrained two-stage process: (1) sequence-level cross-modal alignment via shared projection matrices, followed by (2) temporal fusion using our newly designed Cross-SSD module and dual-channel Fourier adaptive filtering. This architecture maintains semantic consistency while suppressing noise propagation. MMM4Rec achieves rapid fine-tuning convergence with simple cross-entropy loss, significantly improving multi-modal recommendation accuracy while maintaining strong transferability. Extensive experiments demonstrate MMM4Rec’s state-of-the-art performance, achieving the maximum 31.78% NDCG@10 improvement over existing models and exhibiting 10 times faster average convergence speed when transferring to large-scale downstream datasets.

[IR-1] Combining social relations and interaction data in Recommender System with Graph Convolution Collaborative Filtering

链接: https://arxiv.org/abs/2506.02834
作者: Tin T. Tran,Vaclav Snasel,Loc Tan Nguyen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:A recommender system is an important subject in the field of data mining, where item-rating information from users is exploited and processed to make suitable recommendations for all other users. The recommender system creates convenience for e-commerce users and stimulates the consumption of items that suit them. Beyond e-commerce, recommender systems are also used to suggest books to read, movies to watch, courses to take, or websites to visit. Similarity between users is an important factor for recommendation; it can be calculated from past user ratings of items by collaborative filtering, matrix factorization, or singular value decomposition. With the development of graph data mining techniques, the relationships between users and items can be represented by matrices, from which collaborative filtering can be performed on larger databases, more accurately and faster. All these data can be represented as graphs and mined by today’s highly developed graph neural network models. On the other hand, users’ social friendship data also influence consumption habits, because recommendations from friends are considered more carefully than other information sources. However, combining the influence of a user’s friends with the similarity between users who share shopping habits is challenging, because the information is noisy and affects each particular data set in different ways. In this study, we present an input-data processing method that removes outliers - items with a single review or users with little interaction with the items - and we propose a model that combines social-relationship data with similarity in users’ rating histories to improve the accuracy and recall of the recommender system.
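
A minimal sketch of the outlier-removal preprocessing described above, with illustrative thresholds (the paper's exact cutoffs are not specified here):

```python
import pandas as pd

ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3, 4],
    "item": ["a", "b", "c", "a", "b", "a", "d"],
    "rating": [5, 4, 3, 4, 2, 5, 1],
})

min_user_interactions, min_item_reviews = 2, 2
# Drop users with too few interactions, then items with too few reviews.
ratings = ratings[ratings.groupby("user")["item"]
                  .transform("count") >= min_user_interactions]
ratings = ratings[ratings.groupby("item")["user"]
                  .transform("count") >= min_item_reviews]
print(ratings)   # users 3 and 4 and single-review item "c" are removed
```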

[IR-2] UTCS: Effective Unsupervised Temporal Community Search with Pre-training of Temporal Dynamics and Subgraph Knowledge SIGIR’25

链接: https://arxiv.org/abs/2506.02784
作者: Yue Zhang,Yankai Chen,Yingli Zhou,Yucan Guo,Xiaolin Han,Chenhao Ma
类目: Information Retrieval (cs.IR)
*备注: Accepted by SIGIR’25 short paper track

点击查看摘要

Abstract:In many real-world applications, the evolving relationships between entities can be modeled as temporal graphs, where each edge has a timestamp representing the interaction time. As a fundamental problem in graph analysis, community search (CS) in temporal graphs has received growing attention but exhibits two major limitations: (1) Traditional methods typically require predefined subgraph structures, which are not always known in advance. (2) Learning-based methods struggle to capture temporal interaction information. To fill this research gap, in this paper, we propose an effective Unsupervised Temporal Community Search with pre-training of temporal dynamics and subgraph knowledge model (UTCS). UTCS contains two key stages: offline pre-training and online search. In the first stage, we introduce multiple learning objectives to facilitate the pre-training process in the unsupervised learning setting. In the second stage, we identify a candidate subgraph and compute community scores using the pre-trained node representations and a novel scoring mechanism to determine the final community members. Experiments on five real-world datasets demonstrate the effectiveness of UTCS.

[IR-3] Learning Binarized Representations with Pseudo-positive Sample Enhancement for Efficient Graph Collaborative Filtering

链接: https://arxiv.org/abs/2506.02750
作者: Yankai Chen,Yue Que,Xinni Zhang,Chen Ma,Irwin King
类目: Information Retrieval (cs.IR)
*备注: Accepted by TOIS

点击查看摘要

Abstract:Learning vectorized embeddings is fundamental to many recommender systems for user-item matching. To enable efficient online inference, representation binarization, which embeds latent features into compact binary sequences, has recently shown significant promise in optimizing both memory usage and computational overhead. However, existing approaches primarily focus on numerical quantization, neglecting the associated information loss, which often results in noticeable performance degradation. To address these issues, we study the problem of graph representation binarization for efficient collaborative filtering. Our findings indicate that explicitly mitigating information loss at various stages of embedding binarization has a significant positive impact on performance. Building on these insights, we propose an enhanced framework, BiGeaR++, which specifically leverages supervisory signals from pseudo-positive samples, incorporating both real item data and latent embedding samples. Compared to its predecessor BiGeaR, BiGeaR++ introduces a fine-grained inference distillation mechanism and an effective embedding sample synthesis approach. Empirical evaluations across five real-world datasets demonstrate that the new designs in BiGeaR++ work seamlessly with other modules, delivering substantial improvements of around 1%-10% over BiGeaR and thus achieving state-of-the-art performance compared to the competing methods. Our implementation is available at this https URL.
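
For background, the basic sign-plus-scale binarization step that this line of work refines can be sketched as follows; for fixed signs, a per-vector scale equal to the mean absolute value minimizes the l2 reconstruction error. BiGeaR++'s distillation and pseudo-positive sampling machinery is not reproduced here.

```python
import numpy as np

def binarize(emb):
    """Quantize a real-valued embedding to {-1,+1} signs plus a scale."""
    signs = np.where(emb >= 0, 1.0, -1.0)
    alpha = np.abs(emb).mean()   # optimal l2 scale for the fixed signs
    return signs, alpha

rng = np.random.default_rng(6)
user = rng.normal(size=64)
item = rng.normal(size=64)
su, au = binarize(user)
si, ai = binarize(item)

# Matching score: the sign inner product can use cheap bitwise ops in a
# real deployment; plain float arithmetic is used here for clarity.
approx = au * ai * np.dot(su, si)
exact = np.dot(user, item)
print(f"exact={exact:.3f}, binarized approx={approx:.3f}")
```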

[IR-4] NextQuill: Causal Preference Modeling for Enhancing LLM Personalization

链接: https://arxiv.org/abs/2506.02368
作者: Xiaoyan Zhao,Juntao You,Yang Zhang,Wenjie Wang,Hong Cheng,Fuli Feng,See-Kiong Ng,Tat-Seng Chua
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Personalizing large language models (LLMs) for individual users has become increasingly important as they are progressively integrated into real-world applications to support users’ daily lives. However, existing personalization approaches often fail to distinguish which components of model predictions and training data truly reflect user preferences, leading to superficial personalization alignment. In this paper, we introduce NextQuill, a novel LLM personalization alignment framework grounded in causal preference modeling. We approach personalization from a causal perspective, treating both model predictions and ground-truth data generation as outcomes influenced by user preferences, along with other factors. We define the true preference effect as the causal impact of user history (which reflects preferences) on each token prediction or data generation instance, estimated through causal intervention techniques. Building on this insight, NextQuill introduces two complementary alignment strategies: (1) aligning model-internal causal preference effects on predictions with those reflected in ground-truth data, rather than indiscriminately fitting predictions, and (2) focusing on fitting preference-bearing tokens identified via ground-truth data preference effects, rather than treating all tokens uniformly. By integrating these strategies, NextQuill shifts the alignment process toward learning from causal preference effects, facilitating more effective and personalized adaptation. Experiments across multiple personalization benchmarks demonstrate that NextQuill significantly improves personalization quality, offering a principled, causal foundation for LLM personalization. Our codes are available on this https URL.

附件下载

点击下载今日全部论文列表