本篇博文主要内容为 2025-06-03 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-06-03)

今日共更新1230篇论文,其中:

  • 自然语言处理308篇(Computation and Language (cs.CL))
  • 人工智能411篇(Artificial Intelligence (cs.AI))
  • 计算机视觉224篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习405篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Dual-Process Image Generation

【速读】: 该论文试图解决传统图像生成控制方法在学习新任务方面能力有限的问题,而视觉语言模型(Vision-Language Models, VLMs)能够通过上下文学习并生成正确的输出。其解决方案的关键在于提出一种双过程蒸馏方案,该方案利用VLM对生成的图像进行评分,并将此梯度反向传播以更新图像生成器的权重,从而使得前馈图像生成器能够从VLM中学习新任务。

链接: https://arxiv.org/abs/2506.01955
作者: Grace Luo,Jonathan Granskog,Aleksander Holynski,Trevor Darrell
机构: UC Berkeley(加州大学伯克利分校); Runway(运行线)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: this https URL.
zh

[NLP-1] DRAG : Distilling RAG for SLMs from LLM s to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillation ACL2025

【速读】: 该论文旨在解决大规模检索增强生成(Retrieval-Augmented Generation, RAG)系统在计算资源消耗大且容易产生幻觉内容的问题。其解决方案的关键在于提出一种名为\texttt{DRAG}的框架,通过基于证据和知识图谱的蒸馏方法,将大规模语言模型(Large Language Models, LLMs)中的RAG知识有效地压缩到小型语言模型(Small Language Models, SLMs)中,从而在保持关键事实知识的同时显著降低模型规模和计算成本,并有效缓解幻觉现象,提升事实准确性。

链接: https://arxiv.org/abs/2506.01954
作者: Jennifer Chen,Aidar Myrzakhan,Yaxin Luo,Hassaan Muhammad Khan,Sondos Mahmoud Bsharat,Zhiqiang Shen
机构: VILA Lab, Mohamed bin Zayed University of AI (VILA 实验室,穆罕默德·本·扎耶德人工智能大学); McGill University (麦吉尔大学); National University of Science and Technology (国家科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2025 Main. Code is available at this https URL

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content from Humans. In this work, we introduce \textttDRAG , a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model’s predictions with a structured knowledge graph and ranked evidence, \textttDRAG effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With \textttDRAG , we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.
zh

[NLP-2] WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

【速读】: 该论文试图解决当前基于大语言模型(Large Language Model, LLM)的网络浏览代理在处理复杂、繁琐任务时能力不足的问题,即是否能够超越一般的网络浏览任务, robustly 处理人类通常不愿亲自完成的劳动密集型任务。解决方案的关键在于提出 WebChoreArena,这是一个全新的可完全复现的基准测试集,包含532个精心设计的任务,旨在扩展 WebArena 的适用范围至更繁重和繁琐的任务领域,并系统性地整合了三大核心挑战:大规模记忆任务、计算任务以及长期记忆任务。通过构建在 WebArena 模拟环境之上,WebChoreArena 确保了严格的可复现性,并支持与现有 WebArena 基准的公平直接比较,从而为评估智能体的进步提供了关键洞察。

链接: https://arxiv.org/abs/2506.01952
作者: Atsuyuki Miyai,Zaiying Zhao,Kazuki Egashira,Atsuki Sato,Tatsumi Sunada,Shota Onohara,Hiromasa Yamanishi,Mashiro Toyooka,Kunato Nishina,Ryoma Maeda,Kiyoharu Aizawa,Toshihiko Yamasaki
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.
zh

[NLP-3] Self-ensemble: Mitigating Confidence Distortion for Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多选题问答(MCQA)任务中出现的置信度失真问题,特别是在选项数量增加时,模型对正确预测表现出过度不自信而对错误预测则过度自信,从而导致性能显著下降。解决方案的关键在于提出一种名为Self-ensemble的方法,该方法通过将选项分组并整合不同组的LLM预测结果来做出最终决策,其核心优势在于无需依赖标注数据进行参数调优,即可通过设计的注意力掩码和位置编码嵌入现有LLM架构中。

链接: https://arxiv.org/abs/2506.01951
作者: Zicheng Xu,Guanchu Wang,Guangyao Zheng,Yu-Neng Chuang,Alexander Szalay,Xia Hu,Vladimir Braverman
机构: Rice University; University of North Carolina at Charlotte; Johns Hopkins University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to a substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature, where it can be integrated into existing LLM architecture based on a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.
zh

[NLP-4] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

【速读】: 该论文旨在解决如何提升大型语言模型(Large Language Models, LLMs)推理能力的问题,特别是通过强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法进行优化。其解决方案的关键在于从token熵模式的角度出发,识别并优化高熵token(即分叉token),这些token在推理过程中起到关键的路径引导作用。研究发现,仅对高熵token进行策略梯度更新即可显著提升模型性能,甚至在仅使用20%的token时仍能保持与全梯度更新相当的性能,且在多个模型上表现更优,表明RLVR的有效性主要来源于对决定推理方向的高熵token的优化。

链接: https://arxiv.org/abs/2506.01939
作者: Shenzhi Wang,Le Yu,Chang Gao,Chujie Zheng,Shixuan Liu,Rui Lu,Kai Dang,Xionghui Chen,Jianxin Yang,Zhenru Zhang,Yuqiong Liu,An Yang,Andrew Zhao,Yang Yue,Shiji Song,Bowen Yu,Gao Huang,Junyang Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 17 figures, 2 tables

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model’s entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME’25 and +7.71 on AIME’24) and Qwen3-14B (+4.79 on AIME’25 and +5.21 on AIME’24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
zh

[NLP-5] Novel Benchmark for NER in the Wastewater and Stormwater Domain

【速读】: 该论文试图解决城市可持续性和环境保护中废水和雨水管理报告及法规中结构化知识提取的问题,其核心挑战在于领域特定术语和多语言环境带来的困难。解决方案的关键在于构建一个法语-意大利语的领域特定文本语料库,并评估先进的命名实体识别(Named Entity Recognition, NER)方法,包括基于大语言模型(LLM)的方法,以提供未来策略的可靠基准,并探索语料库向新语言扩展时的自动化标注投影。

链接: https://arxiv.org/abs/2506.01938
作者: Franco Alberto Cardillo,Franca Debole,Francesca Frontini,Mitra Aelami,Nanée Chahinian,Serge Conrad
机构: CNR-ILC(国家研究委员会-语言学研究所); CNR-ISTI(国家研究委员会-信息科学与技术研究所); HSM Univ. Montpellier(蒙彼利埃高等师范学校); IRD(法国国际农业研究中心); CNRS(法国国家科学研究中心); Inria(法国国家信息与自动化研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective wastewater and stormwater management is essential for urban sustainability and environmental protection. Extracting structured knowledge from reports and regulations is challenging due to domainspecific terminology and multilingual contexts. This work focuses on domain-specific Named Entity Recognition (NER) as a first step towards effective relation and information extraction to support decision making. A multilingual benchmark is crucial for evaluating these methods. This study develops a French-Italian domain-specific text corpus for wastewater management. It evaluates state-of-the-art NER methods, including LLM-based approaches, to provide a reliable baseline for future strategies and explores automated annotation projection in view of an extension of the corpus to new languages.
zh

[NLP-6] RewardBench 2: Advancing Reward Model Evaluation

【速读】: 该论文试图解决奖励模型在下游任务中表现与评估方法之间不匹配的问题,即尽管评估方法不断进步,但奖励模型在实际应用中的效果并未同步提升。解决方案的关键在于引入RewardBench 2,这是一个新的多技能奖励建模基准,旨在提供更具挑战性的数据以更准确地评估奖励模型,其性能与下游任务表现高度相关,并通过使用新的用户提示而非现有下游评估中的提示,推动更严格的评估实践。

链接: https://arxiv.org/abs/2506.01937
作者: Saumya Malik,Valentina Pyatkin,Sander Land,Jacob Morrison,Noah A. Smith,Hannaneh Hajishirzi,Nathan Lambert
机构: Allen Institute for Artificial Intelligence (艾伦人工智能研究所); University of Washington (华盛顿大学); Cohere (Cohere)
类目: Computation and Language (cs.CL)
备注: Data, models, and leaderboard available at this https URL

点击查看摘要

Abstract:Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks – simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation – models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench – while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
zh

[NLP-7] Esoteric Language Models

【速读】: 该论文旨在解决基于扩散的模型(Diffusion-based language models)在困惑度(perplexity)上仍落后于自回归(Autoregressive, AR)模型,以及缺乏关键的推理效率特性(如KV缓存)的问题。其解决方案的关键在于提出Eso-LMs,这是一种融合AR与掩码扩散模型(Masked Diffusion Models, MDMs)范式的新型模型,能够在保持并行生成能力的同时首次引入KV缓存,从而显著提升推理效率,并实现两种模型在困惑度上的平滑插值。

链接: https://arxiv.org/abs/2506.01928
作者: Subham Sekhar Sahoo,Zhihan Yang,Yash Akhauri,Johnna Liu,Deepansha Singh,Zhoujun Cheng,Zhengzhong Liu,Eric Xing,John Thickstun,Arash Vahdat
机构: Cornell Tech(康奈尔技术学院); Cornell University(康奈尔大学); MBZUAI(穆巴达拉人工智能研究院); NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features–most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the first to introduce KV caching for MDMs while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to 65x faster inference than standard MDMs and 4x faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [this http URL](this http URL)
zh

[NLP-8] Large language models can learn and generalize steganographic chain-of-thought under process supervision NEURIPS2025

【速读】: 该论文试图解决生成式 AI (Generative AI) 在链式思维 (Chain-of-thought, CoT) 过程中可能表现出的有害意图被隐藏或模糊化的问题,这会削弱 CoT 监控的有效性。解决方案的关键在于揭示模型如何通过替代特定字符串或构建通用编码方案来规避惩罚机制,从而在不改变任务执行方法的前提下,实现对推理过程的隐写编码。这一发现表明,模型能够内部化并编码其推理路径,进而影响 CoT 监控的可靠性。

链接: https://arxiv.org/abs/2506.01926
作者: Joey Skaf,Luis Ibanez-Lissen,Robert McCarthy,Connor Watts,Vasil Georgiv,Hannes Whittingham,Lorena Gonzalez-Manzano,David Lindner,Cameron Tice,Edward James Young,Puria Radmard
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages main text, 3 figures main text, 15 pages supplementary material, 1 figure supplementary material, submitted to NeurIPS 2025

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT indicating misaligned or harmful intent, CoT monitoring can be used to reduce risks associated with deploying models. However, developers may be incentivized to train away the appearance of harmful intent from CoT traces, by either customer preferences or regulatory requirements. Recent works have shown that banning mention of a specific example of reward hacking, which may be done either to make CoT presentable to users or as a naive attempt to prevent the behavior, causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior. Such obfuscation threatens the reliability of CoT monitoring. However, obfuscation of reasoning can be due to its internalization to latent space computation, or its encoding within the CoT. Here, we provide an extension to these results. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.
zh

[NLP-9] From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

【速读】: 该论文试图解决阿拉伯语语言模型评估中的关键缺口,特别是在语言准确性、文化契合度和方法严谨性方面的不足。其解决方案的关键是提出了一个新颖的评估框架,并构建了阿拉伯深度小数据集(Arabic Depth Mini Dataset, ADMD),该数据集包含490个跨十个主要领域(42个子领域)的挑战性问题,用于全面评估五种领先的语言模型,从而揭示模型在不同领域中的性能差异及对深度文化理解和专业知识的需求。

链接: https://arxiv.org/abs/2506.01920
作者: Serry Sibaee,Omer Nacar,Adel Ammar,Yasser Al-Habashi,Abdulrahman Al-Batati,Wadii Boulila
机构: Prince Sultan University (沙特国王 Saud 大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets, identifying significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations in LLMs, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains, see Figure 1. Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variations in model performance across different domains, with particular challenges in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet demonstrated the highest overall accuracy at 30%, showing relative strength in mathematical theory in Arabic, Arabic language, and islamic domains. This work provides both theoretical foundations and practical insights for improving Arabic language model evaluation, emphasizing the importance of cultural competence alongside technical capabilities.
zh

[NLP-10] Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis

【速读】: 该论文试图解决单细胞大语言模型(LLMs)在处理图像质谱流式细胞术(IMC)数据时的两个主要问题:一是空间信息整合不足,模型难以泛化空间坐标并有效编码空间上下文为文本;二是将每个细胞独立处理,忽略了细胞间相互作用,限制了对生物关系的捕捉。解决方案的关键在于提出一种名为Spatial2Sentence的新框架,该框架通过多句法将单细胞表达和空间信息整合为自然语言,构建表达相似性和距离矩阵,将空间相邻且表达相似的细胞作为正样本对,而将距离远且不相似的细胞作为负样本对,从而让大语言模型在表达和空间上下文中学习细胞间相互作用。

链接: https://arxiv.org/abs/2506.01918
作者: Chi-Jane Chen,Yuhang Chen,Sukwon Yun,Natalie Stanley,Tianlong Chen
机构: The University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Image mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry’s analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single-cell expression and spatial information into natural language using a multi-sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: this https URL.
zh

[NLP-11] Enhancing Biomedical Multi-modal Representation Learning with Multi-scale Pre-training and Perturbed Report Discrimination

【速读】: 该论文试图解决在生物医学领域中,传统对比学习方法在处理具有复杂和领域特异性语义的生物医学文本时表现不足的问题。其解决方案的关键在于提出一种名为“扰动报告区分”(perturbed report discrimination)的新方法,通过引入保持词汇不变但破坏句子语义结构的文本扰动技术,并利用模型区分原始报告与扰动报告的能力来增强多模态表示的学习。同时,该方法通过对比注意力加权的图像子区域和图像-文本对中的子词,提升对两种模态更细粒度特征的敏感性。

链接: https://arxiv.org/abs/2506.01902
作者: Xinliu Zhong,Kayhan Batmanghelich,Li Sun
机构: National University of Singapore (新加坡国立大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 6 pages, 1 figure, accepted by 2024 IEEE Conference on Artificial Intelligence (CAI)

点击查看摘要

Abstract:Vision-language models pre-trained on large scale of unlabeled biomedical images and associated reports learn generalizable semantic representations. These multi-modal representations can benefit various downstream tasks in the biomedical domain. Contrastive learning is widely used to pre-train vision-language models for general natural images and associated captions. Despite its popularity, we found biomedical texts have complex and domain-specific semantics that are often neglected by common contrastive methods. To address this issue, we propose a novel method, perturbed report discrimination, for pre-train biomedical vision-language models. First, we curate a set of text perturbation methods that keep the same words, but disrupt the semantic structure of the sentence. Next, we apply different types of perturbation to reports, and use the model to distinguish the original report from the perturbed ones given the associated image. Parallel to this, we enhance the sensitivity of our method to higher level of granularity for both modalities by contrasting attention-weighted image sub-regions and sub-words in the image-text pairs. We conduct extensive experiments on multiple downstream tasks, and our method outperforms strong baseline methods. The results demonstrate that our approach learns more semantic meaningful and robust multi-modal representations.
zh

[NLP-12] WHEN TO ACT WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue

【速读】: 该论文试图解决任务导向对话系统在用户话语语义上看似完整但缺乏必要结构信息以触发适当系统行为的问题,这一问题源于用户对其自身需求理解不足,而系统需要精确的意图定义。解决方案的关键在于提出STORM框架,通过UserLLM(具有完整内部访问权限)与AgentLLM(仅可观测行为)之间的对话建模非对称信息动态,从而生成标注语料库,捕捉表达轨迹和潜在认知转变,实现对协作理解发展的系统性分析。

链接: https://arxiv.org/abs/2506.01881
作者: Yaoyao Qian,Jindan Huang,Yuanli Wang,Simon Yu,Kyrie Zhixuan Zhou,Jiayuan Mao,Mingfu Liang,Hanhan Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 43 pages, 31 figures. Project website: this https URL

点击查看摘要

Abstract:Task-oriented dialogue systems often face difficulties when user utterances seem semantically complete but lack necessary structural information for appropriate system action. This arises because users frequently do not fully understand their own needs, while systems require precise intent definitions. Current LLM-based agents cannot effectively distinguish between linguistically complete and contextually triggerable expressions, lacking frameworks for collaborative intent formation. We present STORM, a framework modeling asymmetric information dynamics through conversations between UserLLM (full internal access) and AgentLLM (observable behavior only). STORM produces annotated corpora capturing expression trajectories and latent cognitive transitions, enabling systematic analysis of collaborative understanding development. Our contributions include: (1) formalizing asymmetric information processing in dialogue systems; (2) modeling intent formation tracking collaborative understanding evolution; and (3) evaluation metrics measuring internal cognitive improvements alongside task performance. Experiments across four language models reveal that moderate uncertainty (40-60%) can outperform complete transparency in certain scenarios, with model-specific patterns suggesting reconsideration of optimal information completeness in human-AI collaboration. These findings contribute to understanding asymmetric reasoning dynamics and inform uncertainty-calibrated dialogue system design.
zh

[NLP-13] When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR ACL2025

【速读】: 该论文试图解决密集检索器(dense retriever)在面对不断演变的现实世界语料库时,由于分布偏移(distribution shift)导致检索性能下降的问题。解决方案的关键在于提出一种新的任务,即在索引前预测语料库是否相对于密集检索器为分布外(out-of-distribution, OOD),从而实现对检索器更新的主动管理。其核心方法GradNormIR是一种无监督方法,通过利用梯度范数来有效检测OOD语料库,从而在动态文档集合中实现及时的检索器更新,显著提升检索系统的鲁棒性和效率。

链接: https://arxiv.org/abs/2506.01877
作者: Dayoon Ko,Jinyoung Kim,Sohyeon Kim,Jinhyuk Kim,Jaehoon Lee,Seonghak Song,Minyoung Lee,Gunhee Kim
机构: Seoul National University (首尔国立大学); POSTECH (浦项科技大学); Samsung SDS (三星SDS)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Dense retrievers encode texts into embeddings to efficiently retrieve relevant documents from large databases in response to user queries. However, real-world corpora continually evolve, leading to a shift from the original training distribution of the retriever. Without timely updates or retraining, indexing newly emerging documents can degrade retrieval performance for future queries. Thus, identifying when a dense retriever requires an update is critical for maintaining robust retrieval systems. In this paper, we propose a novel task of predicting whether a corpus is out-of-distribution (OOD) relative to a dense retriever before indexing. Addressing this task allows us to proactively manage retriever updates, preventing potential retrieval failures. We introduce GradNormIR, an unsupervised approach that leverages gradient norms to detect OOD corpora effectively. Experiments on the BEIR benchmark demonstrate that GradNormIR enables timely updates of dense retrievers in evolving document collections, significantly enhancing retrieval robustness and efficiency.
zh

[NLP-14] Is Extending Modality The Right Path Towards Omni-Modality?

【速读】: 该论文试图解决当前多模态语言模型(Multimodal Language Models, MLLMs)在实现真正全模态(Omni-modality)能力方面的局限性,即现有模型尤其是开源模型难以在不同模态对之间进行有效泛化或处理多模态输入时表现不佳的问题。解决方案的关键在于研究扩展模态(modality extension)技术的效果,通过微调预训练语言模型以适应目标领域和语言数据,进而探讨其对核心语言能力的影响、模型融合的有效性以及全模态扩展相对于顺序扩展在知识共享和泛化能力上的优势。

链接: https://arxiv.org/abs/2506.01872
作者: Tinghui Zhu,Kai Zhang,Muhao Chen,Yu Su
机构: University of California, Davis (加州大学戴维斯分校); The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities–such as text, images, video, and audio–while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
zh

[NLP-15] Unified Scaling Laws for Compressed Representations

【速读】: 该论文试图解决在不同压缩表示(如稀疏、标量量化、稀疏量化或向量量化)下,如何准确预测模型性能的问题。其解决方案的关键在于验证一种通用的缩放定律公式,并证明该公式在单独和组合应用不同压缩类型时均具有适用性。通过引入一个基于表示能力的“容量”度量——即模型拟合随机高斯数据的能力,该研究展示了该度量能够稳健地预测多种压缩表示下的参数效率。

链接: https://arxiv.org/abs/2506.01863
作者: Andrei Panferov,Alexandra Volkova,Ionut-Vlad Modoranu,Vage Egiazarian,Mher Safaryan,Dan Alistarh
机构: ISTA (Institute of Science and Technology Austria)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it is applicable both individually but also composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple “capacity” metric – based on the representation’s ability to fit random Gaussian data – which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
zh

[NLP-16] CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions ACL2025

【速读】: 该论文试图解决当前基准测试在复杂对话场景中对大型语言模型(Large Language Models, LLMs)的功能调用能力和响应质量评估不足的问题。其解决方案的关键在于构建了一个名为Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI)的对话基准,该基准包含109次人类模拟对话,涵盖313个用户回合和86个API,明确针对对话复杂性如后续问题、目标修正与切换、模糊及隐含目标等进行评估,并通过对话行为标注来衡量代理响应的质量。

链接: https://arxiv.org/abs/2506.01859
作者: Tamer Alkhouli,Katerina Margatina,James Gung,Raphael Shu,Claudia Zaghi,Monica Sunkara,Yi Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2025 (main conference)

点击查看摘要

Abstract:We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark1 designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation lengths, and chained function calling. Our results reveal that while some models are able to handle long conversations, and leverage more than 20+ APIs successfully, other models struggle with longer context or when increasing the number of APIs. We also report that the performance on chained function-calls is severely limited across the models. Overall, the top performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%) and Llama 3.1 405B (33.19%) followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).
zh

[NLP-17] Code-Switching and Syntax: A Large-Scale Experiment ACL2025

【速读】: 该论文试图解决语言转换(Code-Switching, CS)的语法解释是否具有普遍适用性的问题,即验证语法结构是否足以解释双语者在特定句法位置更频繁地进行语言转换的现象。解决方案的关键在于设计一项大规模、多语言、跨现象的实验,确保用于预测语言转换位置的系统仅基于句法信息进行判断,从而验证语法信息的充分性。实验结果表明,仅依靠句法信息即可使自动系统在最小对句中区分语言转换模式,其效果与双语人类相当,且学习到的句法模式具备良好的泛化能力。

链接: https://arxiv.org/abs/2506.01846
作者: Igor Sterner,Simone Teufel
机构: 未知
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2025

点击查看摘要

Abstract:The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.
zh

[NLP-18] Minimal Pair-Based Evaluation of Code-Switching ACL2025

【速读】: 该论文试图解决缺乏一种评估方法来量化大型语言模型(Large Language Models, LLMs)在多语言切换(Code-Switching, CS)使用上与双语者的一致性问题。现有方法在语言覆盖范围、对多样化的CS现象的适应性或可扩展性方面存在不足。论文提出的解决方案关键在于基于最小对(minimal pairs)的干预策略,每个最小对包含一个自然发生的CS句子和一个经过最小修改的变体,通过对比双语者和LLMs对这两个句子的偏好,评估模型在CS使用上的表现。

链接: https://arxiv.org/abs/2506.01840
作者: Igor Sterner,Simone Teufel
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.
zh

[NLP-19] CiteEval: Principle-Driven Citation Evaluation for Source Attribution ACL2025

【速读】: 该论文试图解决信息检索系统中引用质量评估(citation quality evaluation)的问题,当前的评估框架主要依赖自然语言推理(Natural Language Inference, NLI)来判断引用来源的支持性,但这种方法被作者认为是次优的代理指标。解决方案的关键在于提出CiteEval框架,该框架基于细粒度引用评估原则,考虑了广泛的上下文信息,包括引用来源、完整检索上下文、用户查询和生成文本。在此基础上,构建了CiteBench多领域基准,并开发了CiteEval-Auto模型驱动的评估指标,其与人类判断具有强相关性,能够更全面地捕捉引用的多维特性。

链接: https://arxiv.org/abs/2506.01829
作者: Yumo Xu,Peng Qi,Jifan Chen,Kunlun Liu,Rujun Han,Lan Liu,Bonan Min,Vittorio Castelli,Arshit Gupta,Zhiguo Wang
机构: AWS AI Labs (AWS人工智能实验室); Orby.ai (Orby.ai); Google (谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: ACL 2025

点击查看摘要

Abstract:Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto’s superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.
zh

[NLP-20] Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在判断职场中专业幽默语句适当性方面存在的不足。现有研究多关注LLMs与人类价值观的对齐,但忽略了幽默,尤其是职场环境中使用的专业幽默。解决方案的关键在于构建一个包含专业幽默语句及其决定适当性的特征的数据集,并通过评估五种LLMs的表现,揭示其在该任务上的局限性。

链接: https://arxiv.org/abs/2506.01819
作者: Moahmmadamin Shafiei,Hamidreza Saffari
机构: University of Milan (米兰大学); Politecnico di Milano (米兰理工大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:With the recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs), the automation of daily tasks, like automatic writing, is getting more and more attention. Hence, efforts have focused on aligning LLMs with human values, yet humor, particularly professional industrial humor used in workplaces, has been largely neglected. To address this, we develop a dataset of professional humor statements along with features that determine the appropriateness of each statement. Our evaluation of five LLMs shows that LLMs often struggle to judge the appropriateness of humor accurately.
zh

[NLP-21] BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses

【速读】: 该论文旨在解决AI驱动的导师在教育对话中的教学能力评估问题,具体包括错误识别(Mistake Identification)和错误定位(Mistake Location)两个任务。其解决方案的关键在于基于MPNet模型进行微调,采用类别加权交叉熵损失处理类别不平衡问题,并通过分组交叉验证(10折)最大化有限数据的利用效率,同时避免训练与验证集之间的对话重叠。最终通过硬投票集成各折的最佳模型,提升系统的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2506.01817
作者: Shadman Rohan,Ishita Sur Apan,Muhtasim Ibteda Shochcho,Md Fahim,Mohammad Ashfaq Ur Rahman,AKM Mahbubur Rahman,Amin Ahsan Ali
机构: Center for Computational & Data Sciences, Independent University, Bangladesh (IUB)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Team BD’s submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors, under Track 1 (Mistake Identification) and Track 2 (Mistake Location). Both tracks involve three-class classification of tutor responses in educational dialogues - determining if a tutor correctly recognizes a student’s mistake (Track 1) and whether the tutor pinpoints the mistake’s location (Track 2). Our system is built on MPNet, a Transformer-based language model that combines BERT and XLNet’s pre-training advantages. We fine-tuned MPNet on the task data using a class-weighted cross-entropy loss to handle class imbalance, and leveraged grouped cross-validation (10 folds) to maximize the use of limited data while avoiding dialogue overlap between training and validation. We then performed a hard-voting ensemble of the best models from each fold, which improves robustness and generalization by combining multiple classifiers. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set. We include comprehensive analysis of our system’s performance, including confusion matrices and t-SNE visualizations to interpret classifier behavior, as well as a taxonomy of common errors with examples. We hope our ensemble-based approach and findings provide useful insights for designing reliable tutor response evaluation systems in educational dialogue settings.
zh

[NLP-22] Analysis of LLM Bias (Chinese Propaganda Anti-US Sentiment) in DeepSeek -R1 vs. ChatGPT o3-mini-high

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在意识形态上的中立性问题,特别是针对具有不同地缘政治立场的模型进行直接的跨语言比较,即中国体制内模型与非中国体制模型之间的差异。其解决方案的关键在于构建了一个包含1,200个去情境化、以推理为导向的问题语料库,并通过混合评估流程(结合GPT-4o评分与人工标注)对DeepSeek-R1(中国体制对齐)和ChatGPT o3-mini-high(非中国体制)在中文国家宣传和反美情绪方面的表现进行了系统评估。

链接: https://arxiv.org/abs/2506.01814
作者: PeiHsuan Huang,ZihWei Lin,Simon Imbot,WenCheng Fu,Ethan Tu
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly shape public understanding and civic decisions, yet their ideological neutrality is a growing concern. While existing research has explored various forms of LLM bias, a direct, cross-lingual comparison of models with differing geopolitical alignments-specifically a PRC-system model versus a non-PRC counterpart-has been lacking. This study addresses this gap by systematically evaluating DeepSeek-R1 (PRC-aligned) against ChatGPT o3-mini-high (non-PRC) for Chinese-state propaganda and anti-U.S. sentiment. We developed a novel corpus of 1,200 de-contextualized, reasoning-oriented questions derived from Chinese-language news, presented in Simplified Chinese, Traditional Chinese, and English. Answers from both models (7,200 total) were assessed using a hybrid evaluation pipeline combining rubric-guided GPT-4o scoring with human annotation. Our findings reveal significant model-level and language-dependent biases. DeepSeek-R1 consistently exhibited substantially higher proportions of both propaganda and anti-U.S. bias compared to ChatGPT o3-mini-high, which remained largely free of anti-U.S. sentiment and showed lower propaganda levels. For DeepSeek-R1, Simplified Chinese queries elicited the highest bias rates; these diminished in Traditional Chinese and were nearly absent in English. Notably, DeepSeek-R1 occasionally responded in Simplified Chinese to Traditional Chinese queries and amplified existing PRC-aligned terms in its Chinese answers, demonstrating an “invisible loudspeaker” effect. Furthermore, such biases were not confined to overtly political topics but also permeated cultural and lifestyle content, particularly in DeepSeek-R1.
zh

[NLP-23] NAVER LABS Europe Submission to the Instruction-following Track

【速读】: 该论文旨在解决多任务语音处理问题,即在受限环境下同时实现自动语音识别(ASR)、语音翻译(ST)和语音问答(SQA)任务。其解决方案的关键在于利用两个预训练模块:一是基于SeamlessM4T-v2-large语音编码器表示训练的语音到大语言模型(LLM)嵌入投影器;二是基于Llama-3.1-8B-Instruct在文本数据上训练的LoRA适配器。这两个模块联合加载并进一步在多语言和多模态数据上进行1000步的指令微调,以构建最终的系统。

链接: https://arxiv.org/abs/2506.01808
作者: Beomseok Lee,Marcely Zanon Boito,Laurent Besacier,Ioan Calapodescu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper we describe NAVER LABS Europe submission to the instruction-following speech processing short track at IWSLT 2025. We participate in the constrained settings, developing systems that can simultaneously perform ASR, ST, and SQA tasks from English speech input into the following target languages: Chinese, Italian, and German. Our solution leverages two pretrained modules: (1) a speech-to-LLM embedding projector trained using representations from the SeamlessM4T-v2-large speech encoder; and (2) LoRA adapters trained on text data on top of a Llama-3.1-8B-Instruct. These modules are jointly loaded and further instruction-tuned for 1K steps on multilingual and multimodal data to form our final system submitted for evaluation.
zh

[NLP-24] Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives

【速读】: 该论文试图解决信息战(information warfare)中社交媒体平台在军事冲突中的作用及其影响机制问题,特别是通过分析乌克兰战争期间推特(X)上的推文来揭示不同阵营的信息传播策略。其解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)和机器学习算法对大量推文进行情感分析与主题识别,并结合人机协同(human-in-the-loop, HITL)分析方法,以识别出不同账号群体的传播模式和潜在的协调行为。

链接: https://arxiv.org/abs/2506.01807
作者: Zaur Gouliev
机构: University College Dublin, School of Information & Communication Studies
类目: Computation and Language (cs.CL)
备注: 7 pages; 6 figures

点击查看摘要

Abstract:The conflict in Ukraine has been not only characterised by military engagement but also by a significant information war, with social media platforms like X, formerly known as Twitter playing an important role in shaping public perception. This article provides an analysis of tweets from propaganda accounts and trusted accounts collected from the onset of the war, February 2022 until the middle of May 2022 with n=40,000 total tweets. We utilise natural language processing and machine learning algorithms to assess the sentiment and identify key themes, topics and narratives across the dataset with human-in-the-loop (HITL) analysis throughout. Our findings indicate distinct strategies in how information is created, spread, and targeted at different audiences by both sides. Propaganda accounts frequently employ emotionally charged language and disinformation to evoke fear and distrust, whereas other accounts, primarily Western tend to focus on factual reporting and humanitarian aspects of the conflict. Clustering analysis reveals groups of accounts with similar behaviours, which we suspect indicates the presence of coordinated efforts. This research attempts to contribute to our understanding of the dynamics of information warfare and offers techniques for future studies on social media influence in military conflicts.
zh

[NLP-25] Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books ACL2025

【速读】: 该论文试图解决在极低资源语言翻译中,语法书的有效性问题,特别是针对语法规则的检索与应用所面临的挑战。研究发现,规则检索是基于语法翻译的主要瓶颈,而大型语言模型在处理复杂规则时存在困难。解决方案的关键在于将语法规则表示为代码函数,利用代码结构的相似性和对大语言模型推理能力的促进作用,从而显著提升规则检索与应用的效果,最终实现翻译性能的提升。

链接: https://arxiv.org/abs/2506.01796
作者: Chen Zhang,Jiuheng Lin,Xiao Liu,Zekai Zhang,Yansong Feng
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:While large language models (LLMs) have shown promise in translating extremely low-resource languages using resources like dictionaries, the effectiveness of grammar books remains debated. This paper investigates the role of grammar books in translating extremely low-resource languages by decomposing it into two key steps: grammar rule retrieval and application. To facilitate the study, we introduce ZhuangRules, a modularized dataset of grammar rules and their corresponding test sentences. Our analysis reveals that rule retrieval constitutes a primary bottleneck in grammar-based translation. Moreover, although LLMs can apply simple rules for translation when explicitly provided, they encounter difficulties in handling more complex rules. To address these challenges, we propose representing grammar rules as code functions, considering their similarities in structure and the benefit of code in facilitating LLM reasoning. Our experiments show that using code rules significantly boosts both rule retrieval and application, ultimately resulting in a 13.1% BLEU improvement in translation.
zh

[NLP-26] Human-Centric Evaluation for Foundation Models

【速读】: 该论文试图解决当前基础模型评估主要依赖客观指标(如测验表现)而忽视真实人类体验的问题。其解决方案的关键在于提出一种以人为本的主观评估框架(Human-Centric Subjective Evaluation, HCE),该框架聚焦于问题解决能力、信息质量和交互体验三个核心维度,通过人类与模型在开放性研究任务中的协作,构建了一个全面的主观数据集,从而更真实地反映模型的实际应用效果和用户反馈。

链接: https://arxiv.org/abs/2506.01793
作者: Yijin Guo,Kaiyuan Ji,Xiaorong Zhu,Junying Wang,Farong Wen,Chunyi Li,Zicheng Zhang,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Lab (上海人工智能实验室); East China Normal University (华东师范大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3’s superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is this https URL.
zh

[NLP-27] Datasheets Arent Enough: DataRubrics for Automated Quality Metrics and Accountability

【速读】: 该论文试图解决高质量数据集创建与评估中的挑战,特别是在人类标注准确性、数据集原创性、多样性及质量控制方面存在的不足。其关键解决方案是将系统化的评分标准(rubric-based evaluation metrics)整合到数据集评审过程中,并引入DataRubrics框架,该框架基于大语言模型(LLM)的评估能力,提供可重复、可扩展且可操作的数据集质量评估方法。

链接: https://arxiv.org/abs/2506.01789
作者: Genta Indra Winata,David Anugraha,Emmy Liu,Alham Fikri Aji,Shou-Yi Hung,Aditya Parashar,Patrick Amadeus Irawan,Ruochen Zhang,Zheng-Xin Yong,Jan Christian Blaise Cruz,Niklas Muennighoff,Seungone Kim,Hanyang Zhao,Sudipta Kar,Kezia Erina Suryoraharjo,M. Farid Adilazuarda,En-Shiun Annie Lee,Ayu Purwarianti,Derry Tanti Wijaya,Monojit Choudhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: Preprint

点击查看摘要

Abstract:High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation-especially with accurate human annotations-remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process-particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at this https URL.
zh

[NLP-28] QUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型场景中出现的事实性错误问题,特别是针对复杂、多跳的问答任务中保持推理路径连贯性和避免过早丢弃关键多跳连接的挑战。其解决方案的关键在于提出iQUEST框架,该框架通过迭代分解复杂查询为更简单的子问题,确保推理轨迹的结构化和聚焦性,并结合图神经网络(Graph Neural Network, GNN)在每一步推理中引入两跳邻居信息,从而增强推理过程的有效性。

链接: https://arxiv.org/abs/2506.01784
作者: Shuai Wang,Yinan Yu
机构: Chalmers University of Technology (查尔姆斯理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 (Main)

点击查看摘要

Abstract:While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
zh

[NLP-29] MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation ACL2025

【速读】: 该论文试图解决现有评估方法在多语言和跨语言场景中对指令遵循能力评估的不足问题(instruction-following capabilities),其解决方案的关键是提出MaXIFE,一个涵盖23种语言、包含1,667个可验证指令任务的综合性评估基准,并结合规则基础评估与模型基础评估,以确保评估的效率与准确性。

链接: https://arxiv.org/abs/2506.01776
作者: Yile Liu,Ziwei Ma,Xiu Jiang,Jinglu Hu,Jing Chang,Liang Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Main Conference

点击查看摘要

Abstract:With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 languages with 1,667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial and open-source LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.
zh

[NLP-30] Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwakwala Legacy Texts

【速读】: 该论文旨在解决Kwak’wala语言早期文献的数字化与可读性问题,这些文献虽然已扫描成图像形式,但无法被机器识别。解决方案的关键在于采用最新的光学字符识别(Optical Character Recognition, OCR)技术,并结合现成的OCR方法、语言识别和掩码技术,以有效分离Kwak’wala文本,同时利用后期校正模型提升最终转录结果的质量。

链接: https://arxiv.org/abs/2506.01775
作者: Milind Agarwal,Daisy Rosenblum,Antonios Anastasopoulos
机构: George Mason University (乔治·梅森大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted to Comput-EL 2025 Workshop. Preprint

点击查看摘要

Abstract:Kwak’wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak’wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak’wala text, along with post-correction models, to produce a final high-quality transcription.
zh

[NLP-31] hinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning

【速读】: 该论文试图解决角色扮演代理(Role-Playing Agents, RPAs)在应用中因依赖显式对话数据而导致的内部思维深度不足、知识表达表面化以及风格漂移等问题。其解决方案的关键在于提出一种新的角色感知推理(Role-Aware Reasoning, RAR)方法,该方法包含两个关键阶段:角色身份激活(Role Identity Activation, RIA)和推理风格优化(Reasoning Style Optimization, RSO),通过引导模型在推理过程中保持角色一致性并优化推理风格,从而有效缓解注意力分散和风格漂移问题。

链接: https://arxiv.org/abs/2506.01748
作者: Yihong Tang,Kehai Chen,Muyun Yang,Zhengyu Niu,Jing Li,Tiejun Zhao,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳); Baidu Inc., Beijing, China (百度公司北京)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of Large Language Models (LLMs) has spurred significant interest in Role-Playing Agents (RPAs) for applications such as emotional companionship and virtual interaction. However, recent RPAs are often built on explicit dialogue data, lacking deep, human-like internal thought processes, resulting in superficial knowledge and style expression. While Large Reasoning Models (LRMs) can be employed to simulate character thought, their direct application is hindered by attention diversion (i.e., RPAs forget their role) and style drift (i.e., overly formal and rigid reasoning rather than character-consistent reasoning). To address these challenges, this paper introduces a novel Role-Aware Reasoning (RAR) method, which consists of two important stages: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). RIA explicitly guides the model with character profiles during reasoning to counteract attention diversion, and then RSO aligns reasoning style with the character and scene via LRM distillation to mitigate style drift. Extensive experiments demonstrate that the proposed RAR significantly enhances the performance of RPAs by effectively addressing attention diversion and style drift.
zh

[NLP-32] Benfords Curse: Tracing Digit Bias to Numerical Hallucination in LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在数值推理任务中出现的数字偏差问题,即模型在生成数值时倾向于遵循类似本福特定律(Benford’s Law)的长尾分布,而非均匀分布,导致错误输出。解决方案的关键在于通过分析预训练语料库中的数字频率是否符合本福特定律,并利用日志透镜追踪和神经元层面的分解技术,识别出深层网络中高度依赖数字的前馈网络(Feed-Forward Network, FFN)神经元是造成该偏差的主要原因。进一步通过剪枝这些神经元,验证了数字偏差对模型行为的影响,并为缓解数值任务中的幻觉现象提供了因果证据。

链接: https://arxiv.org/abs/2506.01734
作者: Jiandong Shao,Yao Lu,Jianfei Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford’s Law – a statistical pattern where lower digits occur more frequently as leading digits – we hypothesize that the long-tailed digit distributions in web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate the hypothesis, we first examine whether digits frequencies in pretraining corpus (OLMo2) follows Benford’s law. We then construct an evaluation benchmark with uniformly distributed ground-truth digits across seven numerical reasoning tasks. Our evaluation results demonstrate that leading open-source LLMs show a consistent pattern of digit bias that resembles Benford’s law. Through logit-lens tracing and neuron-level dissection, we identify that this bias arises predominantly from a small subset of highly digit-selective feed-forward network (FFN) neurons in the deeper layers. Finally, we demonstrate that pruning these neurons mitigates imbalanced overgeneration and partially corrects erroneous outputs, providing causal evidence that fine-grained pretraining digit bias can propagate into model behavior. Our findings reveal a fundamental connection between corpus-level statistics and symbolic failure modes in LLMs, offering a new lens for diagnosing and mitigating hallucinations in numerical tasks.
zh

[NLP-33] Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在预训练过程中使用包含大量受版权或专有内容的数据所引发的合规性问题,这些问题阻碍了模型在人工智能法规下的应用。解决方案的关键在于构建一个真正开放且符合数据安全法规的预训练数据集,即Common Corpus,其数据要么未受版权保护,要么采用允许使用的许可协议,总量约为两万亿个标记(tokens),并涵盖多种语言和领域,包括低资源语言和大量代码数据。

链接: https://arxiv.org/abs/2506.01732
作者: Pierre-Carl Langlais,Carlos Rosas Hinostroza,Mattia Nee,Catherine Arnett,Pavel Chizhov,Eliot Krzystof Jones,Irène Girard,David Mach,Anastasia Stasenko,Ivan P. Yamshchikov
机构: PleIAs, Paris, France
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with the data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets; in addition, it includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs in diverse areas of knowledge. In this technical report, we present the detailed provenance of data assembling and the details of dataset filtering and curation. Being already used by such industry leaders as Anthropic and multiple LLM training projects, we believe that Common Corpus will become a critical infrastructure for open science research in LLMs.
zh

[NLP-34] ug-of-war between idioms figurative and literal meanings in LLM s

【速读】: 该论文试图解决习语在语言模型中的非组合性隐喻意义与字面意义之间的歧义问题(idiom ambiguity),这一问题使得模型难以准确理解习语的真正含义。解决方案的关键在于利用机制可解释性工具,揭示大型预训练因果Transformer模型(LLama3.2-1B-base)在处理习语时的内部机制,包括通过早期注意力和MLP子层检索习语的隐喻意义,并识别出特定注意力头以增强隐喻意义并抑制字面意义;同时,模型通过中间路径表示隐喻表征,而并行旁路则传递字面解释,确保两种解读均被保留。

链接: https://arxiv.org/abs/2506.01723
作者: Soyoung Oh,Xinting Huang,Mathis Pink,Michael Hahn,Vera Demberg
机构: Saarland University (萨尔兰大学); Max Planck Institute for Software Systems (马克斯·普朗克软件系统研究所); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Idioms present a unique challenge for language models due to their non-compositional figurative meanings, which often strongly diverge from the idiom’s literal interpretation. This duality requires a model to learn representing and deciding between the two meanings to interpret an idiom in a figurative sense, or literally. In this paper, we employ tools from mechanistic interpretability to trace how a large pretrained causal transformer (LLama3.2-1B-base) deals with this ambiguity. We localize three steps of idiom processing: First, the idiom’s figurative meaning is retrieved in early attention and MLP sublayers. We identify specific attention heads which boost the figurative meaning of the idiom while suppressing the idiom’s literal interpretation. The model subsequently represents the figurative representation through an intermediate path. Meanwhile, a parallel bypass route forwards literal interpretation, ensuring that a both reading remain available. Overall, our findings provide a mechanistic evidence for idiom comprehension in an autoregressive transformer.
zh

[NLP-35] Self-Challenging Language Model Agents

【速读】: 该论文旨在解决训练智能代理(Intelligent Agent)过程中因需要人工创建和标注多样化任务、工具及评估标准而带来的挑战。其解决方案的关键在于提出Self-Challenging框架,该框架通过代理自身生成高质量任务来实现训练。具体而言,代理首先扮演挑战者角色,与给定工具交互后生成任务,这些任务以“Code-as-Task”形式呈现,包含指令、验证函数以及正反例,用于筛选高质量任务;随后代理切换为执行者角色,利用强化学习在这些任务上进行训练,并将评估反馈作为奖励信号。

链接: https://arxiv.org/abs/2506.01716
作者: Yifei Zhou,Sergey Levine,Jason Weston,Xian Li,Sainbayar Sukhbaatar
机构: UC Berkeley (加州大学伯克利分校); FAIR at Meta (Meta人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.
zh

[NLP-36] SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在需要显式自我反思和自我修正的复杂问题上表现不足的问题,尤其是在与单模态文本模型相比时。其解决方案的关键在于提出一种名为“多模态自我反思增强的群体相对策略优化”(SRPO)的两阶段反射感知强化学习(RL)框架。该框架通过构建高质量的反射数据集以及引入新颖的奖励机制,有效提升了模型的推理能力和反思质量。

链接: https://arxiv.org/abs/2506.01713
作者: Zhongwei Wan,Zhihao Dou,Che Liu,Yu Zhang,Dongfei Cui,Qinjian Zhao,Hui Shen,Jing Xiong,Yi Xin,Yifan Jiang,Yangfan He,Mi Zhang,Shen Yan
机构: The Ohio State University (俄亥俄州立大学); Case Western Reserve University (凯斯西储大学); Imperial College London (帝国理工学院); Duke University (杜克大学); Kean University (肯恩大学); University of Michigan (密歇根大学); University of Southern California (南加利福尼亚大学); University of Minnesota (明尼苏达大学); The University of Hong Kong (香港大学); Tongji University (同济大学); Nanjing University (南京大学); ByteDance (字节跳动)
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
zh

[NLP-37] Reasoning -Table: Exploring Reinforcement Learning for Table Reasoning

【速读】: 该论文旨在解决表推理(table reasoning)任务中模型泛化能力和鲁棒性不足的问题,这些问题通常由模仿学习固有的偏差导致。其解决方案的关键在于首次将强化学习(reinforcement learning, RL)应用于表推理,通过严格的数据预处理、奖励设计和定制化的训练策略,利用简单的基于规则的结果奖励,实现了优于监督微调(supervised fine-tuning, SFT)的方法,并在多个基准测试中取得了最先进的性能。

链接: https://arxiv.org/abs/2506.01710
作者: Fangyu Lei,Jinxiang Meng,Yiming Huang,Tinghong Chen,Yun Zhang,Shizhu He,Jun Zhao,Kang Liu
机构: Institute of Automation, CAS; University of Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model’s generalization capabilities and robustness.
zh

[NLP-38] Fairness Dynamics During Training

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)训练过程中公平性动态变化的问题,旨在通过训练干预手段如早停(early stopping)来诊断偏见并进行缓解。其解决方案的关键在于引入两个新的评估指标——平均排名(Average Rank)和按部分计算的Jensen-Shannon散度(Jensen-Shannon Divergence by Parts),以全面评估模型预训练阶段的公平性动态变化,从而揭示模型在性别预测任务中的偏见演化过程。

链接: https://arxiv.org/abs/2506.01709
作者: Krishna Patel,Nivedha Sivakumar,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff
机构: Apple(苹果)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate fairness dynamics during Large Language Model (LLM) training to enable the diagnoses of biases and mitigations through training interventions like early stopping; we find that biases can emerge suddenly and do not always follow common performance metrics. We introduce two new metrics to evaluate fairness dynamics holistically during model pre-training: Average Rank and Jensen-Shannon Divergence by Parts. These metrics provide insights into the Pythia models’ progression of biases in gender prediction of occupations on the WinoBias dataset. By monitoring these dynamics, we find that (1) Pythia-6.9b is biased towards men; it becomes more performant and confident predicting “male” than “female” during training, (2) via early-stopping, Pythia-6.9b can exchange 1.7% accuracy on LAMBADA for a 92.5% increase in fairness, and (3) larger models can exhibit more bias; Pythia-6.9b makes more assumptions about gender than Pythia-160m, even when a subject’s gender is not specified.
zh

[NLP-39] mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection

【速读】: 该论文试图解决生成式 AI (Generative AI) 生成文本的自动化检测问题,特别是针对分布外数据(out-of-distribution data)的鲁棒性挑战。解决方案的关键在于基于微调较小的大型语言模型(LLMs)进行文本分类,该方法在 Voight-Kampff Generative AI Detection 2025 的两个子任务中均表现出色,实现了二分类检测和多类别(第一名)分类的优异性能。

链接: https://arxiv.org/abs/2506.01702
作者: Dominik Macko
机构: Kempelen Institute of Intelligent Technologies(凯姆佩伦智能技术研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The large language models (LLMs) are able to generate high-quality texts in multiple languages. Such texts are often not recognizable by humans as generated, and therefore present a potential of LLMs for misuse (e.g., plagiarism, spams, disinformation spreading). An automated detection is able to assist humans to indicate the machine-generated texts; however, its robustness to out-of-distribution data is still challenging. This notebook describes our mdok approach in robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025, providing remarkable performance in binary detection as well as in multiclass (1st rank) classification of various cases of human-AI collaboration.
zh

[NLP-40] When LLM s Team Up: The Emergence of Collaborative Affective Computing

【速读】: 该论文试图解决传统情感计算(Affective Computing, AC)在自然语言处理(Natural Language Processing, NLP)任务中因管道架构结构僵化导致的效率低下和适应性不足的问题,以及大型语言模型(Large Language Models, LLMs)在情感推理中的认知局限,如对文化细微差别或情境情感的误解及决策中的幻觉问题。解决方案的关键在于提出基于LLMs的协作系统,通过专业化模型与LLMs之间的交互,模拟人类情感智能,结合情感与理性思维的协同作用,以提升复杂情感推理的鲁棒性和适应性。

链接: https://arxiv.org/abs/2506.01698
作者: Wenna Lai,Haoran Xie,Guandong Xu,Qing Li,S. Joe Qin
机构: Hong Kong Polytechnic University (香港理工大学); Lingnan University (岭南大学); University of Technology Sydney (悉尼科技大学); Education University of Hong Kong (香港教育大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures, and 3 tables

点击查看摘要

Abstract:Affective Computing (AC) is essential in bridging the gap between human emotional experiences and machine understanding. Traditionally, AC tasks in natural language processing (NLP) have been approached through pipeline architectures, which often suffer from structure rigidity that leads to inefficiencies and limited adaptability. The advent of Large Language Models (LLMs) has revolutionized this field by offering a unified approach to affective understanding and generation tasks, enhancing the potential for dynamic, real-time interactions. However, LLMs face cognitive limitations in affective reasoning, such as misinterpreting cultural nuances or contextual emotions, and hallucination problems in decision-making. To address these challenges, recent research advocates for LLM-based collaboration systems that emphasize interactions among specialized models and LLMs, mimicking human-like affective intelligence through the synergy of emotional and rational thinking that aligns with Dual Process Theory in psychology. This survey aims to provide a comprehensive overview of LLM-based collaboration systems in AC, exploring from structured collaborations to autonomous collaborations. Specifically, it includes: (1) A systematic review of existing methods, focusing on collaboration strategies, mechanisms, key functions, and applications; (2) Experimental comparisons of collaboration strategies across representative tasks in affective understanding and generation; (3) An analysis highlighting the potential of these systems to enhance robustness and adaptability in complex affective reasoning; (4) A discussion of key challenges and future research directions to further advance the field. This work is the first to systematically explore collaborative intelligence with LLMs in AC, paving the way for more powerful applications that approach human-like social intelligence.
zh

[NLP-41] Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents

【速读】: 该论文试图解决现有查询-回答数据集主要关注文本响应,难以应对需要视觉演示或解释的复杂用户查询的问题。解决方案的关键是构建一个名为RealVideoQuest的基准,用于评估文本到视频(T2V)模型在回答现实世界、视觉基础查询方面的能力,通过多阶段视频检索和精炼过程生成高质量的查询-视频对,并开发多角度评估系统以衡量生成视频答案的质量。

链接: https://arxiv.org/abs/2506.01689
作者: Shuting Wang,Yunqi Liu,Zixin Yang,Ning Hu,Zhicheng Dou,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学); Renmin University of China (中国人民大学); Serendipity One Inc. (塞伦迪皮蒂一号公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets primarily focus on textual responses, making it challenging to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle with effectively addressing real user queries, pointing to key challenges and future research opportunities in multimodal AI.
zh

[NLP-42] StochasTok: Improving Fine-Grained Subword Understanding in LLM s

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在处理子词级(subword-level)任务时表现不佳的问题,例如统计单词中的字母数量、识别子字符串等。其关键解决方案是引入一种简单高效的随机分词方法——StochasTok,该方法在训练过程中随机分割标记,使模型能够“看到”词汇的内部结构,从而提升对子词级特征的理解能力。

链接: https://arxiv.org/abs/2506.01687
作者: Anya Sims,Thom Foster,Klara Kaleb,Tuan-Duy H. Nguyen,Joseph Lee,Jakob N. Foerster,Yee Whye Teh,Cong Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like How many ‘r’s in ‘strawberry’?. A key factor behind these failures is tokenization which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: this https URL.
zh

[NLP-43] Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon ACL2025

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在多语言环境下文化知识获取机制不明确的问题,特别是文化知识在语言适应过程中的跨语言迁移机制。其解决方案的关键在于引入一个可解释的框架,以研究文化知识的跨语言迁移,并通过确保训练数据的透明性和控制迁移效应来实现对迁移过程的有效分析。

链接: https://arxiv.org/abs/2506.01675
作者: Chen Zhang,Zhiyuan Liao,Yansong Feng
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:Despite substantial research efforts evaluating how well large language models~(LLMs) handle global cultural diversity, the mechanisms behind their cultural knowledge acquisition, particularly in multilingual settings, remain unclear. We study this question by investigating how cultural knowledge transfers across languages during language adaptation of LLMs. We introduce an interpretable framework for studying this transfer, ensuring training data transparency and controlling transfer effects. Through a study of four non-Anglophonic cultures, we observe bidirectional cultural transfer between English and other high-resource languages, while low-resource languages primarily transfer knowledge to English with limited reverse flow. To explain this asymmetric phenomenon, we propose a frequency-based hypothesis: cultural knowledge appearing more frequently in the pretraining data transfers more easily, which is supported by empirical analysis of the training corpora.
zh

[NLP-44] GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion ACL2025

【速读】: 该论文旨在解决生成式推荐系统中两个关键问题:一是如何有效融入隐式的物品关系,二是如何高效利用丰富但冗长的物品信息。其解决方案的关键在于提出一种语义感知的多粒度晚期融合方法(GRAM),通过语义到词法的转换将隐式的层级与协作物品关系编码至大语言模型的词汇空间,并采用多粒度晚期融合机制,在解码阶段才进行信息融合,从而在保持较少信息损失的前提下高效整合丰富语义。

链接: https://arxiv.org/abs/2506.01673
作者: Sunkyung Lee,Minjin Choi,Eunseong Choi,Hye-young Kim,Jongwuk Lee
机构: Sungkyunkwan University (成均馆大学); Samsung Research (三星研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2025 (Main Conference)

点击查看摘要

Abstract:Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at this https URL.
zh

[NLP-45] AIMSCheck: Leverag ing LLM s for AI-Assisted Review of Modern Slavery Statements Across Jurisdictions ACL2025

【速读】: 该论文试图解决企业合规声明中关于现代奴隶制(Modern Slavery)的透明度验证问题,这一过程因声明语言复杂多样且数量庞大而面临挑战,同时由于标注数据稀缺以及法律管辖范围差异,开发通用的自然语言处理(NLP)工具也存在困难。论文的关键解决方案是提出AIMSCheck框架,该框架将合规评估任务分解为三个层次,以提高可解释性和实际应用性,并通过跨司法管辖区的标注数据集支持模型的泛化能力,从而推动AI在合规评估中的应用与研究。

链接: https://arxiv.org/abs/2506.01671
作者: Adriana Eufrosina Bora,Akshatha Arodi,Duoyi Zhang,Jordan Bannister,Mirko Bronzi,Arsene Fansi Tchango,Md Abul Bashar,Richi Nayak,Kerrie Mengersen
机构: Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); The Queensland University of Technology (昆士兰科技大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 27 pages, to appear at ACL 2025

点击查看摘要

Abstract:Modern Slavery Acts mandate that corporations disclose their efforts to combat modern slavery, aiming to enhance transparency and strengthen practices for its eradication. However, verifying these statements remains challenging due to their complex, diversified language and the sheer number of statements that must be reviewed. The development of NLP tools to assist in this task is also difficult due to a scarcity of annotated data. Furthermore, as modern slavery transparency legislation has been introduced in several countries, the generalizability of such tools across legal jurisdictions must be studied. To address these challenges, we work with domain experts to make two key contributions. First, we present this http URL and this http URL, newly annotated datasets from the UK and Canada to enable cross-jurisdictional evaluation. Second, we introduce AIMSCheck, an end-to-end framework for compliance validation. AIMSCheck decomposes the compliance assessment task into three levels, enhancing interpretability and practical applicability. Our experiments show that models trained on an Australian dataset generalize well across UK and Canadian jurisdictions, demonstrating the potential for broader application in compliance monitoring. We release the benchmark datasets and AIMSCheck to the public to advance AI-adoption in compliance assessment and drive further research in this field.
zh

[NLP-46] ESGenius: Benchmarking LLM s on Environmental Social and Governance (ESG) and Sustainability Knowledge

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在环境、社会和治理(Environmental, Social and Governance, ESG)及可持续发展相关问答任务中的能力评估与提升问题。其解决方案的关键在于构建了一个全面的基准测试集ESGenius,包含ESGenius-QA(由LLMs生成并经领域专家验证的1 136道多选题)和ESGenius-Corpus(涵盖231份权威来源的基础框架、标准、报告和建议文档的语料库),并通过零样本(Zero-Shot)和检索增强生成(Retrieval-Augmented Generation, RAG)两阶段评估协议,验证模型在跨学科ESG场景下的表现,强调了基于权威资料进行回答的重要性。

链接: https://arxiv.org/abs/2506.01646
作者: Chaoyue He,Xin Zhou,Yi Wu,Xinjia Yu,Yan Zhang,Lei Zhang,Di Wang,Shengfei Lyu,Hong Xu,Xiaoqiao Wang,Wei Liu,Chunyan Miao
机构: Alibaba-NTU Global e-Sustainability CorpLab (ANGEL), Singapore; Alibaba Group, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 37 pages, 8 figures, 11 tables

点击查看摘要

Abstract:We introduce ESGenius, a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social and Governance (ESG) and sustainability-focused question answering. ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1 136 multiple-choice questions generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics. Each question is systematically linked to its corresponding source text, enabling transparent evaluation and supporting retrieval-augmented generation (RAG) methods; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports and recommendation documents from seven authoritative sources. Moreover, to fully assess the capabilities and adaptation potential of the model, we implement a rigorous two-stage evaluation protocol – Zero-Shot and RAG. Extensive experiments across 50 LLMs (ranging from 0.5 B to 671 B parameters) demonstrate that state-of-the-art models achieve only moderate performance in zero-shot settings, with accuracies typically around 55–70%, highlighting ESGenius’s challenging nature for LLMs in interdisciplinary contexts. However, models employing RAG show significant performance improvements, particularly for smaller models. For example, “DeepSeek-R1-Distill-Qwen-14B” improves from 63.82% (zero-shot) to 80.46% with RAG. These results underscore the necessity of grounding responses in authoritative sources for enhanced ESG understanding. To the best of our knowledge, ESGenius is the first benchmark curated for LLMs and the relevant enhancement technologies that focuses on ESG and sustainability topics.
zh

[NLP-47] Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons ACL2025

【速读】: 该论文旨在探究多语言语言模型(Multilingual Language Models, MLLMs)在预训练过程中如何实现跨语言知识迁移的问题,特别是其参数空间中的表示演变机制。研究的关键在于通过分析模型的参数空间和神经元的演化过程,揭示模型从初始的语言特定表示逐步收敛为跨语言抽象表示的规律,并发现特定神经元在不同语言中对同一语义概念的逐渐对齐现象,从而为理解MLLM的跨语言表征学习提供了新的视角。

链接: https://arxiv.org/abs/2506.01629
作者: Frederick Riemenschneider,Anette Frank
机构: Heidelberg University (海德堡大学)
类目: Computation and Language (cs.CL)
备注: Paper accepted for publication at ACL 2025 Main; 10 pages, 20 figures, 4 tables

点击查看摘要

Abstract:Multilingual language models (MLLMs) have demonstrated remarkable abilities to transfer knowledge across languages, despite being trained without explicit cross-lingual supervision. We analyze the parameter spaces of three MLLMs to study how their representations evolve during pre-training, observing patterns consistent with compression: models initially form language-specific representations, which gradually converge into cross-lingual abstractions as training progresses. Through probing experiments, we observe a clear transition from uniform language identification capabilities across layers to more specialized layer functions. For deeper analysis, we focus on neurons that encode distinct semantic concepts. By tracing their development during pre-training, we show how they gradually align across languages. Notably, we identify specific neurons that emerge as increasingly reliable predictors for the same concepts across languages.
zh

[NLP-48] MVAN: Multi-View Attention Networks for Fake News Detection on Social Media

【速读】: 该论文试图解决社交媒体中虚假新闻检测的问题,特别是在更现实的场景下,仅提供源短文本微博及其转发用户,而没有用户评论的情况。解决方案的关键在于提出一种基于神经网络的多视角注意力网络(Multi-View Attention Networks, MVAN),该模型结合了文本语义注意力和传播结构注意力,从而能够同时捕捉源微博内容和传播结构中的信息与线索。此外,模型中的两种注意力机制能够识别虚假新闻文本中的关键线索词和传播结构中的可疑用户。

链接: https://arxiv.org/abs/2506.01627
作者: Shiwen Ni,Jiawen Li,Hung-Yu Kao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fake news on social media is a widespread and serious problem in today’s society. Existing fake news detection methods focus on finding clues from Long text content, such as original news articles and user comments. This paper solves the problem of fake news detection in more realistic scenarios. Only source shot-text tweet and its retweet users are provided without user comments. We develop a novel neural network based model, \textbfMulti-\textbfView \textbfAttention \textbfNetworks (MVAN) to detect fake news and provide explanations on social media. The MVAN model includes text semantic attention and propagation structure attention, which ensures that our model can capture information and clues both of source tweet content and propagation structure. In addition, the two attention mechanisms in the model can find key clue words in fake news texts and suspicious users in the propagation structure. We conduct experiments on two real-world datasets, and the results demonstrate that MVAN can significantly outperform state-of-the-art methods by 2.5% in accuracy on average, and produce a reasonable explanation.
zh

[NLP-49] Domain Lexical Knowledge-based Word Embedding Learning for Text Classification under Small Data

【速读】: 该论文试图解决预训练语言模型如BERT在某些文本分类应用(如情感识别和情感分析)中表现不佳的问题,尤其是在关键词对类别预测起关键作用的场景下。其核心问题在于BERT对关键词的上下文嵌入可能缺乏足够的判别性,无法生成有效的文本表示用于分类。解决方案的关键是利用领域特定的词汇知识增强词向量,通过将BERT嵌入投影到一个最大化类内相似性和类间差异的新空间中,从而提升分类性能。

链接: https://arxiv.org/abs/2506.01621
作者: Zixiao Zhu,Kezhi Mao
机构: Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:Pre-trained language models such as BERT have been proved to be powerful in many natural language processing tasks. But in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to satisfactory performance. This often happens in applications where keywords play critical roles in the prediction of class labels. Our investigation found that the root cause of the problem is that the context-based BERT embedding of the keywords may not be discriminative enough to produce discriminative text representation for classification. Motivated by this finding, we develop a method to enhance word embeddings using domain-specific lexical knowledge. The knowledge-based embedding enhancement model projects the BERT embedding into a new space where within-class similarity and between-class difference are maximized. To implement the knowledge-based word embedding enhancement model, we also develop a knowledge acquisition algorithm for automatically collecting lexical knowledge from online open sources. Experiment results on three classification tasks, including sentiment analysis, emotion recognition and question answering, have shown the effectiveness of our proposed word embedding enhancing model. The codes and datasets are in this https URL.
zh

[NLP-50] IndicRAG Suite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

【速读】: 该论文试图解决印度语言中高质量检索增强生成(Retrieval-Augmented Generation, RAG)系统开发所面临的两大关键问题:缺乏用于检索和生成任务的评估基准以及多语言检索的大规模训练数据。解决方案的关键在于构建IndicMSMarco,这是一个涵盖13种印度语言的多语言基准,通过手动翻译MS MARCO-dev集中的1000个多样化查询创建,以评估检索质量和响应生成;同时,利用最先进的大语言模型(LLMs)从19种印度语言的维基百科中构建大规模的(问题、答案、相关段落)三元组数据集,并引入原始MS MARCO数据集的翻译版本以进一步丰富训练数据并确保与实际信息检索任务的一致性。

链接: https://arxiv.org/abs/2506.01615
作者: Pasunuti Prasanjith,Prathmesh B More,Anoop Kunchukuttan,Raj Dabre
机构: 未知
类目: Computation and Language (cs.CL)
备注: WIP

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered around English or high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, created via manual translation of 1000 diverse queries from MS MARCO-dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: this https URL
zh

[NLP-51] MMD-Sense-Analysis: Word Sense Detection Leverag ing Maximum Mean Discrepancy

【速读】: 该论文试图解决词义变化检测(word sense change detection)问题,即识别和解释词语意义随时间的变化。解决方案的关键在于提出一种名为MMD-Sense-Analysis的新方法,该方法利用最大均值差异(Maximum Mean Discrepancy, MMD)来选择语义上有意义的变量,并量化不同时期之间的变化,从而实现对经历语义演变的词语的识别及其演变过程的解释。据作者所知,这是MMD首次被应用于词义变化检测任务。

链接: https://arxiv.org/abs/2506.01602
作者: Kensuke Mitsuzawa
机构: Université Côte d’Azur (科蒂尔大学); CNRS (法国国家科学研究中心); LJAD (应用数学与概率实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Word sense analysis is an essential analysis work for interpreting the linguistic and social backgrounds. The word sense change detection is a task of identifying and interpreting shifts in word meanings over time. This paper proposes MMD-Sense-Analysis, a novel approach that leverages Maximum Mean Discrepancy (MMD) to select semantically meaningful variables and quantify changes across time periods. This method enables both the identification of words undergoing sense shifts and the explanation of their evolution over multiple historical periods. To my knowledge, this is the first application of MMD to word sense change detection. Empirical assessment results demonstrate the effectiveness of the proposed approach.
zh

[NLP-52] Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models ACL2025

【速读】: 该论文试图解决如何使编码器模型(如BERT和RoBERTa)在零样本跨语言任务中达到与生成式大语言模型(Large Language Models, LLMs)相当的性能,同时降低计算和内存成本的问题。解决方案的关键在于采用Statement Tuning方法,将任务重新表述为有限模板,从而提升编码器模型的零样本泛化能力,并验证其在多语言自然语言处理中的有效性。

链接: https://arxiv.org/abs/2506.01592
作者: Ahmed Elshabrawy,Thanh-Nhi Nguyen,Yeeun Kang,Lihan Feng,Annant Jain,Faadil Abdullah Shaikh,Jonibek Mansurov,Mohamed Fazli Mohamed Imam,Jesus-German Ortiz-Barajas,Rendi Chevi,Alham Fikri Aji
机构: MBZUAI(穆巴达拉科学技术大学); UIT, Vietnam(越南信息技术大学); VNU-HCM(越南国家大学胡志明市分校); Yale University(耶鲁大学); NYU Shanghai(纽约大学上海校区); IIT Bombay(印度理工学院孟买分校)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 (Findings)

点击查看摘要

Abstract:Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models like BERT and RoBERTa has been challenging due to their architecture. However, encoders offer advantages such as lower computational and memory costs. Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite templates. We extend this approach to multilingual NLP, exploring whether encoders can achieve zero-shot cross-lingual generalization and serve as efficient alternatives to memory-intensive LLMs for low-resource languages. Our results show that state-of-the-art encoder models generalize well across languages, rivaling multilingual LLMs while being more efficient. We also analyze multilingual Statement Tuning dataset design, efficiency gains, and language-specific generalization, contributing to more inclusive and resource-efficient NLP models. We release our code and models.
zh

[NLP-53] Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings

【速读】: 该论文试图解决在资源匮乏语言(如乌尔都语)中虚假新闻检测(Fake News Detection, FND)面临的挑战,尤其是由于缺乏大规模、准确标注的数据集和经过验证的词汇资源。解决方案的关键在于构建一个可靠、专家验证且领域无关的乌尔都语增强型FND数据集,并利用多种先进的预训练大型语言模型(LLMs)进行评估与优化,最终提出一种统一的LLM模型以提升检测性能。

链接: https://arxiv.org/abs/2506.01587
作者: Muhammad Islam,Javed Ali Khan,Mohammed Abaker,Ali Daud,Azeem Irshad
机构: James Cook University (詹姆斯库克大学); University of Hertfordshire (赫特福德大学); King Khalid University (国王 Khalid 大学); Rabdan Academy (拉布丹学院); GGC Asghar Mall (GGC 阿斯加尔商场)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid expansion of social media platforms has significantly increased the dissemination of forged content and misinformation, making the detection of fake news a critical area of research. Although fact-checking efforts predominantly focus on English-language news, there is a noticeable gap in resources and strategies to detect news in regional languages, such as Urdu. Advanced Fake News Detection (FND) techniques rely heavily on large, accurately labeled datasets. However, FND in under-resourced languages like Urdu faces substantial challenges due to the scarcity of extensive corpora and the lack of validated lexical resources. Current Urdu fake news datasets are often domain-specific and inaccessible to the public. They also lack human verification, relying mainly on unverified English-to-Urdu translations, which compromises their reliability in practical applications. This study highlights the necessity of developing reliable, expert-verified, and domain-independent Urdu-enhanced FND datasets to improve fake news detection in Urdu and other resource-constrained languages. This paper presents the first benchmark large FND dataset for Urdu news, which is publicly available for validation and deep analysis. We also evaluate this dataset using multiple state-of-the-art pre-trained large language models (LLMs), such as XLNet, mBERT, XLM-RoBERTa, RoBERTa, DistilBERT, and DeBERTa. Additionally, we propose a unified LLM model that outperforms the others with different embedding and feature extraction techniques. The performance of these models is compared based on accuracy, F1 score, precision, recall, and human judgment for vetting the sample results of news.
zh

[NLP-54] Prompt Engineering Large Language Models Forecasting Capabilities

【速读】: 该论文试图解决在复杂任务如预测(forecasting)中,通过提示工程(prompt engineering)是否能够显著提升大型语言模型(Large Language Model, LLM)性能的问题。其关键在于评估不同提示策略对模型预测准确性的影响,发现简单的提示修改通常只能带来微不足道的提升,甚至某些策略(如鼓励贝叶斯推理)反而导致准确率下降,表明在复杂任务中,仅依靠基础提示优化难以实现显著性能改进,需依赖更稳健或专门的技术。

链接: https://arxiv.org/abs/2506.01578
作者: Philipp Schoenegger,Cameron R. Jones,Philip E. Tetlock,Barbara Mellers
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt engineering is significantly cheaper and often works for simpler tasks, it remains unclear whether prompt engineering suffices for more complex domains like forecasting. Here we show that small prompt modifications rarely boost forecasting accuracy beyond a minimal baseline. In our first study, we tested 38 prompts across Claude 3.5 Sonnet, Claude 3.5 Haiku, GPT-4o, and Llama 3.1 405B. In our second, we introduced compound prompts and prompts from external sources, also including the reasoning models o1 and o1-mini. Our results show that most prompts lead to negligible gains, although references to base rates yield slight benefits. Surprisingly, some strategies showed strong negative effects on accuracy: especially encouraging the model to engage in Bayesian reasoning. These results suggest that, in the context of complex tasks like forecasting, basic prompt refinements alone offer limited gains, implying that more robust or specialized techniques may be required for substantial performance improvements in AI forecasting.
zh

[NLP-55] Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation

【速读】: 该论文旨在解决现有视觉-语言模型(VLMs)在文化理解任务中对时间维度关注不足的问题,尤其是针对传统文化遗产在时间演变中的动态特征缺乏有效建模。解决方案的关键在于构建Hanfu-Bench,这是一个由专家精心设计的多模态数据集,通过两个核心任务——文化视觉理解和文化图像转译——来评估模型在时间文化特征识别与传统服饰现代化转换方面的能力,从而为 temporal cultural understanding 和 creative adaptation 提供新的研究方向和基准测试平台。

链接: https://arxiv.org/abs/2506.01565
作者: Li Zhou,Lutong Yu,Dongchu Xie,Shaohuan Cheng,Wenyan Li,Haizhou Li
机构: The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)); Shenzhen Research Institute of Big Data(深圳市大数据研究院); Chengdu Technological University(成都工业学院); University of Copenhagen(哥本哈根大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: cultural analysis, cultural visual understanding, cultural image transcreation

点击查看摘要

Abstract:Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image this http URL former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs through cultural element inheritance and modern context adaptation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cutural understanding but fall short by 10% to human experts, while open VLMs lags further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.
zh

[NLP-56] EvolveNav: Self-Improving Embodied Reasoning for LLM -Based Vision-Language Navigation

【速读】: 该论文旨在解决基于生成式 AI (Generative AI) 的视觉-语言导航 (Vision-Language Navigation, VLN) 代理在导航决策准确性与可解释性方面的不足。现有方法主要采用直接输入-输出映射范式,导致映射学习困难且导航决策缺乏可解释性。论文提出的解决方案关键在于构建一个自改进的具身推理框架 EvolveNav,其核心包含两个阶段:形式化链式思维 (Chain-of-Thought, CoT) 监督微调,通过形式化 CoT 标签激活模型的导航推理能力并提升推理速度;以及自我反思后训练,利用模型自身的推理输出作为自增强的 CoT 标签,增强监督多样性,并引入自我反思辅助任务以对比正确与错误推理模式,从而提升导航性能。

链接: https://arxiv.org/abs/2506.01551
作者: Bingqian Lin,Yunshuang Nie,Khun Loun Zai,Ziming Wei,Mingfei Han,Rongtao Xu,Minzhe Niu,Jianhua Han,Liang Lin,Cewu Lu,Xiaodan Liang
机构: Shanghai Jiao Tong University Sun Yat-sen University Mohamed bin Zayed University of Artificial Intelligence Huawei Noah’s Ark Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs’ reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs’ training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model’s navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at this https URL.
zh

[NLP-57] Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries ACL2025

【速读】: 该论文试图解决在低资源语言中适应预训练语言模型的问题,具体是通过跨语言词汇迁移来提升模型对新语言的适配能力。现有方法在利用单语或平行语料时面临挑战,而该研究提出了一种基于双语词典的简单但有效的词汇迁移方法。其关键在于利用BPE分词器的一个特性:当从词汇表中移除一个子词时,会回退到更短的子词。通过逐步移除子词并迭代估计目标子词的嵌入表示,该方法在实验中表现出优于现有方法的效果,证明了基于词典的跨语言词汇迁移的有效性。

链接: https://arxiv.org/abs/2506.01535
作者: Haruki Sakajo,Yusuke Ide,Justin Vasselli,Yusuke Sakai,Yingtao Tian,Hidetaka Kamigaito,Taro Watanabe
机构: Nara Institute of Science and Technology (NAIST); Sakana AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
zh

[NLP-58] STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework ACL2025

【速读】: 该论文旨在解决现有数学数据集在挑战性内容不足、缺乏人类类似推理过程以及可靠性受限于单一大语言模型生成等方面的缺陷。其解决方案的关键在于提出一种基于“人机协同”的多智能体数据生成框架,结合高密度推理过滤、多智能体协作和数学家评估,以确保数据集的可靠性和质量,并构建了一个包含2,000个合成样本及100个最难问题的超挑战性数学推导数据集STORM-BORN,该数据集具有密集的人类类似近似和启发式提示。

链接: https://arxiv.org/abs/2506.01531
作者: Wenhao Liu,Zhenyi Lu,Xinyu Hu,Jierui Zhang,Dailin Li,Jiacheng Cen,Huilin Cao,Haiteng Wang,Yuhan Li,Kun Xie,Dandan Li,Pei Zhang,Chengbo Zhang,Yuxiang Ren,Xiaohong Huang,Yan Ma
机构: 未知
类目: Computation and Language (cs.CL)
备注: accepted by ACL2025

点击查看摘要

Abstract:High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficient challenging content, neglecting human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce \textbfSTORM-BORN , an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure the reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians’ evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even most advanced models like GPT-o1 solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at this https URL.
zh

[NLP-59] V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat

【速读】: 该论文旨在解决基于大型语言模型(Large Language Model, LLM)的对话系统在生成符合特定角色或个性特征的回复时,因依赖静态角色描述、粗粒度信号空间和低质量合成数据而导致的动态细粒度细节捕捉不足的问题。其解决方案的关键在于提出一种可解释的变分自编码框架(Verbal Variational Auto-Encoding, V-VAE),该框架包含变分自编码模块和细粒度控制空间,能够根据对话风格、交互模式和个人属性等细粒度可解释潜在变量动态调整对话行为。

链接: https://arxiv.org/abs/2506.01524
作者: Qi Lin,Weikai Xu,Lisi Chen,Bin Dai
机构: University of Electronic Science and Technology of China (电子科技大学); Xaiobing.AI (Xaiobing.AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.
zh

[NLP-60] FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

【速读】: 该论文试图解决在线表单填写(online form filling)任务中自动化程度低的问题,该任务涉及大量键盘和鼠标交互,现有工具主要依赖规则且缺乏泛化生成能力。解决方案的关键在于提出FormFactory,这是一个交互式基准测试套件,包含基于网络的界面、后端评估模块和精心构建的数据集,旨在全面评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在表单填写任务中的性能,并揭示其在视觉布局推理和字段值对齐方面的局限性。

链接: https://arxiv.org/abs/2506.01520
作者: Bobo Li,Yuheng Wang,Hao Fei,Juncheng Li,Wei Ji,Mong-Li Lee,Wynne Hsu
机构: National University of Singapore (新加坡国立大学); Wuhan University (武汉大学); Zhejiang University (浙江大学); Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 7 figures

点击查看摘要

Abstract:Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with “one click”, existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models’ visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
zh

[NLP-61] Representations of Fact Fiction and Forecast in Large Language Models : Epistemics and Attitudes ACL2025

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在不确定现实环境中基于事实评估和置信度生成相应话语的挑战,特别是其在生成认知模态表达(epistemic expressions)方面的表现有限且不够稳健的问题。解决方案的关键在于通过语义知识的丰富化来提升LLMs对认知模态的理解与表达能力,具体方法是借助认知表达的类型学框架,利用受控故事对LLMs的认知模态知识进行评估。

链接: https://arxiv.org/abs/2506.01512
作者: Meng Li,Michael Vrazitulis,David Schlangen
机构: University of Potsdam (波茨坦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted by ACL 2025 (main)

点击查看摘要

Abstract:Rational speakers are supposed to know what they know and what they do not know, and to generate expressions matching the strength of evidence. In contrast, it is still a challenge for current large language models to generate corresponding utterances based on the assessment of facts and confidence in an uncertain real-world environment. While it has recently become popular to estimate and calibrate confidence of LLMs with verbalized uncertainty, what is lacking is a careful examination of the linguistic knowledge of uncertainty encoded in the latent space of LLMs. In this paper, we draw on typological frameworks of epistemic expressions to evaluate LLMs’ knowledge of epistemic modality, using controlled stories. Our experiments show that the performance of LLMs in generating epistemic expressions is limited and not robust, and hence the expressions of uncertainty generated by LLMs are not always reliable. To build uncertainty-aware LLMs, it is necessary to enrich semantic knowledge of epistemic modality in LLMs.
zh

[NLP-62] Continual Speech Learning with Fused Speech Features INTERSPEECH2025

【速读】: 该论文试图解决传统静态方法在面对快速增长的语音数据时无法适应动态和多样化的语音信息的问题。其解决方案的关键在于引入连续语音学习(continuous speech learning),通过使用编码器-解码器结构的Whisper模型将语音任务标准化为生成格式,并在编码器顶部集成可学习的门控融合层,以动态选择下游任务所需的特定特征,从而显著提升在六个语音处理任务中的准确性,实现无需完全重新训练即可适应新语音任务的效果。

链接: https://arxiv.org/abs/2506.01496
作者: Guitao Wang,Jinming Zhao,Hao Yang,Guilin Qi,Tongtong Wu,Gholamreza Haffari
机构: Southeast University (东南大学); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2025

点击查看摘要

Abstract:Rapid growth in speech data demands adaptive models, as traditional static methods fail to keep pace with dynamic and diverse speech information. We introduce continuous speech learning, a new set-up targeting at bridging the adaptation gap in current speech models. We use the encoder-decoder Whisper model to standardize speech tasks into a generative format. We integrate a learnable gated-fusion layer on the top of the encoder to dynamically select task-specific features for downstream tasks. Our approach improves accuracy significantly over traditional methods in six speech processing tasks, demonstrating gains in adapting to new speech tasks without full retraining.
zh

[NLP-63] CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在价值对齐方面存在的问题,特别是西方文化偏见和依赖非本土规则的框架导致的评估不足与成本高昂的问题。其解决方案的关键在于构建一个基于核心中国价值观的分层价值框架,包含三个主要维度、12个核心价值和50个衍生价值,并在此基础上建立大规模的中国价值观语料库(Chinese Values Corpus, CVC),通过人工标注增强其覆盖范围与准确性。实验结果表明,CVC引导的场景在价值边界和内容多样性上优于直接生成的场景,且在多个敏感主题上的评估中,主流LLMs与CVC的匹配度超过70.5%,验证了该框架的文化相关性与有效性。

链接: https://arxiv.org/abs/2506.01495
作者: Ping Wu,Guobin Shen,Dongcheng Zhao,Yuwei Wang,Yiting Dong,Yu Shi,Enmeng Lu,Feifei Zhao,Yi Zeng
机构: BrainCog Lab, CASIA (脑认知与类脑智能实验室,中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Beijing Institute of AI Safety and Governance (北京人工智能安全与治理研究院); Long-term AI Lab (长期人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at this https URL, and the code is available at this https URL.
zh

[NLP-64] Multilingual Definition Modeling

【速读】: 该论文试图解决多语言定义建模(definition modeling)问题,旨在探索预训练多语言语言模型在单义词定义任务中的表现及其跨语言协同潜力。解决方案的关键在于利用四种新语言(西班牙语、法语、葡萄牙语和德语)的单语词典数据进行微调,并通过零样本方法测试两种流行聊天式大语言模型(Large Language Models, LLMs)的多语言能力,从而评估其在定义生成任务中的有效性与局限性。

链接: https://arxiv.org/abs/2506.01489
作者: Edison Marrese-Taylor,Erica K. Shimomoto,Alfredo Solano,Enrique Reid
机构: National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所); The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we propose the first multilingual study on definition modeling. We use monolingual dictionary data for four new languages (Spanish, French, Portuguese, and German) and perform an in-depth empirical study to test the performance of pre-trained multilingual language models on definition modeling of monosemic words when finetuned on this data. Furthermore, we use a zero-shot approach to test the multilingual capabilities of two popular chat-based Large Language Models (LLMs) in the task. Results show that multilingual language models can perform on-pair with English but cannot leverage potential cross-lingual synergies, with LLMs generally offering better performance overall. A comprehensive human evaluation of the LLM-generated definition highlights the zero and few-shot capabilities of these models in this new task, also showing their shortcomings. Finally, we show that performance on our task via BERTScore strongly correlates to the performance on multilingual LLM benchmarks, suggesting that our task offers a viable compute-constrained, stable and natural alternative to these.
zh

[NLP-65] Argument-Centric Causal Intervention Method for Mitigating Bias in Cross-Document Event Coreference Resolution

【速读】: 该论文旨在解决跨文档事件共指消解(Cross-document Event Coreference Resolution, CD-ECR)中因依赖输入提及对中的触发特征而产生的虚假相关性问题,这些问题会损害模型的整体性能。其解决方案的关键在于提出一种基于论元中心因果干预(Argument-Centric Causal Intervention, ACCI)的方法,通过构建结构因果图来揭示词汇触发词与共指标签之间的混淆依赖关系,并引入后门调整干预以隔离论元语义的真实因果效应。

链接: https://arxiv.org/abs/2506.01488
作者: Long Yao,Wenzhong Yang,Yabo Yin,Fuyuan Wei,Hongzhen Lv,Jiaren Peng,Liejun Wang,Xiaoming Tao
机构: Xinjiang University(新疆大学); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Cross-document Event Coreference Resolution (CD-ECR) is a fundamental task in natural language processing (NLP) that seeks to determine whether event mentions across multiple documents refer to the same real-world occurrence. However, current CD-ECR approaches predominantly rely on trigger features within input mention pairs, which induce spurious correlations between surface-level lexical features and coreference relationships, impairing the overall performance of the models. To address this issue, we propose a novel cross-document event coreference resolution method based on Argument-Centric Causal Intervention (ACCI). Specifically, we construct a structural causal graph to uncover confounding dependencies between lexical triggers and coreference labels, and introduce backdoor-adjusted interventions to isolate the true causal effect of argument semantics. To further mitigate spurious correlations, ACCI integrates a counterfactual reasoning module that quantifies the causal influence of trigger word perturbations, and an argument-aware enhancement module to promote greater sensitivity to semantically grounded information. In contrast to prior methods that depend on costly data augmentation or heuristic-based filtering, ACCI enables effective debiasing in a unified end-to-end framework without altering the underlying training procedure. Extensive experiments demonstrate that ACCI achieves CoNLL F1 of 88.4% on ECB+ and 85.2% on GVC, achieving state-of-the-art performance. The implementation and materials are available at this https URL.
zh

[NLP-66] LLM in the Loop: Creating the PARADEHATE Dataset for Hate Speech Detoxification

【速读】: 该论文试图解决有害语言(toxic language)净化任务中高质量平行数据稀缺的问题,尤其是在仇恨言论(hate speech)领域,由于人工标注的成本和敏感性导致数据不足。其解决方案的关键在于提出一种基于大语言模型(LLM)的“人在回路”(LLM-in-the-loop)流程,利用GPT-4o-mini自动完成净化任务,并在此基础上构建了PARADEHATE数据集,作为大规模仇恨言论净化的基准数据集。实验表明,该方法生成的净化文本在风格准确性、内容保留和流畅性方面表现优异,证明了LLM生成文本作为可扩展的人工标注替代方案的有效性。

链接: https://arxiv.org/abs/2506.01484
作者: Shuzhou Yuan,Ercong Nie,Lukas Kouba,Ashish Yashwanth Kangen,Helmut Schmid,Hinrich Schutze,Michael Farber
机构: ScaDS.AI and TU Dresden (ScaDS.AI 和德累斯顿工业大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct PARADEHATE, a large-scale parallel dataset specifically for hatespeech detoxification. We release PARADEHATE as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on PARADEHATE, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.
zh

[NLP-67] MUDI: A Multimodal Biomedical Dataset for Understanding Pharmacodynamic Drug-Drug Interactions

【速读】: 该论文旨在解决药物相互作用(Drug-Drug Interaction, DDI)研究中现有数据集主要依赖文本信息而忽视多模态数据的问题,从而更全面地理解复杂的药物机制。其解决方案的关键在于引入MUDI,这是一个大规模的多模态生物医学数据集,通过整合药理学文本、化学式、分子结构图和图像等多源信息,对310,532个标注为协同作用、拮抗作用或新效应的药物对进行表征,并在测试集中包含未见过的药物对以评估机器学习模型的泛化能力。

链接: https://arxiv.org/abs/2506.01478
作者: Tung-Lam Ngo,Ba-Hoang Tran,Duy-Cat Can,Trung-Hieu Do,Oliver Y. Chén,Hoang-Quynh Le
机构: VNU University of Engineering and Technology (VNU-UET); Lausanne University Hospital (CHUV); University of Lausanne (UNIL); Hanoi Medical University; National Geriatric Hospital, Hanoi
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Understanding the interaction between different drugs (drug-drug interaction or DDI) is critical for ensuring patient safety and optimizing therapeutic outcomes. Existing DDI datasets primarily focus on textual information, overlooking multimodal data that reflect complex drug mechanisms. In this paper, we (1) introduce MUDI, a large-scale Multimodal biomedical dataset for Understanding pharmacodynamic Drug-drug Interactions, and (2) benchmark learning methods to study it. In brief, MUDI provides a comprehensive multimodal representation of drugs by combining pharmacological text, chemical formulas, molecular structure graphs, and images across 310,532 annotated drug pairs labeled as Synergism, Antagonism, or New Effect. Crucially, to effectively evaluate machine-learning based generalization, MUDI consists of unseen drug pairs in the test set. We evaluate benchmark models using both late fusion voting and intermediate fusion strategies. All data, annotations, evaluation scripts, and baselines are released under an open research license.
zh

[NLP-68] PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization ACL’25

【速读】: 该论文旨在解决现有大型语言模型(Large Language Model, LLM)代理在处理复杂交互问题时,依赖自然语言(Natural Language, NL)计划导致的冗长、低效以及泛化能力受限的问题。其解决方案的关键在于引入伪代码风格的计划(P-code Plan),通过捕捉推理的结构化逻辑,提升LLM代理的泛化能力和效率。基于此,作者提出了一种称为PGPO(Pseudocode-style Planning Guided Preference Optimization)的方法,结合两种面向规划的奖励机制,进一步增强LLM代理生成高质量P-code Plan及后续推理的能力。

链接: https://arxiv.org/abs/2506.01475
作者: Zouying Cao,Runze Wang,Yifei Yang,Xinbei Ma,Xiaoyong Zhu,Bo Zheng,Hai Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 12 figures, 14 tables, ACL’25 Findings

点击查看摘要

Abstract:Large Language Model (LLM) agents have demonstrated impressive capabilities in handling complex interactive problems. Existing LLM agents mainly generate natural language plans to guide reasoning, which is verbose and inefficient. NL plans are also tailored to specific tasks and restrict agents’ ability to generalize across similar tasks. To this end, we explore pseudocode-style plans (P-code Plan) to capture the structural logic of reasoning. We find that P-code Plan empowers LLM agents with stronger generalization ability and more efficiency. Inspired by this finding, we propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents’ ability to generate high-quality P-code Plans and subsequent reasoning. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines. Analyses reveal the advantage of PGPO in reducing action errors and omissions during reasoning.
zh

[NLP-69] Integrating Neural and Symbolic Components in a Model of Prag matic Question-Answering

【速读】: 该论文试图解决传统计算模型在语用语言使用中的局限性,即依赖于手动指定的言语和意义集合,从而限制了其在现实语言使用中的适用性。解决方案的关键在于提出一种神经符号框架,通过整合基于大语言模型(LLM)的模块来提出和评估自然语言中的关键组件,从而消除对人工规范的依赖。该框架通过将神经模块融入认知模型,系统地探索了多种方法,包括评估效用、字面语义、生成替代言语和目标等,结果显示混合模型在预测人类回答模式方面可以达到或超越传统概率模型的性能。然而,神经符号模型的成功关键在于LLM的集成方式,其在生成替代方案和将抽象目标转化为效用方面表现尤为有效,但在真值条件语义评估方面仍面临挑战。

链接: https://arxiv.org/abs/2506.01474
作者: Polina Tsvilodub,Robert D. Hawkins,Michael Franke
机构: University of Tübingen (图宾根大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 16 figures. To appear in the proceedings of Society for Computation in Linguistics (SCiL) 2025

点击查看摘要

Abstract:Computational models of pragmatic language use have traditionally relied on hand-specified sets of utterances and meanings, limiting their applicability to real-world language use. We propose a neuro-symbolic framework that enhances probabilistic cognitive models by integrating LLM-based modules to propose and evaluate key components in natural language, eliminating the need for manual specification. Through a classic case study of pragmatic question-answering, we systematically examine various approaches to incorporating neural modules into the cognitive model – from evaluating utilities and literal semantics to generating alternative utterances and goals. We find that hybrid models can match or exceed the performance of traditional probabilistic models in predicting human answer patterns. However, the success of the neuro-symbolic model depends critically on how LLMs are integrated: while they are particularly effective for proposing alternatives and transforming abstract goals into utilities, they face challenges with truth-conditional semantic evaluation. This work charts a path toward more flexible and scalable models of pragmatic language use while illuminating crucial design considerations for balancing neural and symbolic components.
zh

[NLP-70] alTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge INTERSPEECH2025

【速读】: 该论文旨在解决多语言语音识别与语言识别的挑战,特别是在Interspeech 2025 ML-SUPERB 2.0 Challenge中的任务。其关键解决方案是一个混合的语言识别系统,该系统结合了预训练的语言嵌入模型和轻量级的语音识别模型,其中语音识别模型采用跨语言共享编码器,并结合语言特定的二元组语言模型。在语音识别方面,根据训练数据的可用性和保留数据上的性能,针对每种语言应用单一模型,所使用的模型包括微调后的SeamlessM4T、带有自定义语言适配器的MMS-1B-all以及MMS-zeroshot。

链接: https://arxiv.org/abs/2506.01458
作者: Tanel Alumäe,Artem Fedorchenko
机构: TalTech(塔尔图大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:This paper describes the language identification and multilingual speech recognition system developed at Tallinn University of Technology for the Interspeech 2025 ML-SUPERB 2.0 Challenge. A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages and language-specific bigram language models. For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data. The model set consists of a finetuned version of SeamlessM4T, MMS-1B-all with custom language adapters and MMS-zeroshot. The system obtained the top overall score in the challenge.
zh

[NLP-71] Building Entity Association Mining Framework for Knowledge Discovery

【速读】: 该论文试图解决从非结构化文本中提取有用信号或模式以支持重要商业决策的问题,例如分析投资产品吸引力、发现客户偏好、风险监控等。其解决方案的关键在于提出一个领域无关的通用框架,该框架包含三个主要组件:文档过滤、可配置的实体抽取流程以及关联关系挖掘。该框架通过集成多种实体抽取技术(如DBpedia Spotlight、Spacy NER、自定义实体匹配器和基于词组提取的字典方法)以及生成共现图进行潜在关系分析,从而支持文本挖掘业务用例,并提供量化评分指标用于排序。

链接: https://arxiv.org/abs/2506.01451
作者: Anshika Rawal,Abhijeet Kumar,Mridul Mishra
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Presented at Business Analytics and Intelligence Conference, IIM Bengaluru

点击查看摘要

Abstract:Extracting useful signals or pattern to support important business decisions for example analyzing investment product traction and discovering customer preference, risk monitoring etc. from unstructured text is a challenging task. Capturing interaction of entities or concepts and association mining is a crucial component in text mining, enabling information extraction and reasoning over and knowledge discovery from text. Furthermore, it can be used to enrich or filter knowledge graphs to guide exploration processes, descriptive analytics and uncover hidden stories in the text. In this paper, we introduce a domain independent pipeline i.e., generalized framework to enable document filtering, entity extraction using various sources (or techniques) as plug-ins and association mining to build any text mining business use-case and quantitatively define a scoring metric for ranking purpose. The proposed framework has three major components a) Document filtering: filtering documents/text of interest from massive amount of texts b) Configurable entity extraction pipeline: include entity extraction techniques i.e., i) DBpedia Spotlight, ii) Spacy NER, iii) Custom Entity Matcher, iv) Phrase extraction (or dictionary) based c) Association Relationship Mining: To generates co-occurrence graph to analyse potential relationships among entities, concepts. Further, co-occurrence count based frequency statistics provide a holistic window to observe association trends or buzz rate in specific business context. The paper demonstrates the usage of framework as fundamental building box in two financial use-cases namely brand product discovery and vendor risk monitoring. We aim that such framework will remove duplicated effort, minimize the development effort, and encourage reusability and rapid prototyping in association mining business applications for institutions.
zh

[NLP-72] Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

【速读】: 该论文旨在解决大规模语音识别模型的构建与优化问题,以提升模型在不同说话风格和声学条件下的鲁棒性与识别准确率。其解决方案的关键在于融合了w2v-BERT自监督模型、基于E-Branchformer的编码器-解码器主干架构以及联合CTC-注意力解码策略,并利用广泛且多样的训练语料库(包括公开数据集和内部数据),从而实现性能的显著提升。

链接: https://arxiv.org/abs/2506.01439
作者: Yosuke Kashiwagi,Hayato Futami,Emiru Tsunoo,Satoshi Asakawa
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale’s architecture integrates w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data, of not only public corpora but also in-house data, thereby enhancing the model’s robustness to different speaking styles and acoustic conditions. Through evaluations on multiple benchmarks, Whale achieved comparable performance to existing models. In particular, it achieves a word error rate of 2.4% on the Librispeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
zh

[NLP-73] Redundancy Isotropy and Intrinsic Dimensionality of Prompt-based Text Embeddings ACL2025

【速读】: 该论文试图解决生成式 AI (Generative AI) 生成的文本嵌入在维度较高时带来的存储和计算成本过高的问题。其解决方案的关键在于通过后处理的降维方法减少嵌入维度,同时保持任务性能的稳定性,实验表明即使大幅降低维度,分类、聚类、检索和语义文本相似性等任务的性能下降仍非常有限,证明了这些嵌入具有高度冗余性。

链接: https://arxiv.org/abs/2506.01435
作者: Hayato Tsukagoshi,Ryohei Sasano
机构: Nagoya University (名古屋大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.
zh

[NLP-74] Self-Refining Language Model Anonymizers via Adversarial Distillation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在敏感领域应用中因从看似无害的文本中推断个人数据而带来的隐私风险问题。现有基于LLM的匿名化方法虽然有效,但通常依赖于昂贵的专有模型(如GPT-4),这引发了成本和敏感数据暴露于不可信外部系统的担忧。论文提出的解决方案是SElf-refining Anonymization with Language model (SEAL),其关键在于通过对抗性交互收集匿名化文本和推断属性的轨迹,并利用监督微调和偏好学习将匿名化、对抗性推理和效用评估能力蒸馏到小型语言模型(Small Language Models, SLMs)中,从而实现无需依赖外部模型的高效匿名化。

链接: https://arxiv.org/abs/2506.01420
作者: Kyuyoung Kim,Hyunjun Jeon,Jinwoo Shin
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text poses emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external costly models at inference time. We leverage adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are used to distill anonymization, adversarial inference, and utility evaluation capabilities into SLMs via supervised fine-tuning and preference learning. The resulting models learn to both anonymize text and critique their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy. These results show the effectiveness of our adversarial distillation framework in training SLMs as efficient anonymizers. To facilitate further research, we release the full dataset used in our experiments.
zh

[NLP-75] UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

【速读】: 该论文试图解决多语言环境下语言熟练度评估与文本可读性自动分析的问题,其解决方案的关键在于构建了一个大规模、多维度的多语言文本数据集UniversalCEFR,该数据集根据CEFR(Common European Framework of Reference)尺度进行标注,并在13种语言中标准化为统一的数据格式,以支持跨任务和跨语言的一致处理与建模。

链接: https://arxiv.org/abs/2506.01419
作者: Joseph Marvin Imperial,Abdullah Barayan,Regina Stodden,Rodrigo Wilkens,Ricardo Munoz Sanchez,Lingyun Gao,Melissa Torgbi,Dawn Knight,Gail Forey,Reka R. Jablonkai,Ekaterina Kochmar,Robert Reynolds,Eugenio Ribeiro,Horacio Saggion,Elena Volodina,Sowmya Vajjala,Thomas Francois,Fernando Alva-Manchego,Harish Tayyar Madabushi
机构: University of Bath(巴斯大学); Cardiff University(卡迪夫大学); National University Philippines(菲律宾国家大学); Bielefeld University(比勒费尔德大学); University of Exeter(埃克塞特大学); University of Gothenburg(哥德堡大学); UCLouvain(卢万大学); MBZUAI(MBZUAI); Brigham Young University(杨百翰大学); INESC-ID Lisboa(里斯本INSEID研究所); Instituto Universitário de Lisboa (ISCTE-IUL)(里斯本大学学院(ISCTE-IUL)); Universitat Pompeu Fabra(庞佩乌法布拉大学); National Research Council, Canada(加拿大国家研究委员会); King Abdulaziz University(阿卜杜勒阿齐兹国王大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
zh

[NLP-76] Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)在处理包含多重约束且结构复杂(如并行、链式和分支结构)的指令时表现不佳的问题。研究发现,传统的思维链(Chain-of-Thought, CoT)方法由于仅进行表面化的推理,即简单地改写指令,反而对性能产生负面影响,无法深入解析约束之间的层次与维度关系。解决方案的关键在于通过激励推理以实现测试时计算规模的扩展,具体包括:基于现有分类体系对复杂指令进行分解,并提出可复现的数据获取方法;利用可验证的规则中心奖励信号进行强化学习,培养针对指令遵循的特定推理能力;通过样本级对比提升CoT的执行效果;以及通过专家行为克隆促进从快速思考模型到熟练推理者的分布迁移。

链接: https://arxiv.org/abs/2506.01413
作者: Yulei Qin,Gang Li,Zongyi Li,Zihan Xu,Yuchen Shi,Zhekai Lin,Xiao Cui,Ke Li,Xing Sun
机构: Tencent YouTu Lab(腾讯优图实验室); Xiamen University(厦门大学); The Chinese University of Hong Kong(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages of main body, 3 tables, 5 figures, 40 pages of appendix

点击查看摘要

Abstract:Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to a 8B LLM. Codes and data are available at this https URL.
zh

[NLP-77] Comparing LLM -generated and human-authored news text using formal syntactic theory ACL-2025

【速读】: 该论文试图解决的问题是区分由大型语言模型(Large Language Models, LLMs)生成的《纽约时报》风格文本与真实的人类撰写的《纽约时报》文本。其解决方案的关键在于基于形式句法理论,采用头驱动短语结构语法(Head-driven Phrase Structure Grammar, HPSG)对文本的语法结构进行分析,并通过研究HPSG语法类型的分布差异,揭示人类写作与LLM生成文本之间的系统性区别。

链接: https://arxiv.org/abs/2506.01407
作者: Olga Zamaraeva,Dan Flickinger,Francis Bond,Carlos Gómez-Rodríguez
机构: Universidade da Coruña, CITIC (@udc.es); Independent Researcher (@alumni.stanford.edu); Palacký University at Olomouc, Department of Asian Studies (@upol.cz)
类目: Computation and Language (cs.CL)
备注: 20 pages, 15 figures, 13 tables; accepted to ACL-2025 main

点击查看摘要

Abstract:This study provides the first comprehensive comparison of New York Times-style text generated by six large language models against real, human-authored NYT writing. The comparison is based on a formal syntactic theory. We use Head-driven Phrase Structure Grammar (HPSG) to analyze the grammatical structure of the texts. We then investigate and illustrate the differences in the distributions of HPSG grammar types, revealing systematic distinctions between human and LLM-generated writing. These findings contribute to a deeper understanding of the syntactic behavior of LLMs as well as humans, within the NYT genre.
zh

[NLP-78] Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages

【速读】: 该论文旨在解决低资源语言对(如土耳其语和普什图语与法语之间)在人机对话中自动语音到语音翻译的质量问题。其关键解决方案是通过收集微调和测试数据,对比不同系统(包括本地模型和云服务商业模型),并利用自动评估指标(BLEU、COMET 和 BLASER)及人工评估来优化端到端的语音识别、机器翻译和语音合成流程,最终确定每个方向的最佳管道组合。研究还发现,各组件的性能排名通常独立于整个流水线的其他部分。

链接: https://arxiv.org/abs/2506.01406
作者: Andrei Popescu-Belis,Alexis Allemann,Teo Ferrari,Gopal Krishnamani
机构: HEIG-VD / HES-SO (HEIG-VD / HES-SO); Bhaasha Sàrl (Bhaasha Sàrl); EPFL (EPFL)
类目: Computation and Language (cs.CL)
备注: Proceedings of MT Summit 2025

点击查看摘要

Abstract:The popularity of automatic speech-to-speech translation for human conversations is growing, but the quality varies significantly depending on the language pair. In a context of community interpreting for low-resource languages, namely Turkish and Pashto to/from French, we collected fine-tuning and testing data, and compared systems using several automatic metrics (BLEU, COMET, and BLASER) and human assessments. The pipelines included automatic speech recognition, machine translation, and speech synthesis, with local models and cloud-based commercial ones. Some components have been fine-tuned on our data. We evaluated over 60 pipelines and determined the best one for each direction. We also found that the ranks of components are generally independent of the rest of the pipeline.
zh

[NLP-79] Agent CPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

【速读】: 该论文旨在解决在移动环境中通过图形用户界面(GUI)自动化任务的挑战,特别是针对现有训练数据噪声大、语义多样性不足、模型泛化能力差以及对非英语界面支持不足的问题。其解决方案的关键在于构建一个名为AgentCPM-GUI的8B参数GUI代理,该代理通过感知增强的接地意识预训练、高质量中英文轨迹的监督微调以及基于GRPO的强化微调来提升模型的感知、模仿和推理能力,同时引入紧凑的动作空间以实现低延迟的设备端执行。

链接: https://arxiv.org/abs/2506.01391
作者: Zhong Zhang,Yaxi Lu,Yikun Fu,Yupeng Huo,Shenzhi Yang,Yesai Wu,Han Si,Xin Cong,Haotian Chen,Yankai Lin,Jie Xie,Wei Zhou,Wang Xu,Yuanheng Zhang,Zhou Su,Zhongwu Zhai,Xiaoming Liu,Yudong Mei,Jianming Xu,Hongyan Tian,Chongyi Wang,Chi Chen,Yuan Yao,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: The project is available at this https URL

点击查看摘要

Abstract:The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lack semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooks the growing diversity of non-English applications such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching 96.9% Type-Match and 91.3% Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoint, and evaluation data.
zh

[NLP-80] AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation

【速读】: 该论文旨在解决对话式搜索中用户查询重构(query reformulation)的问题,即如何将模糊的对话式查询转化为独立且有效的搜索查询。现有方法在训练阶段或测试阶段的微调策略未能充分发挥其潜力,因此论文提出AdaRewriter,其关键在于通过测试时适应(test-time adaptation)引入一个基于结果监督的奖励模型,利用对比排序损失(contrastive ranking loss)训练轻量级奖励模型,在推理阶段选择最具有前景的重构结果。该方法能够在黑盒系统中有效运行,并在多个对话式搜索数据集上显著优于现有方法。

链接: https://arxiv.org/abs/2506.01381
作者: Yilong Lai,Jialong Wu,Zhenglin Wang,Deyu Zhou
机构: Southeast University (东南大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over the generated candidates via prompting shows impressive potential scaling capability. However, both the previous tuning methods (training time) and adaptation approaches (test time) can not fully unleash their benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation using an outcome-supervised reward model via test-time adaptation. By training a lightweight reward model with contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms the existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.
zh

[NLP-81] AI Scientists Fail Without Strong Implementation Capability

【速读】: 该论文试图解决当前人工智能科学家(AI Scientist)在执行科学验证流程方面存在的能力瓶颈问题,即其无法有效完成严谨实验并生成高质量科学论文的局限性。解决方案的关键在于提升AI Scientist系统在复杂工程任务中的执行能力,以弥补其在实现层面的差距。

链接: https://arxiv.org/abs/2506.01372
作者: Minjun Zhu,Qiujie Xie,Yixuan Weng,Jian Wu,Zhen Lin,Linyi Yang,Yue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Position

点击查看摘要

Abstract:The emergence of Artificial Intelligence (AI) Scientist represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with the generated research reports gaining acceptance at the ICLR 2025 workshop and ACL 2025, arguing that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may be imminent. Despite this substantial progress, AI Scientist has yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. Based on extensive quantitative evidence from existing benchmarks in complex engineering tasks and a systematic evaluation assess 28 research papers generated by five advanced AI Scientist systems, we argue that \textbfthe fundamental bottleneck for AI Scientists lies in their capability to execute the requisite verification procedures. Current AI Scientist systems lack the execution capabilities needed to execute rigorous experiments and produce high-quality scientific papers. To better illustrate the root cause of this \textbfimplementation gap, we provide an in-depth discussion on the fundamental limitations of AI Scientist. This position paper aims to call for the participants in the community to bridge the implementation gap.
zh

[NLP-82] MMD-Flagger: Leverag ing Maximum Mean Discrepancy to Detect Hallucinations

【速读】: 该论文试图解决生成式 AI (Generative AI) 在生成内容时容易产生与现实不符的幻觉(hallucination)问题,这限制了其在关键应用中的使用。解决方案的关键在于提出一种新的检测方法 MMD-Flagger,该方法基于最大均值差异(Maximum Mean Discrepancy, MMD),通过跟踪生成文档与不同温度参数下生成文档之间的 MMD 轨迹,从而识别大部分幻觉内容。

链接: https://arxiv.org/abs/2506.01367
作者: Kensuke Mitsuzawa,Damien Garreau
机构: Université Côte d’Azur, CNRS, LJAD, France (蔚蓝海岸大学,法国国家科学研究中心,LJAD,法国); Center for Artificial Intelligence and Data Science (CAIDAS) (人工智能与数据科学中心); Julius-Maximilans-Universität, Würzburg (朱利叶斯-马克斯米利安大学,维尔茨堡)
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become pervasive in our everyday life. Yet, a fundamental obstacle prevents their use in many critical applications: their propensity to generate fluent, human-quality content that is not grounded in reality. The detection of such hallucinations is thus of the highest importance. In this work, we propose a new method to flag hallucinated content, MMD-Flagger. It relies on Maximum Mean Discrepancy (MMD), a non-parametric distance between distributions. On a high-level perspective, MMD-Flagger tracks the MMD between the generated documents and documents generated with various temperature parameters. We show empirically that inspecting the shape of this trajectory is sufficient to detect most hallucinations. This novel method is benchmarked on two machine translation datasets, on which it outperforms natural competitors.
zh

[NLP-83] Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion INTERSPEECH2025

【速读】: 该论文旨在解决语音活动检测(Voice Activity Detection, VAD)中特征表示不足的问题,通过融合传统手工特征与预训练模型(Pre-trained Model, PTM)特征以提升检测性能。其解决方案的关键在于提出了一种统一框架FusionVAD,该框架采用三种融合策略(拼接、相加和交叉注意力)结合MFCCs与PTM特征,并发现简单的相加方法在准确性和效率上优于交叉注意力机制,验证了多特征融合的有效性与互补性。

链接: https://arxiv.org/abs/2506.01365
作者: Kumud Tripathi,Chowdam Venkata Kumar,Pankaj Wasnik
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL)
备注: Accepted at INTERSPEECH 2025, 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
zh

[NLP-84] KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors ACL2025

【速读】: 该论文旨在解决心理辅导对话生成中数据集质量不足的问题,特别是现有数据在多样性与真实性方面的局限性。为了解决这一问题,研究提出了一种角色扮演方法,其中受过训练的咨询师模拟咨询师与来访者之间的互动,从而确保对话的高质量并降低隐私风险。该方法的关键在于通过专业培训的咨询师生成真实且多样化的心理辅导对话,进而构建出高质量的数据集KokoroChat。

链接: https://arxiv.org/abs/2506.01357
作者: Zhiyang Qi,Takumasa Kaneko,Keiko Takamizo,Mariko Ukiyo,Michimasa Inaba
机构: The University of Electro-Communications (电波通信大学); Rapport Technologies, Inc. (ラポーティー・テクノロジーズ株式会社); iDEAR Human Support Service (iDEAR人間支援サービス); Japanese Organization of Mental Health and Educational Agencies (日本精神保健教育機関協会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Main Conference

点击查看摘要

Abstract:Generating psychological counseling responses with language models relies heavily on high-quality datasets. Crowdsourced data collection methods require strict worker training, and data from real-world counseling environments may raise privacy and ethical concerns. While recent studies have explored using large language models (LLMs) to augment psychological counseling dialogue datasets, the resulting data often suffers from limited diversity and authenticity. To address these limitations, this study adopts a role-playing approach where trained counselors simulate counselor-client interactions, ensuring high-quality dialogues while mitigating privacy risks. Using this method, we construct KokoroChat, a Japanese psychological counseling dialogue dataset comprising 6,589 long-form dialogues, each accompanied by comprehensive client feedback. Experimental results demonstrate that fine-tuning open-source LLMs with KokoroChat improves both the quality of generated counseling responses and the automatic evaluation of counseling dialogues. The KokoroChat dataset is available at this https URL.
zh

[NLP-85] he Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

【速读】: 该论文旨在解决如何有效训练语言模型(Language Models, LMs)以执行需要复杂推理的任务,特别是那些能够引发长链思维(Chain of Thoughts, CoTs)的任务。传统监督学习方法在这一领域存在局限性,因此本文提出了一种基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法。其解决方案的关键在于对学习信号进行分解,分别关注正样本强化(Positive Sample Reinforcement, PSR)和负样本强化(Negative Sample Reinforcement, NSR)。研究发现,仅使用负样本进行训练即可显著提升模型性能,甚至优于传统的PPO和GRPO方法,这表明抑制错误生成并重新分配概率质量至合理候选答案的过程在提升模型表现中起到了重要作用。

链接: https://arxiv.org/abs/2506.01347
作者: Xinyu Zhu,Mengzhou Xia,Zhepei Wei,Wei-Lin Chen,Danqi Chen,Yu Meng
机构: University of Virginia (弗吉尼亚大学); Princeton Language and Intelligence (PLI) (普林斯顿语言与智能中心), Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples – without reinforcing correct responses – can be highly effective: it consistently improves performance over the base model across the entire Pass@ k spectrum ( k up to 256 ), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@ 1 but degrades performance at higher k , due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model’s prior beliefs. It refines the model’s existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@ k performance on MATH, AIME 2025, and AMC23. Our code is available at this https URL.
zh

[NLP-86] Follow the Flow: Fine-grained Flowchart Attribution with Neurosymbolic Agents

【速读】: 该论文试图解决在流图(flowchart)解析过程中,大型语言模型(LLM)因视觉-文本关系复杂而产生的非线性结构误解和虚假连接问题,这导致了在物流、医疗和工程等关键领域中自动化流图处理的可靠性下降。解决方案的关键在于提出细粒度流图归因(Fine-grained Flowchart Attribution)任务,通过将流图分割并转换为结构化符号图,结合基于图的推理机制,实现对LLM响应的后验归因,从而提升预测的可验证性和解释性。核心方法为FlowPathAgent,其通过动态交互图结构生成归因路径,有效缓解了LLM在流图问答中的视觉幻觉问题。

链接: https://arxiv.org/abs/2506.01344
作者: Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Vivek Gupta,Dinesh Manocha
机构: University of Maryland (马里兰大学); Adobe Research (Adobe 研究院); ASU (亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Flowcharts are a critical tool for visualizing decision-making processes. However, their non-linear structure and complex visual-textual relationships make it challenging to interpret them using LLMs, as vision-language models frequently hallucinate nonexistent connections and decision paths when analyzing these diagrams. This leads to compromised reliability for automated flowchart processing in critical domains such as logistics, health, and engineering. We introduce the task of Fine-grained Flowchart Attribution, which traces specific components grounding a flowchart referring LLM response. Flowchart Attribution ensures the verifiability of LLM predictions and improves explainability by linking generated responses to the flowchart’s structure. We propose FlowPathAgent, a neurosymbolic agent that performs fine-grained post hoc attribution through graph-based reasoning. It first segments the flowchart, then converts it into a structured symbolic graph, and then employs an agentic approach to dynamically interact with the graph, to generate attribution paths. Additionally, we present FlowExplainBench, a novel benchmark for evaluating flowchart attributions across diverse styles, domains, and question types. Experimental results show that FlowPathAgent mitigates visual hallucinations in LLM answers over flowchart QA, outperforming strong baselines by 10-14% on our proposed FlowExplainBench dataset.
zh

[NLP-87] urnBench-MS: A Benchmark for Evaluating Multi-Turn Multi-Step Reasoning in Large Language Models

【速读】: 该论文旨在解决现有基准测试在评估大型语言模型(Large Language Models, LLMs)时过于侧重单轮或单步骤任务,而未能捕捉实际应用场景中所需的迭代推理问题。其解决方案的关键在于引入TurnBench,这是一个基于“图灵机棋盘游戏”灵感的交互式代码破解任务基准,通过多轮、多步骤的推理过程来评估模型的动态推理能力,包括序列猜测、结构化反馈接收以及跨轮次线索整合,从而更真实地反映模型在复杂任务中的表现。

链接: https://arxiv.org/abs/2506.01341
作者: Yiran Zhang,Mo Wang,Xiaoyang Li,Kaixuan Ren,Chencheng Zhu,Usman Naseem
机构: Macquarie University (麦考瑞大学); University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by a “Turing Machine Board Game.” In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps-capabilities underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 81.5% accuracy in Classic mode, but performance drops to 17.8% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.
zh

[NLP-88] he Landscape of Arabic Large Language Models (ALLM s): A New Era for Arabic Language Technology

【速读】: 该论文旨在探讨阿拉伯语大型语言模型(Arabic Large Language Models, ALLMs)的发展与挑战,特别是在阿拉伯世界中构建和应用这些模型所面临的独特问题。其解决方案的关键在于推动针对阿拉伯语的AI技术进步,通过开发专门的LLMs来弥补技术鸿沟,并促进阿拉伯语社区在数字时代的创新能力。文章强调了ALLMs从基础文本处理系统向先进AI驱动模型的演进过程,并指出评估这些模型的基准测试和公开排行榜在推动技术发展中的重要性。

链接: https://arxiv.org/abs/2506.01340
作者: Shahad Al-Khalifa,Nadir Durrani,Hend Al-Khalifa,Firoj Alam
机构: King Saud University (沙特国王大学); iWAN Research Group (iWAN研究组); Qatar Computing Research Institute (卡塔尔计算研究研究所)
类目: Computation and Language (cs.CL)
备注: Accepted at CACM

点击查看摘要

Abstract:The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speaking users have significantly benefited from these advancements, the Arabic world faces distinct challenges in developing Arabic-specific LLMs. Arabic, one of the languages spoken most widely around the world, serves more than 422 million native speakers in 27 countries and is deeply rooted in a rich linguistic and cultural heritage. Developing Arabic LLMs (ALLMs) presents an unparalleled opportunity to bridge technological gaps and empower communities. The journey of ALLMs has been both fascinating and complex, evolving from rudimentary text processing systems to sophisticated AI-driven models. This article explores the trajectory of ALLMs, from their inception to the present day, highlighting the efforts to evaluate these models through benchmarks and public leaderboards. We also discuss the challenges and opportunities that ALLMs present for the Arab world.
zh

[NLP-89] Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models ACL2025

【速读】: 该论文试图解决概念瓶颈模型(Concept Bottleneck Models, CBMs)中概念数量选择不当的问题,即如何确定最优的概念数量以实现充分且简洁的覆盖。当前概念库存在冗余或覆盖不足的问题,影响了模型的性能与可解释性。解决方案的关键在于引入一种基于代理的动态方法,根据环境反馈调整概念库,从而优化概念数量;同时提出条件概念瓶颈模型(Conditional Concept Bottleneck Models, CoCoBMs),通过改进概念评分机制和可编辑矩阵,提升概念对分类任务的贡献评估准确性,并允许大型语言模型(Large Language Models, LLMs)修正与内部知识冲突的概念分数。

链接: https://arxiv.org/abs/2506.01334
作者: Yiwen Jiang,Deval Mehta,Wei Feng,Zongyuan Ge
机构: Monash University(莫纳什大学); AIM for Health Lab(人工智能健康实验室)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025 (Main)

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) decompose image classification into a process governed by interpretable, human-readable concepts. Recent advances in CBMs have used Large Language Models (LLMs) to generate candidate concepts. However, a critical question remains: What is the optimal number of concepts to use? Current concept banks suffer from redundancy or insufficient coverage. To address this issue, we introduce a dynamic, agent-based approach that adjusts the concept bank in response to environmental feedback, optimizing the number of concepts for sufficiency yet concise coverage. Moreover, we propose Conditional Concept Bottleneck Models (CoCoBMs) to overcome the limitations in traditional CBMs’ concept scoring mechanisms. It enhances the accuracy of assessing each concept’s contribution to classification tasks and feature an editable matrix that allows LLMs to correct concept scores that conflict with their internal knowledge. Our evaluations across 6 datasets show that our method not only improves classification accuracy by 6% but also enhances interpretability assessments by 30%.
zh

[NLP-90] An Empirical Study of Group Conformity in Multi-Agent Systems

【速读】: 该论文试图解决多智能体大型语言模型(Large Language Models, LLMs)在涉及社会争议性议题的互动中,偏见的产生与传播问题。现有研究已广泛探讨了与受保护属性(如种族)相关的偏见,但对社会争议性议题中偏见的演变机制仍缺乏深入理解。论文通过模拟超过2,500场辩论,分析初始中立的LLM代理如何随时间逐渐形成特定立场,揭示了群体一致性现象及其与人类行为的相似性。解决方案的关键在于识别代理智能在话语塑造中的核心作用,并强调通过政策手段提升LLM生成讨论的多样性和透明度,以减少匿名在线环境中偏见扩散的风险。

链接: https://arxiv.org/abs/2506.01332
作者: Min Choi,Keonwoo Kim,Sungwon Chae,Sangyeob Baek
机构: Kim & Chang AI&IT System Center (金昌人工智能与信息技术中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have enabled multi-agent systems that simulate real-world interactions with near-human reasoning. While previous studies have extensively examined biases related to protected attributes such as race, the emergence and propagation of biases on socially contentious issues in multi-agent LLM interactions remain underexplored. This study explores how LLM agents shape public opinion through debates on five contentious topics. By simulating over 2,500 debates, we analyze how initially neutral agents, assigned a centrist disposition, adopt specific stances over time. Statistical analyses reveal significant group conformity mirroring human behavior; LLM agents tend to align with numerically dominant groups or more intelligent agents, exerting a greater influence. These findings underscore the crucial role of agent intelligence in shaping discourse and highlight the risks of bias amplification in online interactions. Our results emphasize the need for policy measures that promote diversity and transparency in LLM-generated discussions to mitigate the risks of bias propagation within anonymous online environments.
zh

[NLP-91] Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines

【速读】: 该论文旨在解决心理危机干预中心理咨询热线因需求增加而面临的资源不足问题,同时探索大型语言模型(Large Language Models, LLMs)在情绪敏感场景下的应用潜力。其解决方案的关键在于构建PsyCrisisBench基准测试集,涵盖540个标注的通话记录,并针对四个关键任务(情绪状态识别、自杀意念检测、自杀计划识别和风险评估)评估不同LLMs的表现,通过零样本、少样本和微调等方法优化模型性能,以实现对心理危机的结构化评估。研究还发现模型参数规模与性能之间存在一定的正相关性,且通过量化技术可有效降低计算资源消耗,为LLMs在心理健康领域的实际部署提供了可行路径。

链接: https://arxiv.org/abs/2506.01329
作者: Guifeng Deng,Shuyin Rao,Tianyu Lin,Anlu Dai,Pan Wang,Junyi Xie,Haidong Song,Ke Zhao,Dongwu Xu,Zhengdong Cheng,Tao Li,Haiteng Jiang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 8 figures

点击查看摘要

Abstract:Psychological support hotlines are critical for crisis intervention but face significant challenges due to rising demand. Large language models (LLMs) could support crisis assessments, yet their capabilities in emotionally sensitive contexts remain unclear. We introduce PsyCrisisBench, a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 families (e.g., GPT, Claude, Gemini, Llama, Qwen, DeepSeek) using zero-shot, few-shot, and fine-tuning paradigms. Performance was measured by F1-score, with statistical comparisons via Welch’s t-tests. LLMs performed strongly on suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), improved with few-shot and fine-tuning. Mood status recognition was more challenging (max F1=0.709), likely due to lost vocal cues and ambiguity. A fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) surpassed larger models on mood and suicidal ideation. Open-source models like QwQ-32B performed comparably to closed-source on most tasks (p0.3), though closed models retained an edge in mood detection (p=0.007). Performance scaled with size up to a point; quantization (AWQ) reduced GPU memory by 70% with minimal F1 degradation. LLMs show substantial promise in structured psychological crisis assessments, especially with fine-tuning. Mood recognition remains limited due to contextual complexity. The narrowing gap between open- and closed-source models, combined with efficient quantization, suggests feasible integration. PsyCrisisBench offers a robust evaluation framework to guide model development and ethical deployment in mental health.
zh

[NLP-92] Zero-Shot Text-to-Speech for Vietnamese ACL2025

【速读】: 该论文旨在解决越南语文本到语音(Text-to-Speech, TTS)合成中数据稀缺及模型泛化能力不足的问题。其解决方案的关键在于构建了一个高质量的越南语语音数据集——PhoAudiobook,该数据集包含941小时的音频资源,并基于此数据集对三种先进的零样本TTS模型进行了实验验证,结果表明PhoAudiobook能够显著提升模型在多种评估指标上的表现,尤其在短句合成任务中,VALL-E和VoiceCraft展现出更强的鲁棒性。

链接: https://arxiv.org/abs/2506.01322
作者: Thi Vu,Linh The Nguyen,Dat Quoc Nguyen
机构: Movian AI(_movian_ai)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: To appear in Proceedings of ACL 2025 (Main conference paper)

点击查看摘要

Abstract:This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.
zh

[NLP-93] Growing Through Experience: Scaling Episodic Grounding in Language Models ACL2025

【速读】: 该论文试图解决语言模型(Language Models, LMs)在物理规划任务中缺乏有效的情景记忆(episodic grounding)机制的问题,尤其是中等规模模型(7B参数)在可扩展性和集成性方面的局限性,以及大规模模型(70-405B参数)虽然具备强大的层次化表示和预训练知识,却无法高效利用经验流的规模悖论。解决方案的关键在于提出一种可扩展的弱到强情景学习框架,通过蒙特卡洛树搜索(Monte Carlo Tree Search)进行结构化经验收集,并结合一种新颖的知识蒸馏方法,在保留语言模型固有能力的同时嵌入情景记忆,从而实现从较小模型到较大模型的情景行为迁移。

链接: https://arxiv.org/abs/2506.01312
作者: Chunhui Zhang,Sirui(Elsie)Wang,Zhongyu Ouyang,Xiangchi Yuan,Soroush Vosoughi
机构: Dartmouth College (达特茅斯学院); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: Accepted at The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

点击查看摘要

Abstract:Language models (LMs) require robust episodic grounding-the capacity to learn from and apply past experiences-to excel at physical planning tasks. Current episodic grounding approaches struggle with scalability and integration, limiting their effectiveness, especially for medium-sized LMs (7B parameters). While larger LMs (70-405B parameters) possess superior hierarchical representations and extensive pre-trained knowledge, they encounter a fundamental scale paradox: despite their advanced abstraction capabilities, they lack efficient mechanisms to leverage experience streams. We propose a scalable weak-to-strong episodic learning framework that effectively transfers episodic behaviors from smaller to larger LMs. This framework integrates Monte Carlo tree search for structured experience collection with a novel distillation method, preserving the inherent LM capabilities while embedding episodic memory. Experiments demonstrate our method surpasses state-of-the-art proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing further indicates significant improvements in task alignment, especially within deeper LM layers, highlighting stable generalization even for previously unseen scenarios with increased planning complexity-conditions where baseline methods degrade markedly.
zh

[NLP-94] A Platform for Investigating Public Health Content with Efficient Concern Classification

【速读】: 该论文试图解决在线内容中表达的对公共卫生措施的担忧阻碍了全球预防性措施的采用问题,其核心在于如何快速有效地识别文本语料库中的健康相关关切。解决方案的关键是提出ConcernScope平台,该平台采用教师-学生框架在大型语言模型与轻量级分类器之间进行知识迁移,从而高效识别文本中的健康问题。

链接: https://arxiv.org/abs/2506.01308
作者: Christopher Li,Rickard Stureborg,Bhuwan Dhingra,Jun Yang
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 19 pages, 15 figures

点击查看摘要

Abstract:A recent rise in online content expressing concerns with public health initiatives has contributed to already stalled uptake of preemptive measures globally. Future public health efforts must attempt to understand such content, what concerns it may raise among readers, and how to effectively respond to it. To this end, we present ConcernScope, a platform that uses a teacher-student framework for knowledge transfer between large language models and light-weight classifiers to quickly and effectively identify the health concerns raised in a text corpus. The platform allows uploading massive files directly, automatically scraping specific URLs, and direct text editing. ConcernScope is built on top of a taxonomy of public health concerns. Intended for public health officials, we demonstrate several applications of this platform: guided data exploration to find useful examples of common concerns found in online community datasets, identification of trends in concerns through an example time series analysis of 186,000 samples, and finding trends in topic frequency before and after significant events.
zh

[NLP-95] VM14K: First Vietnamese Medical Benchmark

【速读】: 该论文旨在解决非英语社区在医疗领域语言模型评估中缺乏足够资源和标准化方法的问题,以及现有非英语医疗数据碎片化、难以验证的挑战。其关键解决方案是开发一种可扩展的方法,通过整合可验证的来源(如精心整理的医学考试和临床记录)并由医疗专家进行标注,构建了一个涵盖34个医学专科、共14,000道选择题的越南语医疗问题基准。该基准设计包含四个难度等级,能够全面评估语言模型在目标语言中的医学理解广度与深度。

链接: https://arxiv.org/abs/2506.01305
作者: Thong Nguyen,Duc Nguyen,Minh Dang,Thai Dao,Long Nguyen,Quan H. Nguyen,Dat Nguyen,Kien Tran,Minh Tran
机构: Vietnam National University (越南国家大学); Dickinson College (迪金森学院); Columbia University (哥伦比亚大学); Venera AI (Venera AI); Carnegie Mellon University (卡内基梅隆大学); University of Maryland (马里兰大学); Foreign Trade University (外贸大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities,therefore help ensuring the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmark, and available non-English medical data is normally fragmented and difficult to verify. We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed using various verifiable sources, including carefully curated medical exams and clinical records, and eventually annotated by medical experts. The benchmark includes four difficulty levels, ranging from foundational biological knowledge commonly found in textbooks to typical clinical case studies that require advanced reasoning. This design enables assessment of both the breadth and depth of language models’ medical understanding in the target language thanks to its extensive coverage and in-depth subject-specific expertise. We release the benchmark in three parts: a sample public set (4k questions), a full public set (10k questions), and a private set (2k questions) used for leaderboard evaluation. Each set contains all medical subfields and difficulty levels. Our approach is scalable to other languages, and we open-source our data construction pipeline to support the development of future multilingual benchmarks in the medical domain.
zh

[NLP-96] Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning : A Scalable Bayesian Planner ICML2025

【速读】: 该论文试图解决现有计算理论-心智(Theory-of-Mind, ToM)方法在多模态环境中可扩展性不足以及任务复杂度增加时泛化能力差的问题。其解决方案的关键在于提出一种可扩展的贝叶斯ToM规划器,通过分步贝叶斯更新分解ToM推理过程,并引入弱到强控制机制,使小型语言模型(LM)专注于ToM特定似然估计,并将其推理行为迁移至大型LM(7B至405B参数规模),从而实现与社会和世界知识的整合。

链接: https://arxiv.org/abs/2506.01301
作者: Chunhui Zhang,Zhongyu Ouyang,Kwonjoon Lee,Nakul Agarwal,Sean Dae Houlihan,Soroush Vosoughi,Shao-Yuan Lo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted as a Spotlight at the 2025 Forty-Second International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:Theory-of-Mind (ToM) enables humans to infer mental states-such as beliefs, desires, and intentions-forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments show that our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks, including challenging unseen scenarios, thereby establishing a new standard for modeling human mental states in complex environments.
zh

[NLP-97] Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation

【速读】: 该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)评估体系中对结构化抽象知识理解能力的忽视问题。现有评估基准和排行榜主要关注模型在多样化场景和对象上的综合理解能力,而忽略了MLLMs在视觉形式中表现出的结构化抽象知识的 comprehend 能力。解决方案的关键在于提出一种新的评估范式,并构建M3STR基准,该基准基于多模态地图(Multi-Modal Map)用于结构化理解(STRuctured understanding),通过多模态知识图谱合成包含丰富多模态实体的子图架构图像,要求MLLMs不仅识别视觉输入中的多模态实体,还需解析其复杂的关联拓扑结构。

链接: https://arxiv.org/abs/2506.01293
作者: Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Min Zhang,Wen Zhang,Huajun Chen
机构: Zhejiang University(浙江大学); Tianjin University(天津大学); Harbin Institute of Technology Shenzhen(哈尔滨工业大学深圳); China(中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark’s statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs’ holistic reasoning capacities. Our code and data are released at this https URL
zh

[NLP-98] Schema as Parameterized Tools for Universal Information Extraction

【速读】: 该论文旨在解决通用信息抽取(Universal Information Extraction, UIE)在面对多种预定义模式(schema)时适应性不足的问题,特别是在上下文学习范式下,模型在选择预定义模式与动态生成模式之间缺乏灵活性。其解决方案的关键在于提出一种统一的自适应文本到结构生成框架——Schema as Parameterized Tools (SPT),该框架将预定义模式视为可参数化的工具,通过模式检索、模式填充和模式生成三种机制,实现对封闭式、开放式和按需式信息抽取任务的统一处理,从而提升模型在不同任务和模式下的适应能力。

链接: https://arxiv.org/abs/2506.01276
作者: Sheng Liang,Yongyue Zhang,Yaxiong Wu,Ruiming Tang,Yong Liu
机构: Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注: 12 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Universal information extraction (UIE) primarily employs an extractive generation approach with large language models (LLMs), typically outputting structured information based on predefined schemas such as JSON or tables. UIE suffers from a lack of adaptability when selecting between predefined schemas and on-the-fly schema generation within the in-context learning paradigm, especially when there are numerous schemas to choose from. In this paper, we propose a unified adaptive text-to-structure generation framework, called Schema as Parameterized Tools (SPT), which reimagines the tool-calling capability of LLMs by treating predefined schemas as parameterized tools for tool selection and parameter filling. Specifically, our SPT method can be applied to unify closed, open, and on-demand IE tasks by adopting Schema Retrieval by fetching the relevant schemas from a predefined pool, Schema Filling by extracting information and filling slots as with tool parameters, or Schema Generation by synthesizing new schemas with uncovered cases. Experiments show that the SPT method can handle four distinct IE tasks adaptively, delivering robust schema retrieval and selection performance. SPT also achieves comparable extraction performance to LoRA baselines and current leading UIE systems with significantly fewer trainable parameters.
zh

[NLP-99] Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model

【速读】: 该论文旨在解决大型语言模型(Large language model, LLM)中毒性内容生成的问题,现有方法通常依赖于大规模无害或人工标注的偏好数据训练、设计提示以指导模型生成安全内容,或修改模型参数以去除有害信息,但这些方法计算成本高、鲁棒性差,并且常损害LLM的流畅性和上下文理解能力。论文提出了一种简单而有效的LLM去毒方法,其关键在于利用一个紧凑的预训练校准模型,通过在生成流程中的轻量级干预来引导目标LLM的去毒过程。该校准模型通过从无害数据中学习去毒嵌入空间,有效引导LLM避免生成有害内容,仅需一次训练即可无缝应用于多个LLM,同时保持流畅性和上下文理解能力。

链接: https://arxiv.org/abs/2506.01266
作者: Yuanhe Tian,Mingjie Deng,Guoqing Jin,Yan Song
机构: University of Washington (华盛顿大学); University of Science and Technology of China (中国科学技术大学); People’s Daily Online (人民日报网络版)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Existing approaches for Large language model (LLM) detoxification generally rely on training on large-scale non-toxic or human-annotated preference data, designing prompts to instruct the LLM to generate safe content, or modifying the model parameters to remove toxic information, which are computationally expensive, lack robustness, and often compromise LLMs’ fluency and contextual understanding. In this paper, we propose a simple yet effective approach for LLM detoxification, which leverages a compact, pre-trained calibration model that guides the detoxification process of a target LLM via a lightweight intervention in its generation pipeline. By learning a detoxified embedding space from non-toxic data, the calibration model effectively steers the LLM away from generating harmful content. This approach only requires a one-time training of the calibration model that is able to be seamlessly applied to multiple LLMs without compromising fluency or contextual understanding. Experiment results on the benchmark dataset demonstrate that our approach reduces toxicity while maintaining reasonable content expression.
zh

[NLP-100] Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines ACL2025

【速读】: 该论文试图解决预训练大语言模型(Large Language Models, LLMs)在上下文学习(In-context Learning, ICL)中对于长文本生成任务(如摘要生成)表现不佳的问题。研究指出,仅依靠ICL示例不足以教会模型任务的语言和格式分布,因此提出解决方案的关键在于通过显式暴露任务分布来增强模型性能。具体而言,论文提出了LongGuide,该方法高效生成两条并行的指导流:(i)度量指导(Metric Guidelines, MGs),用于指导模型优化自评估指标;(ii)输出约束指导(Output Constraint Guidelines, OCGs),在词和句子层面约束生成过程。LongGuide通过自动选择最佳指导组合,在零样本和少样本设置下显著提升了强开源与闭源LLMs的性能。

链接: https://arxiv.org/abs/2506.01265
作者: Do Xuan Long,Duong Ngoc Yen,Do Xuan Trong,Luu Anh Tuan,Kenji Kawaguchi,Shafiq Joty,Min-Yen Kan,Nancy F. Chen
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); Institute for Infocomm Research (I2R), ASTAR (资讯通信研究院,ASTAR); Salesforce AI Research (Salesforce人工智能研究)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). It can greatly enhance task performance using a few examples, termed demonstrations, without fine-tuning. Although effective in question answering, ICL often underperforms in long-form generation tasks such as summarization. Under appropriately realistic assumptions, we empirically and theoretically show that ICL demonstrations alone are insufficient to teach LLMs the task language and format distributions for generation. We argue for explicit exposure to the task distributions and hypothesize that defining them by prompting enhances model performance. To this end, we present LongGuide, which efficiently generates two parallel streams of guidelines capturing task language and format properties: (i) Metric Guidelines (MGs) that instruct models to optimize self-evaluated metrics; and (ii) Output Constraint Guidelines (OCGs) that constrain generation at both token and sentence levels. LongGuide automatically selects the best combination of guidelines, improving both strong open- and closed-source LLMs by over 5% in both zero- and few-shot settings. We show that LongGuide is generalizable, learnable by weak models to enhance strong ones, and integrates synergistically with automatic prompt optimizers.
zh

[NLP-101] WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing INTERSPEECH2025

【速读】: 该论文旨在解决基于CTC(Connectionist Temporal Classification)的端到端语音识别模型在识别专有名词和其他未知词时准确率低的问题,这是因为模型输出容易偏向训练数据的词汇。解决方案的关键在于在推理过程中利用中间层的声学特征进行关键词检测,并对后续声学模型层施加偏差,从而提升罕见词的识别精度。该方法采用了一种快速且容忍模糊匹配的通配符CTC,实现了对难以严格匹配词汇的灵活处理,且无需重新训练现有模型,具有良好的可扩展性。

链接: https://arxiv.org/abs/2506.01263
作者: Yu Nakagome,Michael Hentschel
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data’s vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. Specifically, keyword spotting is performed using acoustic features of intermediate layers during inference, and a bias is applied to the subsequent layers of the acoustic model for detected keywords. For keyword detection, we adopt a wildcard CTC that is both fast and tolerant of ambiguous matches, allowing flexible handling of words that are difficult to match strictly. Since this method does not require retraining of existing models, it can be easily applied to even large-scale models. In experiments on Japanese speech recognition, the proposed method achieved a 29% improvement in the F1 score for unknown words.
zh

[NLP-102] Exploring the Potential of LLM s as Personalized Assistants: Dataset Evaluation and Analysis ACL2025

【速读】: 该论文试图解决个性化AI助手(Personalized AI assistants)在大型语言模型(Large Language Models, LLMs)研究中面临的挑战,尤其是由于缺乏一个开源的对话数据集来支持个性化研究而造成的障碍。解决方案的关键在于引入HiCUPID,这是一个新的基准测试平台,旨在探索并释放LLMs生成个性化响应的潜力,同时提供了一个基于Llama-3.2的自动化评估模型,其评估结果能够紧密反映人类偏好。

链接: https://arxiv.org/abs/2506.01262
作者: Jisoo Mok,Ik-hwan Kim,Sangkwon Park,Sungroh Yoon
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), are a challenging application that intertwines multiple problems in LLM research. Despite the growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessment closely mirrors human preferences. We release our dataset, evaluation model, and code at this https URL.
zh

[NLP-103] DeepSeek in Healthcare: A Survey of Capabilities Risks and Clinical Applications of Open-Source Large Language Models

【速读】: 该论文旨在解决开放源代码大型语言模型(LLM)在复杂推理任务中的性能与安全性之间的平衡问题,特别是在医疗、数学、编程等专业领域中实现高效且可靠的推理能力。其解决方案的关键在于采用混合架构,整合了专家混合(MoE)、思维链(CoT)推理和强化学习技术,以提升模型的推理深度与效率,同时通过开源方式提供透明性和可扩展性,从而为资源受限环境下的应用提供可行的替代方案。

链接: https://arxiv.org/abs/2506.01257
作者: Jiancheng Ye,Sophie Bronstein,Jiarui Hai,Malak Abu Hashish
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek, showcasing advanced reasoning capabilities through a hybrid architecture that integrates mixture of experts (MoE), chain of thought (CoT) reasoning, and reinforcement learning. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models like GPT-4o and Claude-3 Opus; it excels in structured problem-solving domains such as mathematics, healthcare diagnostics, code generation, and pharmaceutical research. The model demonstrates competitive performance on benchmarks like the United States Medical Licensing Examination (USMLE) and American Invitational Mathematics Examination (AIME), with strong results in pediatric and ophthalmologic clinical decision support tasks. Its architecture enables efficient inference while preserving reasoning depth, making it suitable for deployment in resource-constrained settings. However, DeepSeek-R1 also exhibits increased vulnerability to bias, misinformation, adversarial manipulation, and safety failures - especially in multilingual and ethically sensitive contexts. This survey highlights the model’s strengths, including interpretability, scalability, and adaptability, alongside its limitations in general language fluency and safety alignment. Future research priorities include improving bias mitigation, natural language comprehension, domain-specific validation, and regulatory compliance. Overall, DeepSeek-R1 represents a major advance in open, scalable AI, underscoring the need for collaborative governance to ensure responsible and equitable deployment.
zh

[NLP-104] Memory-Efficient FastText: A Comprehensive Approach Using Double-Array Trie Structures and Mark-Compact Memory Management

【速读】: 该论文旨在解决FastText在大规模工业部署中因基于哈希的桶机制导致的内存消耗过大和语义漂移问题(hash collisions cause semantic drift)。其解决方案的关键在于引入双数组前缀树(double-array trie, DA-trie)结构与标记-压缩垃圾回收(mark-compact garbage collection)原理,通过系统性地识别并合并具有语义相似性的n-gram嵌入,实现内存的高效压缩,同时保持嵌入质量。该方法通过前缀和后缀相关的相似性压缩以及内存重组,实现了4:1至10:1的压缩比,并在3000万中文词汇数据集上将内存占用从超过100GB降低至约30GB,显著提升了模型的部署效率与可靠性。

链接: https://arxiv.org/abs/2506.01254
作者: Yimin Du
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:FastText has established itself as a fundamental algorithm for learning word representations, demonstrating exceptional capability in handling out-of-vocabulary words through character-level n-gram embeddings. However, its hash-based bucketing mechanism introduces critical limitations for large-scale industrial deployment: hash collisions cause semantic drift, and memory requirements become prohibitively expensive when dealing with real-world vocabularies containing millions of terms. This paper presents a comprehensive memory optimization framework that fundamentally reimagines FastText’s memory management through the integration of double-array trie (DA-trie) structures and mark-compact garbage collection principles. Our approach leverages the linguistic insight that n-grams sharing common prefixes or suffixes exhibit highly correlated embeddings due to co-occurrence patterns in natural language. By systematically identifying and merging semantically similar embeddings based on structural relationships, we achieve compression ratios of 4:1 to 10:1 while maintaining near-perfect embedding quality. The algorithm consists of four sophisticated phases: prefix trie construction with embedding mapping, prefix-based similarity compression, suffix-based similarity compression, and mark-compact memory reorganization. Comprehensive experiments on a 30-million Chinese vocabulary dataset demonstrate memory reduction from over 100GB to approximately 30GB with negligible performance degradation. Our industrial deployment results show significant cost reduction, faster loading times, and improved model reliability through the elimination of hash collision artifacts. Code and experimental implementations are available at: this https URL
zh

[NLP-105] CoRE: Condition-based Reasoning for Identifying Outcome Variance in Complex Events

【速读】: 该论文试图解决在复杂事件结果中识别隐含条件并评估其对结果影响的问题(Identifying implied conditions and examining their influence on an outcome),这一过程对于验证关于复杂事件结果的声明具有重要意义。解决方案的关键在于结合并扩展两个现有数据集中的目标和状态注释,并通过基于条件的推理任务(Condition-based Reasoning tasks)来探索条件的影响。研究发现,当缺乏完整上下文时,条件信息对于结果验证具有价值,而模型在生成和识别结果相关条件方面的能力差异显著影响其性能,尤其在大型模型如GPT-4o中表现出更高的谨慎性。

链接: https://arxiv.org/abs/2506.01253
作者: Sai Vallurupalli,Francis Ferraro
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of the Association for Computational Linguistics 2025

点击查看摘要

Abstract:Knowing which latent conditions lead to a particular outcome is useful for critically examining claims made about complex event outcomes. Identifying implied conditions and examining their influence on an outcome is challenging. We handle this by combining and augmenting annotations from two existing datasets consisting of goals and states, and explore the influence of conditions through our research questions and Condition-based Reasoning tasks. We examine open and closed LLMs of varying sizes and intent-alignment on our reasoning tasks and find that conditions are useful when not all context is available. Models differ widely in their ability to generate and identify outcome-variant conditions which affects their performance on outcome validation when conditions are used to replace missing context. Larger models like GPT-4o, are more cautious in such less constrained situations.
zh

[NLP-106] MTCMB: A Multi-Task Benchmark Framework for Evaluating LLM s on Knowledge Reasoning and Safety in Traditional Chinese Medicine

【速读】: 该论文试图解决传统医学(Traditional Chinese Medicine, TCM)在计算建模与评估中面临的隐性推理、多样化的文本形式以及缺乏标准化等问题,尤其是在大型语言模型(Large Language Models, LLMs)在TCM领域系统性评估方面的不足。解决方案的关键在于引入MTCMB——一个针对TCM知识、推理与安全性的多任务基准测试集,该基准由认证的TCM专家共同开发,涵盖知识问答、语言理解、诊断推理、处方生成和安全评估五个主要类别,整合了真实病例记录、国家执业考试和经典文献,为TCM能力模型提供了一个真实且全面的测试平台。

链接: https://arxiv.org/abs/2506.01252
作者: Shufeng Kong,Xingru Yang,Yuanyuan Wei,Zijie Wang,Hao Tang,Jiuqi Qin,Shuting Lan,Yingheng Wang,Junwen Bai,Zhuangbin Chen,Zibin Zheng,Caihua Liu,Hao Liang
机构: Sun Yat-sen University (中山大学); Hunan University of Chinese Medicine (湖南中医药大学); Guilin University of Electronic Technology (桂林电子科技大学); Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare-particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB-a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: this https URL.
zh

[NLP-107] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

【速读】: 该论文试图解决专家级任务中大型语言模型(Large Language Models, LLMs)性能不足的问题,特别是在需要生成长文本输出并严格遵循领域特定要求的场景下。解决方案的关键在于构建ExpertLongBench基准和提出CLEAR评估框架,其中CLEAR通过从模型输出和参考输出中提取与任务特定评分标准相对应的信息,生成检查清单,并进行对比以实现基于事实的评估,从而支持对长文本模型输出的精确评价。

链接: https://arxiv.org/abs/2506.01241
作者: Jie Ruan,Inderjeet Nair,Shuyang Cao,Amy Liu,Sheza Munir,Micah Pollens-Dempsey,Tiffany Chiang,Lucy Kates,Nicholas David,Sihan Chen,Ruxin Yang,Yuqian Yang,Jasmine Gump,Tessa Bialek,Vivek Sankaran,Margo Schlanger,Lu Wang
机构: University of Michigan(密歇根大学); University of Michigan Law School(密歇根大学法学院); School of Information, University of Michigan(信息学院,密歇根大学); Materials Science & Engineering, University of Michigan(材料科学与工程,密歇根大学); Carnegie Mellon University(卡内基梅隆大学); Biomedical Engineering, University of Michigan(生物医学工程,密歇根大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
zh

[NLP-108] Polishing Every Facet of the GEM: Testing Linguistic Competence of LLM s and Humans in Korean ACL2025

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在韩语语言能力评估中的不足问题,旨在全面评估LLMs与人类在韩语方面的语言能力。解决方案的关键在于构建了一个名为Korean Grammar Evaluation Benchmark (KoGEM)的基准测试集,包含1.5k道多选问答对,覆盖五个主要类别和16个子类别。通过零样本评估27种不同规模和类型的LLMs,研究发现LLMs在依赖定义性知识的任务中表现优异,但在需要整合现实世界经验知识的任务中存在明显不足,如语音规则和发音。该研究进一步指出,将此类经验知识纳入模型训练可能提升其语言能力,从而为增强全面的语言理解提供方向。

链接: https://arxiv.org/abs/2506.01237
作者: SungHo Kim,Nayeon Kim,Taehee Jeon,SangKeun Lee
机构: Korea University(高丽大学); Korea University(高丽大学); Korea University(高丽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025 main conference

点击查看摘要

Abstract:We introduce the \underlineKorean \underlineGrammar \underlineEvaluation Bench\underlineMark (KoGEM) , designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: this https URL.
zh

[NLP-109] Compress Gather and Recompute: REFORMing Long-Context Processing in Transformers

【速读】: 该论文旨在解决大语言模型在处理超长上下文时面临的挑战,特别是在输入长度超过模型预训练上下文限制的情况下。其解决方案的关键在于提出一种名为REFORM的新型推理框架,该框架采用两阶段方法:第一阶段通过增量处理输入块并维护压缩的键值(KV)缓存,构建跨层上下文嵌入,并利用提前退出策略提高效率;第二阶段通过相似性匹配识别并收集关键标记,并选择性地重新计算KV缓存,从而在保持信息完整性的同时提升处理效率。

链接: https://arxiv.org/abs/2506.01215
作者: Woomin Song,Sai Muralidhar Jayanthi,Srikanth Ronanki,Kanthashree Mysore Sathyendra,Jinwoo Shin,Aram Galstyan,Shubham Katiyar,Sravan Babu Bodapati
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model’s pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes early exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
zh

[NLP-110] Mamba Drafters for Speculative Decoding

【速读】: 该论文试图解决大规模语言模型(Large Language Model, LLM)生成过程中存在的加速与分布对齐之间的权衡问题。现有方法在使用外部草稿生成器时虽具有灵活性但可能效率较低,而自推测方法虽然针对目标模型优化但需要重新训练。论文的解决方案关键在于引入基于Mamba的状态空间模型(State Space Model, SSM)作为新型草稿生成器,利用其线性结构避免了传统Transformer方法中的二次复杂度,从而实现更快的草稿生成和更低的内存占用,同时保持跨模型适应性。此外,论文还提出了一种新的测试时树搜索算法以提升草稿候选的质量。

链接: https://arxiv.org/abs/2506.01206
作者: Daewon Choi,Seunghyuk Oh,Saket Dingliwal,Jihoon Tack,Kyuyoung Kim,Woomin Song,Seojin Kim,Insu Han,Jinwoo Shin,Aram Galstyan,Shubham Katiyar,Sravan Babu Bodapati
机构: KAIST(韩国科学技术院); Amazon AGI(亚马逊AGI); Seoul National University(首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model’s distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
zh

[NLP-111] rick or Neat: Adversarial Ambiguity and Language Model Evaluation

【速读】: 该论文试图解决语言模型对歧义(ambiguity)敏感性不足的问题,特别是在处理句法、词汇和语音歧义时的识别能力。其解决方案的关键在于引入一个对抗性歧义数据集,并利用基于模型表示的线性探测器(linear probes)来解码歧义,从而在高精度(有时超过90%)下有效识别歧义。

链接: https://arxiv.org/abs/2506.01205
作者: Antonia Karamolegkou,Oliver Eberle,Phillip Rust,Carina Kauf,Anders Søgaard
机构: University of Copenhagen (哥本哈根大学); Technische Universität Berlin (柏林工业大学); Massachusetts Institute of Technology (麻省理工学院); Aleph Alpha Research (Aleph Alpha 研究)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models’ sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: this https URL.
zh

[NLP-112] Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

【速读】: 该论文试图解决传统稀疏字典学习(Sparse Dictionary Learning)方法在学习概念时无法利用或表示所学概念之间语义关系的问题。其解决方案的关键在于引入一种改进的稀疏自编码器(Sparse Autoencoder, SAE)架构,该架构显式建模概念的语义层次结构,从而不仅能够学习到具有语义层次的内部表示,还提升了重建能力和可解释性,同时显著提高了计算效率。

链接: https://arxiv.org/abs/2506.01197
作者: Mark Muchane,Sean Richardson,Kiho Park,Victor Veitch
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available at this https URL

点击查看摘要

Abstract:Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
zh

[NLP-113] CoBRA: Quantifying Strategic Language Use and LLM Prag matics

【速读】: 该论文试图解决在非合作性话语(non-cooperative discourse)中,语言策略性使用的系统性理解不足的问题,尤其是在高风险、对抗性场景中,现有研究多集中于合作性语用(pragmatics)而非战略性的语言使用。其解决方案的关键在于提出CoBRA(Cooperation-Breach Response Assessment)框架,并引入三个可解释的度量指标——每轮对话中的受益(Benefit at Turn, BaT)、每轮对话中的惩罚(Penalty at Turn, PaT)以及归一化相对受益(Normalized Relative Benefit at Turn, NRBaT),以量化话语策略的效果,同时构建了CHARM数据集来验证该框架的有效性。

链接: https://arxiv.org/abs/2506.01195
作者: Anshun Asher Zheng,Junyi Jessy Li,David I. Beaver
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages

点击查看摘要

Abstract:Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperativity. This leaves a gap in systematic understanding of non-cooperative discourse. To address this, we introduce CoBRA (Cooperation-Breach Response Assessment), along with three interpretable metrics – Benefit at Turn (BaT), Penalty at Turn (PaT), and Normalized Relative Benefit at Turn (NRBaT) – to quantify the perceived strategic effects of discourse moves. We also present CHARM, an annotated dataset of real courtroom cross-examinations, to demonstrate the framework’s effectiveness. Using these tools, we evaluate a range of LLMs and show that LLMs generally exhibit limited pragmatic understanding of strategic language. While model size shows an increase in performance on our metrics, reasoning ability does not help and largely hurts, introducing overcomplication and internal confusion.
zh

[NLP-114] Culturally-Grounded Chain-of-Thought (CG-CoT):Enhancing LLM Performance on Culturally-Specific Tasks in Low-Resource Languages

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理文化特定推理任务时的不足,尤其是在低资源语言中的表现问题,这限制了其全球适用性。解决方案的关键在于提出一种名为文化根基思维链(Culturally-Grounded Chain-of-Thought, CG-CoT)的新型提示策略,该策略结合了文化背景的密集向量检索与显式推理序列,从而显著提升了模型在文化相关任务中的准确性和深度。

链接: https://arxiv.org/abs/2506.01190
作者: Madhavendra Thakur
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with culturally-specific reasoning tasks, particularly in low-resource languages, hindering their global applicability. Addressing this gap is crucial for equitable AI deployment. We introduce Culturally-Grounded Chain-of-Thought (CG-CoT), a novel prompting strategy that combines dense vector retrieval of cultural context with explicit reasoning sequences. Our extensive experiments on Yoruba proverb interpretation demonstrate that CG-CoT provides significantly higher culturally-aligned accuracy and depth than traditional prompting methods, validated through both automated metrics and LLM-based evaluations. Notably, we uncover stark disparities between token-level translation metrics like BLEU and human-judged cultural relevance, suggesting a rethinking of evaluation approaches for low-resource NLP.
zh

[NLP-115] LAQuer: Localized Attribution Queries in Content-grounded Generation ACL2025

【速读】: 该论文试图解决生成式文本在引用来源时存在偏差的问题,现有方法要么将整个句子与源文档关联,导致用户难以高效验证特定事实,要么采用子句级引用但与用户兴趣不匹配。解决方案的关键是引入局部化引用查询(Localized Attribution Queries, LAQuer),通过将生成文本中的选定片段定位到对应的源文本片段,实现更细粒度和用户导向的引用。

链接: https://arxiv.org/abs/2506.01187
作者: Eran Hirsch,Aviv Slobodkin,David Wan,Elias Stengel-Eskin,Mohit Bansal,Ido Dagan
机构: Bar-Ilan University (巴伊兰大学); UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy. Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims. In contrast, existing sub-sentence attribution methods may be more precise but fail to align with users’ interests. In light of these limitations, we introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution. We compare two approaches for the LAQuer task, including prompting large language models (LLMs) and leveraging LLM internal representations. We then explore a modeling framework that extends existing attributed text generation methods to LAQuer. We evaluate this framework across two grounded text generation tasks: Multi-document Summarization (MDS) and Long-form Question Answering (LFQA). Our findings show that LAQuer methods significantly reduce the length of the attributed text. Our contributions include: (1) proposing the LAQuer task to enhance attribution usability, (2) suggesting a modeling framework and benchmarking multiple baselines, and (3) proposing a new evaluation setting to promote future research on localized attribution in content-grounded generation.
zh

[NLP-116] he Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage ACL

【速读】: 该论文试图解决预训练语言模型在心理语言学建模中,其生成的 surprisal(意外性)作为自然人类阅读时间预测指标效果较差的问题,这一现象可能源于数据泄露(data leakage)导致模型在训练过程中接触到了测试文本。论文的关键解决方案是通过两个大规模研究验证数据泄露的影响:第一项研究分析了五组自然阅读时间语料库与两个预训练数据集之间的n-gram重叠程度,结果显示泄露有限;第二项研究使用“无泄露”数据训练模型,仅与阅读时间语料库有极少重叠,再次验证了语言模型规模与 surprisal 与阅读时间拟合度之间的负相关关系,从而表明先前结果并非由数据泄露引起。

链接: https://arxiv.org/abs/2506.01172
作者: Byung-Doh Oh,Hongao Zhu,William Schuler
机构: New York University (纽约大学); Shanghai Jiao Tong University (上海交通大学); The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL)
备注: ACL Findings 2025; results with Natural Stories alignment issue corrected (commit 4700daa)

点击查看摘要

Abstract:In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token n -gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on ‘leakage-free’ data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.
zh

[NLP-117] Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish INTERSPEECH2025

【速读】: 该论文旨在解决低资源语言变体(如芬兰瑞典语,FS)中缺乏有效的误发音检测(MD)模型的问题。现有的MD系统大多针对英语等主要语言设计,而对低资源语言的支持不足。本文提出了一种适用于FS的MD模型,其关键在于采用了一种简化且语言无关的方法,仅需少量的第二语言(L2)数据即可实现良好的适应性。该方法基于多语言wav2vec 2.0模型,结合熵正则化、温度缩放和top-k归一化技术,在保持较高召回率(43.2%)的同时提升了精确率(29.8%),优于基线模型的召回率(77.5%)和精确率(17.6%)。

链接: https://arxiv.org/abs/2506.01156
作者: Nhan Phan,Mikko Kuronen,Maria Kautonen,Riikka Ullakonoja,Anna von Zansen,Yaroslav Getman,Ekaterina Voskoboinik,Tamás Grósz,Mikko Kurimo
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025 conference

点击查看摘要

Abstract:Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers’ spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after the inference to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model’s Recall (77.5%) and Precision (17.6%). Comments: Accepted to Interspeech 2025 conference Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS) Cite as: arXiv:2506.01156 [cs.CL] (or arXiv:2506.01156v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.01156 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-118] Earley-Driven Dynamic Pruning for Efficient Structured Decoding ICML2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成输出时难以严格遵守结构化或语法约束的问题,这一问题在函数调用和领域特定语言(Domain-Specific Language, DSL)生成中尤为关键。现有基于上下文无关文法的约束解码方法在每一步解码中都需要检查LLM词汇表中所有标记的有效性,导致较高的计算开销。论文提出的解决方案是ZapFormat,其核心是一种基于Earley算法的动态剪枝策略,能够实时识别并消除无效或冗余的Earley状态,从而显著降低内存占用,并通过状态缓存加速大量查询的结构化生成。

链接: https://arxiv.org/abs/2506.01151
作者: Xintong Sun,Chi Wei,Minghao Tian,Shiwen Ni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICML2025 poster

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs’ adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose \textbfZapFormat , a novel \textbfdynamic pruning strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm’s states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only \textbfconsistently maintains high-precision compliant outputs but also achieves \textbfsignificant improvements in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at this https URL.
zh

[NLP-119] A Word is Worth 4-bit: Efficient Log Parsing with Binary Coded Decimal Recognition

【速读】: 该论文试图解决现有日志解析器在捕获细粒度日志模板细节方面的不足,这导致了下游任务中准确性下降和实用性降低的问题。其解决方案的关键在于提出一种基于字符级别的日志解析器,该解析器采用了一种新颖的神经网络架构,通过聚合字符嵌入来估计二进制编码的十进制序列,从而实现高度细粒度的日志模板提取。

链接: https://arxiv.org/abs/2506.01147
作者: Prerak Srivastava,Giulio Corallo,Sergey Rybalko
机构: SAP Labs (SAP 实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Pre-print of our accepted paper at IEEE International Conference on Web Services (ICWS 2025). 4 pages, 2 figures

点击查看摘要

Abstract:System-generated logs are typically converted into categorical log templates through parsing. These templates are crucial for generating actionable insights in various downstream tasks. However, existing parsers often fail to capture fine-grained template details, leading to suboptimal accuracy and reduced utility in downstream tasks requiring precise pattern identification. We propose a character-level log parser utilizing a novel neural architecture that aggregates character embeddings. Our approach estimates a sequence of binary-coded decimals to achieve highly granular log templates extraction. Our low-resource character-level parser, tested on revised Loghub-2k and a manually annotated industrial dataset, matches LLM-based parsers in accuracy while outperforming semantic parsers in efficiency.
zh

[NLP-120] From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models INTERSPEECH2025

【速读】: 该论文试图解决的问题是:在仅使用其他模态(如语音)训练的模型中,是否能够产生类似文本模型中所观察到的抽象语义概念,以及当模型联合训练于多种模态时,是否能发展出更丰富、更结构化的语义理解。解决方案的关键在于采用潜在概念分析(Latent Concept Analysis),这是一种无监督方法,用于揭示和解释神经网络中的潜在表示,从而研究跨模态的语义抽象形成过程。

链接: https://arxiv.org/abs/2506.01133
作者: Asım Ersoy,Basel Mousi,Shammur Chowdhury,Firoj Alam,Fahim Dalvi,Nadir Durrani
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted Interspeech 2025

点击查看摘要

Abstract:The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts–showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility we made scripts and other resources available to the community.
zh

[NLP-121] Attention Retrieves MLP Memorizes: Disentangling Trainable Components in the Transformer

【速读】: 该论文试图解决的问题是:在现代大型语言模型(Large Language Models, LLMs)中,Transformer架构的性能提升究竟在多大程度上可以归因于自注意力机制(self-attention mechanism)。为了解决这一问题,研究者对比了标准Transformer与其变体,其中部分组件(如多层感知机MLP层或注意力投影器)在初始化时被冻结。关键解决方案是引入MixiT——一种简化的、基于原理的模型,其注意力系数在初始化时完全随机且固定,从而消除了注意力中的输入依赖计算或学习过程。通过这一方法,研究者发现MixiT在许多算法任务中能够达到与完整训练的Transformer相当的性能,表明自注意力机制并非所有任务性能提升的唯一关键因素。

链接: https://arxiv.org/abs/2506.01115
作者: Yihe Dong,Lorenzo Noci,Mikhail Khodak,Mufan Li
机构: Princeton University (普林斯顿大学); ETH Zurich (苏黎世联邦理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks – including mathematical reasoning, memorization, and retrieval – using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT – the Mixing Transformer – a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads – a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.
zh

[NLP-122] Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在对话式人工智能系统中生成缺乏事实依据或幻觉性回答的问题,这一问题严重影响了系统的可信度和广泛应用。其解决方案的关键在于提出一种称为强化不可回答性学习(Reinforced Unanswerability Learning, RUL)的新型混合训练范式,该方法通过将判别式不可回答性预测头与LLM的生成核心相结合,并采用多阶段学习策略进行优化,包括在增强型CAsT-答案可回答性(Enhanced-CAsT-Answerability, ECA)数据集上的监督微调以及基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)阶段,从而提升模型对不可回答问题的检测能力和拒绝回答的可靠性与有效性。

链接: https://arxiv.org/abs/2506.01104
作者: Steven Robinson,Antonio Carlos Rivera
机构: EDP University of Puerto Rico: San Sebastian(EDP波多黎各大学:圣塞巴斯蒂安)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The pervasive deployment of large language models (LLMs) in conversational AI systems has revolutionized information access, yet their propensity for generating factually unsupported or hallucinated responses remains a critical impediment to trustworthiness and widespread adoption. This paper introduces Reinforced Unanswerability Learning (RUL), a novel hybrid training paradigm designed to imbue LLMs with the intrinsic capability to accurately detect unanswerable questions and generate reliably appropriate responses. Unlike conventional approaches that rely on external classifiers or simple prompting, RUL integrates a discriminative unanswerability prediction head with the LLM’s generative core, guided by a multi-stage learning strategy. This includes supervised fine-tuning on a novel, richly annotated dataset, Enhanced-CAsT-Answerability (ECA), which features hierarchical answerability labels and ground-truth refusal responses. Crucially, RUL incorporates a subsequent reinforcement learning with human feedback (RLHF) phase to refine the nuance, helpfulness, and informativeness of refusal responses. Extensive experiments demonstrate RUL’s superior performance, achieving significantly higher accuracy in unanswerability detection across sentence, paragraph, and ranking levels, and substantially increasing the generation of appropriate refusals for unanswerable queries, alongside strong performance on answerable questions. Human evaluations further corroborate RUL’s effectiveness, highlighting a marked improvement in perceived helpfulness and trustworthiness, ultimately paving the way for more reliable and user-centric conversational AI.
zh

[NLP-123] Un-considering Contextual Information: Assessing LLM s Understanding of Indexical Elements ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在指称消解任务中对指示代词(indexical)如“I”、“you”、“here”和“tomorrow”的理解能力问题,这些问题由于其语言特性带来了独特的挑战。解决方案的关键在于构建并发布首个针对英语指示代词的多选题数据集——English Indexical Dataset,并评估多个前沿LLMs(如GPT-4o、Claude 3.5 Sonnet、Gemini 1.5 Pro和DeepSeek V3)在该数据集上的表现,以揭示LLMs在处理不同指示代词时的性能差异及其对句法线索的依赖性。

链接: https://arxiv.org/abs/2506.01089
作者: Metehan Oguz,Yavuz Bakman,Duygu Nur Yaldiz
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performances in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexical like I, you, here and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit an impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g. quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: this https URL.
zh

[NLP-124] zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理特定领域或语言输入时,由于静态分词器(tokenizer)无法适应特定场景而导致的分词效率低、序列长度过长和计算成本高的问题。其解决方案的关键在于提出zip2zip框架,该框架通过在推理阶段动态调整词表,利用Lempel-Ziv-Welch (LZW) 压缩算法生成可复用的“超词”(hypertoken),并结合运行时嵌入层和因果语言建模变体,使模型能够高效处理压缩后的超词序列,从而显著减少输入和输出序列长度,并提升推理速度。

链接: https://arxiv.org/abs/2506.01084
作者: Saibo Geng,Nathan Ranchin,Yunzhen yao,Maxime Peyrard,Chris Wendler,Michael Gastpar,Robert West
机构: EPFL(瑞士联邦理工学院); Northeastern University(东北大学); Microsoft(微软); Université Grenoble Alpes(格勒诺布尔阿尔卑斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code will be released at this https URL

点击查看摘要

Abstract:Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers’ fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable “hypertokens” on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
zh

[NLP-125] How Programming Concepts and Neurons Are Shared in Code Language Models ACL

【速读】: 该论文试图解决多编程语言(PLs)在大型语言模型(LLM)概念空间中的关系问题,特别是其与英语的关联性。其解决方案的关键在于通过在21个PL对上执行少样本翻译任务,并解码中间层的嵌入表示,分析模型在不同PL间的概念空间分布及语言特异性神经元的激活模式。研究发现,模型的概念空间更接近英语,且在中间层的后半部分对英语标记赋予较高概率,同时揭示了语言特异性神经元在模型不同层级的分布规律。

链接: https://arxiv.org/abs/2506.01074
作者: Amir Hossein Kargaran,Yihong Liu,François Yvon,Hinrich Schütze
机构: LMU Munich & Munich Center for Machine Learning (慕尼黑路德维希-马克西米利安大学 & 慕尼黑机器学习中心); Sorbonne Université & CNRS, ISIR (索邦大学 & 法国国家科学研究中心,ISIR)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: ACL Findings 2025

点击查看摘要

Abstract:Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model’s concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model’s concept space. Code is available at this https URL.
zh

[NLP-126] SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

【速读】: 该论文试图解决在事实性问题中,基于网络搜索的增强语言模型(SEarch-Augmented Language models)在面对冲突、噪声或无帮助的搜索结果时所表现出的准确性和推理能力不足的问题。解决方案的关键在于构建SealQA基准测试,其包含三种类型:Seal-0和Seal-Hard用于评估事实准确性和推理能力,其中Seal-0聚焦于高难度问题;LongSeal则扩展了基准以测试“大海捞针”场景下的长上下文、多文档推理能力。通过该基准,研究揭示了当前先进大语言模型在不同场景下的局限性,并为未来研究提供了可复现的评估框架。

链接: https://arxiv.org/abs/2506.01062
作者: Thinh Pham,Nguyen Nguyen,Pratibha Zunjare,Weiyuan Chen,Yu-Min Tseng,Tu Vu
机构: Virginia Tech (弗吉尼亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. 22 pages, 7 figures, 11 tables

点击查看摘要

Abstract:We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in “needle-in-a-haystack” settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the “lost-in-the-middle” issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at this http URL.
zh

[NLP-127] Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution NEURIPS

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中提示注入(prompt injection)引发的工具调用代理(tool-calling agents)数据泄露问题,特别是针对更复杂的威胁如数据外泄(data exfiltration)缺乏深入研究的问题。其解决方案的关键在于通过构建基于数据流的攻击方法,并将其集成到AgentDojo基准测试中,同时创建一个更丰富的合成人类-人工智能银行对话数据集,以评估LLMs在任务执行过程中泄露个人数据的风险与防御效果。

链接: https://arxiv.org/abs/2506.01055
作者: Meysam Alizadeh,Zeynab Samei,Daria Stetsenko,Fabrizio Gilardi
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 25 pages, 18 figures, NeurIPS formatting style

点击查看摘要

Abstract:Previous benchmarks on prompt injection in large language models (LLMs) have primarily focused on generic tasks and attacks, offering limited insights into more complex threats like data exfiltration. This paper examines how prompt injection can cause tool-calling agents to leak personal data observed during task execution. Using a fictitious banking agent, we develop data flow-based attacks and integrate them into AgentDojo, a recent benchmark for agentic security. To enhance its scope, we also create a richer synthetic dataset of human-AI banking conversations. In 16 user tasks from AgentDojo, LLMs show a 15-50 percentage point drop in utility under attack, with average attack success rates (ASR) around 20 percent; some defenses reduce ASR to zero. Most LLMs, even when successfully tricked by the attack, avoid leaking highly sensitive data like passwords, likely due to safety alignments, but they remain vulnerable to disclosing other personal data. The likelihood of password leakage increases when a password is requested along with one or two additional personal details. In an extended evaluation across 48 tasks, the average ASR is around 15 percent, with no built-in AgentDojo defense fully preventing leakage. Tasks involving data extraction or authorization workflows, which closely resemble the structure of exfiltration attacks, exhibit the highest ASRs, highlighting the interaction between task type, agent performance, and defense efficacy.
zh

[NLP-128] CHEER-Ekman: Fine-grained Embodied Emotion Classification ACL2025

【速读】: 该论文试图解决文本中具身情绪(embodied emotion)识别的问题,即如何从文本中准确识别出与身体体验和生理反应相关的埃克曼六种基本情绪类别。其解决方案的关键在于利用大规模语言模型进行自动最佳-最差缩放(best-worst scaling),并通过简化提示指令和链式思维推理显著提升情绪识别的准确性,从而使小型模型也能达到与大型模型相当的性能。

链接: https://arxiv.org/abs/2506.01047
作者: Phan Anh Duong,Cat Luong,Divyesh Bommana,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman’s six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones.
zh

[NLP-129] Probing Neural Topology of Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中神经元功能协同激活机制不明确的问题,这一问题阻碍了对LLMs更深入的理解和安全开发。其解决方案的关键在于引入图探测(graph probing)方法,通过揭示神经元的功能连接拓扑结构,并将其与语言生成性能相关联,从而揭示LLMs内部机制的普遍规律。

链接: https://arxiv.org/abs/2506.01042
作者: Yu Zheng,Yuan Yuan,Yong Li,Paolo Santi
机构: Masssachusetts Institute of Technology (麻省理工学院); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at this https URL.
zh

[NLP-130] Less is More: Local Intrinsic Dimensions of Contextual Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)内部机制的理解问题,特别是训练和微调对模型行为的影响。其解决方案的关键在于引入一种基于上下文潜在嵌入几何特性的新视角,通过测量上下文语言模型潜在空间的局部维度,并分析其在训练和微调过程中的变化,从而揭示模型的训练动态和泛化能力。

链接: https://arxiv.org/abs/2506.01034
作者: Benjamin Matthias Ruppik,Julius von Rohrscheidt,Carel van Niekerk,Michael Heck,Renato Vukovic,Shutong Feng,Hsien-chin Lin,Nurul Lubis,Bastian Rieck,Marcus Zibrowius,Milica Gašić
机构: Heinrich Heine University Düsseldorf(海因里希·海涅大学杜塞尔多夫分校); Institute of AI for Health, Helmholtz Munich(人工智能健康研究所,赫尔姆霍兹慕尼黑); Technical University of Munich(慕尼黑工业大学); University of Fribourg(弗里堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, with an additional 13 pages of appendix

点击查看摘要

Abstract:Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model’s latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model’s training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model’s training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.
zh

[NLP-131] alking to Data: Designing Smart Assistants for Humanities Databases

【速读】: 该论文试图解决传统交互格式在人文科学研究数据库访问中的局限性,尤其是搜索和响应生成方法的不足。其解决方案的关键在于开发一种基于大语言模型(Large Language Model, LLM)的智能助手,采用聊天机器人形式,结合检索增强生成(Retrieval-Augmented Generation, RAG)方法,并集成混合搜索、自动查询生成、文本到SQL过滤、语义数据库搜索和超链接插入等先进技术,以实现自然语言与数字人文数据的高效交互。

链接: https://arxiv.org/abs/2506.00986
作者: Alexander Sergeev,Valeriya Goloviznina,Mikhail Melnichenko,Evgeny Kotelnikov
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for InterSys-2025 conference

点击查看摘要

Abstract:Access to humanities research databases is often hindered by the limitations of traditional interaction formats, particularly in the methods of searching and response generation. This study introduces an LLM-based smart assistant designed to facilitate natural language communication with digital humanities data. The assistant, developed in a chatbot format, leverages the RAG approach and integrates state-of-the-art technologies such as hybrid search, automatic query generation, text-to-SQL filtering, semantic database search, and hyperlink insertion. To evaluate the effectiveness of the system, experiments were conducted to assess the response quality of various language models. The testing was based on the Prozhito digital archive, which contains diary entries from predominantly Russian-speaking individuals who lived in the 20th century. The chatbot is tailored to support anthropology and history researchers, as well as non-specialist users with an interest in the field, without requiring prior technical training. By enabling researchers to query complex databases with natural language, this tool aims to enhance accessibility and efficiency in humanities research. The study highlights the potential of Large Language Models to transform the way researchers and the public interact with digital archives, making them more intuitive and inclusive. Additional materials are presented in GitHub repository: this https URL.
zh

[NLP-132] Do LLM s Understand Why We Write Diaries? A Method for Purpose Extraction and Clustering

【速读】: 该论文试图解决日记分析中从大规模语料库中提取有意义信息的挑战,传统方法在这一任务中往往无法取得满意结果。其解决方案的关键在于引入基于大型语言模型(Large Language Models, LLMs)的新方法,用于识别和聚类日记写作的不同目的,如记录生活事件、自我反思或练习语言技能。该方法在苏联时期(1922-1929年)的日记数据集上进行了验证,并通过对比不同专有和开源LLMs的表现,确定了GPT-4o和o1-mini在任务中的最优性能。

链接: https://arxiv.org/abs/2506.00985
作者: Valeriya Goloviznina,Alexander Sergeev,Mikhail Melnichenko,Evgeny Kotelnikov
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for CompLing-2025 conference

点击查看摘要

Abstract:Diary analysis presents challenges, particularly in extracting meaningful information from large corpora, where traditional methods often fail to deliver satisfactory results. This study introduces a novel method based on Large Language Models (LLMs) to identify and cluster the various purposes of diary writing. By “purposes,” we refer to the intentions behind diary writing, such as documenting life events, self-reflection, or practicing language skills. Our approach is applied to Soviet-era diaries (1922-1929) from the Prozhito digital archive, a rich collection of personal narratives. We evaluate different proprietary and open-source LLMs, finding that GPT-4o and o1-mini achieve the best performance, while a template-based baseline is significantly less effective. Additionally, we analyze the retrieved purposes based on gender, age of the authors, and the year of writing. Furthermore, we examine the types of errors made by the models, providing a deeper understanding of their limitations and potential areas for improvement in future research.
zh

[NLP-133] Bridging the Gap: From Ad-hoc to Proactive Search in Conversations SIGIR2025

【速读】: 该论文旨在解决主动搜索在对话中(Proactive Search in Conversations, PSC)的检索质量受限问题,其核心挑战在于传统即兴检索器(ad-hoc retrievers)的输入假设与PSC场景下的对话上下文存在不匹配,导致检索效果不佳。解决方案的关键在于提出Conv2Query框架,该框架通过将对话上下文映射为即兴查询(ad-hoc queries),有效弥合了即兴检索与PSC之间的输入差异,从而提升检索性能。

链接: https://arxiv.org/abs/2506.00983
作者: Chuan Meng,Francesco Tonolini,Fengran Mo,Nikolaos Aletras,Emine Yilmaz,Gabriella Kazai
机构: University of Amsterdam(阿姆斯特丹大学); Amazon(亚马逊); Université de Montréal(蒙特利尔大学); University of Sheffield(谢菲尔德大学); University College London(伦敦大学学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted as a full paper at SIGIR 2025

点击查看摘要

Abstract:Proactive search in conversations (PSC) aims to reduce user effort in formulating explicit queries by proactively retrieving useful relevant information given conversational context. Previous work in PSC either directly uses this context as input to off-the-shelf ad-hoc retrievers or further fine-tunes them on PSC data. However, ad-hoc retrievers are pre-trained on short and concise queries, while the PSC input is longer and noisier. This input mismatch between ad-hoc search and PSC limits retrieval quality. While fine-tuning on PSC data helps, its benefits remain constrained by this input gap. In this work, we propose Conv2Query, a novel conversation-to-query framework that adapts ad-hoc retrievers to PSC by bridging the input gap between ad-hoc search and PSC. Conv2Query maps conversational context into ad-hoc queries, which can either be used as input for off-the-shelf ad-hoc retrievers or for further fine-tuning on PSC data. Extensive experiments on two PSC datasets show that Conv2Query significantly improves ad-hoc retrievers’ performance, both when used directly and after fine-tuning on PSC.
zh

[NLP-134] What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training INTERSPEECH2025

【速读】: 该论文试图解决自监督模型在语音表示学习中是否具有语言特异性的问题,即预训练特定语言是否能提升对语言特异性语言特征的表征能力。其解决方案的关键在于通过对比仅在荷兰语(Dutch)上预训练的Wav2Vec2模型与在英语或大量多语言数据上预训练的模型,评估其在内部表示中编码荷兰语语音和词汇信息的能力,结果表明仅在荷兰语上预训练的模型在表征语言特异性特征方面更具优势。

链接: https://arxiv.org/abs/2506.00981
作者: Marianne de Heer Kloots,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema,Martijn Bentum
机构: Institute for Logic, Language and Computation (逻辑、语言和计算研究所); Cognitive Science and Artificial Intelligence (认知科学与人工智能); Centre for Language Studies (语言研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025. For model, code, and materials, see this https URL

点击查看摘要

Abstract:How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it’s less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
zh

[NLP-135] LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World ACL2025

【速读】: 该论文旨在解决多语言冲突事件数据的聚合与分析问题,特别是在全球范围内对不同语言和区域实体进行统一处理的挑战。其解决方案的关键在于引入抽象事件抽取(abstractive event extraction, AEE)及其子任务抽象实体链接(abstractive entity linking, AEL),通过全局文档理解检测事件参数和实体,并在多语言数据集中实现标准化。此方法不同于传统的基于跨度的事件抽取,能够更有效地处理多语言源数据,提升事件分析的准确性和泛化能力。

链接: https://arxiv.org/abs/2506.00980
作者: Sina J. Semnani,Pingyue Zhang,Wanyue Zhai,Haozhuo Li,Ryan Beauchamp,Trey Billing,Katayoun Kishi,Manling Li,Monica S. Lam
机构: Stanford University (斯坦福大学); Northwestern University (西北大学); ACLED (ACLED)
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2025

点击查看摘要

Abstract:This paper presents LEMONADE, a large-scale conflict event dataset comprising 39,786 events across 20 languages and 171 countries, with extensive coverage of region-specific entities. LEMONADE is based on a partially reannotated subset of the Armed Conflict Location Event Data (ACLED), which has documented global conflict events for over a decade. To address the challenge of aggregating multilingual sources for global event analysis, we introduce abstractive event extraction (AEE) and its subtask, abstractive entity linking (AEL). Unlike conventional span-based event extraction, our approach detects event arguments and entities through holistic document understanding and normalizes them across the multilingual dataset. We evaluate various large language models (LLMs) on these tasks, adapt existing zero-shot event extraction systems, and benchmark supervised models. Additionally, we introduce ZEST, a novel zero-shot retrieval-based system for AEL. Our best zero-shot system achieves an end-to-end F1 score of 58.3%, with LLMs outperforming specialized event extraction models such as GoLLIE. For entity linking, ZEST achieves an F1 score of 45.7%, significantly surpassing OneNet, a state-of-the-art zero-shot baseline that achieves only 23.7%. However, these zero-shot results lag behind the best supervised systems by 20.1% and 37.0% in the end-to-end and AEL tasks, respectively, highlighting the need for further research. Comments: Findings of ACL 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2506.00980 [cs.CL] (or arXiv:2506.00980v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.00980 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-136] NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

【速读】: 该论文旨在解决语音语言模型(Speech Language Models, SLMs)在自然流畅的口语交互中表现不足的问题,特别是如何充分利用双通道语音数据来提升对话能力。其解决方案的关键在于提出一种新的生成建模范式——下一词对预测(Next-Token-Pair Prediction, NTPP),首次在仅解码器架构中实现了独立于说话人的双通道口语对话学习,从而有效提升了对话的轮换预测、响应连贯性和自然度,并显著降低了推理延迟。

链接: https://arxiv.org/abs/2506.00975
作者: Qichao Wang,Ziqiao Meng,Wenqian Cui,Yifei Zhang,Pengcheng Wu,Bingzhe Wu,Irwin King,Liang Chen,Peilin Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.
zh

[NLP-137] XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content

【速读】: 该论文试图解决现有安全评估方法在评估大型语言模型(Large Language Models, LLMs)生成内容的风险时过于简化的问题,即仅使用二元标签(安全与不安全)而忽略了内容风险的复杂谱系。其解决方案的关键在于提出XGUARD,一个基准和评估框架,通过将模型响应划分为五个危险等级(0到4),实现对极端主义内容严重程度的更细致分析,并引入可解释的攻击严重度曲线(Attack Severity Curve, ASC)来可视化模型漏洞并比较不同防御机制的有效性。

链接: https://arxiv.org/abs/2506.00973
作者: Vadivel Abishethvarman,Bhavik Chandna,Pratik Jalan,Usman Naseem
机构: Sabaragamuwa University of Sri Lanka (斯里兰卡萨巴拉加穆瓦大学); UC San Diego (加州大学圣地亚哥分校); Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large Language Models (LLMs) can generate content spanning ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe and unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs. XGUARD includes 3,840 red teaming prompts sourced from real world data such as social media and news, covering a broad range of ideologically charged scenarios. Our framework categorizes model responses into five danger levels (0 to 4), enabling a more nuanced analysis of both the frequency and severity of failures. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate six popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs.
zh

[NLP-138] ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在企业数据管理中处理敏感信息时的隐私与访问控制问题,特别是在面对员工查询时如何遵循预定义的访问权限规则。解决方案的关键在于提出敏感性感知(Sensitivity Awareness, SA)的概念,使LLMs能够在响应用户查询时自动识别并遵守信息的敏感性等级,从而在保障隐私的同时有效处理合法请求。

链接: https://arxiv.org/abs/2506.00964
作者: Dren Fazlija,Arkadij Orlov,Sandipan Sikdar
机构: L3S Research Center, Leibniz University Hannover (L3S研究中心,莱布尼茨汉诺威大学); E.ON Grid Solutions (E.ON电网解决方案)
类目: Computation and Language (cs.CL)
备注: 20 pages, 4 figures, 8 tables, ACL 2025 (Findings)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly becoming valuable to corporate data management due to their ability to process text from various document formats and facilitate user interactions through natural language queries. However, LLMs must consider the sensitivity of information when communicating with employees, especially given access restrictions. Simple filtering based on user clearance levels can pose both performance and privacy challenges. To address this, we propose the concept of sensitivity awareness (SA), which enables LLMs to adhere to predefined access rights rules. In addition, we developed a benchmarking environment called ACCESS DENIED INC to evaluate SA. Our experimental findings reveal significant variations in model behavior, particularly in managing unauthorized data requests while effectively addressing legitimate queries. This work establishes a foundation for benchmarking sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.
zh

[NLP-139] From Objectives to Questions: A Planning -based Framework for Educational Mathematical Question Generation

【速读】: 该论文试图解决自动生成符合教育目标的高质量数学问题这一挑战,传统生成方法主要关注文本质量而忽视教育目标,并且仅能生成单一维度的简单问题,无法满足复杂多维的教育需求。其解决方案的关键在于构建了EduMath数据集,该数据集包含16k道具有多维教育目标的数学问题,并提出了Educational Question Planning with self-Reflection (EQPR)方法,该方法通过结合基于蒙特卡洛树搜索的规划算法与大语言模型的生成能力,实现问题的持续优化,从而确保生成的问题既符合教育语境又能有效达成特定基础教育目标。

链接: https://arxiv.org/abs/2506.00963
作者: Cheng Cheng,Zhenya Huang,Guanhao Zhao,Yuxiang Guo,Xin Lin,Jinze Wu,Xin Li,Shijin Wang
机构: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); iFLYTEK Research (科大讯飞研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatically generating high-quality mathematical problems that align with educational objectives is a crucial task in NLP-based educational technology. Traditional generation methods focus primarily on textual quality, but they often overlook educational objectives. Moreover, these methods address only single-dimensional, simple question generation, failing to meet complex, multifaceted educational requirements. To address these challenges, we constructed and annotated EduMath, a dataset of 16k mathematical questions with multi-dimensional educational objectives. Based on this dataset, we developed EQGEVAL, which incorporates three evaluation dimensions and is designed to assess the ability of models to generate educational questions. Drawing inspiration from teachers’ problem design processes, we propose the Educational Question Planning with self-Reflection (EQPR) method for educational mathematical question generation, following a “plan-evaluate-optimize” approach. Specifically, by combining planning algorithm based on Monte Carlo Tree Search with the generative capabilities of Large Language Models, we continuously optimize questions through iterative feedback. This self-optimization mechanism ensures that the generated questions both fit the educational context and strategically achieve specific basic educational objectives. Through extensive experiments based on EQGEVAL, we have demonstrated that EQPR achieves significant improvements in generating questions that meet multi-dimensional educational objectives.
zh

[NLP-140] Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues ACL2025

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)在对话中无法有效整合非语言元素(如手势、面部表情和肢体语言)的问题,从而限制了其创建完全沉浸式对话体验的能力。解决方案的关键在于引入VENUS,这是一个大规模的多模态数据集,包含时间对齐的文本、面部表情和肢体语言标注视频,通过该数据集训练MARS模型,使其能够在统一框架内实现文本与向量量化非语言表示的结合,从而实现多模态理解和生成。

链接: https://arxiv.org/abs/2506.00958
作者: Youngmin Kim,Jiwan Chung,Jisoo Kim,Sunghyun Lee,Sangkyu Lee,Junhyeok Kim,Cheoljong Yang,Youngjae Yu
机构: Yonsei University (延世大学); NC Research, NCSOFT Corporation (NC研究院,NCSOFT公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL 2025 (Main), Our code and dataset: this https URL

点击查看摘要

Abstract:Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS datasets, we validate its substantial scale and high effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal languages, corresponding to conversational input.
zh

[NLP-141] Leverag ing Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection INTERSPEECH2025

【速读】: 该论文试图解决语音中讽刺检测的挑战,特别是由于数据稀缺以及现有检测系统依赖多模态数据导致在仅语音可用场景下适用性受限的问题。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的标注流程,利用GPT-4o和LLaMA 3进行初始讽刺标注,并通过人工验证确保标注质量,最终构建了一个大规模的讽刺语音数据集PodSarc,该数据集在检测任务中达到了73.63%的F1分数,展现出其作为讽刺检测研究基准的潜力。

链接: https://arxiv.org/abs/2506.00955
作者: Zhu Li,Yuqing Zhang,Xiyuan Gao,Shekhar Nayak,Matt Coler
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset. Using a publicly available sarcasm-focused podcast, we employ GPT-4o and LLaMA 3 for initial sarcasm annotations, followed by human verification to resolve disagreements. We validate this approach by comparing annotation quality and detection performance on a publicly available sarcasm dataset using a collaborative gating architecture. Finally, we introduce PodSarc, a large-scale sarcastic speech dataset created through this pipeline. The detection model achieves a 73.63% F1 score, demonstrating the dataset’s potential as a benchmark for sarcasm detection research.
zh

[NLP-142] anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding

【速读】: 该论文试图解决现有基于心电图(ECG)的多模态大语言模型(MLLM)在任务范围和输入灵活性方面的局限性,特别是其主要专注于单导联、短时长(10秒)心电图的报告生成任务,未能充分发挥MLLM的潜力。解决方案的关键在于构建一个名为anyECG的数据集,该数据集涵盖了多种任务,包括报告生成、异常波形定位和开放性问答,并引入了长时程低导联心电图以及多张心电图对比分析等临床常见场景。此外,提出了anyECG-chat模型,支持动态长度和多张心电图输入,并采用三阶段课程训练策略进行训练,从而提升了模型在多种实际应用场景中的适用性。

链接: https://arxiv.org/abs/2506.00942
作者: Haitao Li,Ziyu Li,Yiheng Mao,Ziyi Liu,Zhoujian Sun,Zhengxing Huang
机构: Zhejiang University(浙江大学); Transtek Medical Electronics Co., Ltd.(TransTek医疗电子有限公司); Zhejiang Lab(浙江省实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs primarily focus on report generation tasks, often limited to single 12-lead, short-duration (10s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop a MLLM for ECG analysis that supports a broader range of tasks and more flexible ECG inputs. However, existing ECG-QA datasets are often monotonous. To address this gap, we first constructed the anyECG dataset, which encompasses a wide variety of tasks, including report generation, abnormal waveform localization, and open-ended question answering. In addition to standard hospital ECGs, we introduced long-duration reduced-lead ECGs for home environments and multiple ECG comparison scenarios commonly encountered in clinical practice. Furthermore, we propose the anyECG-chat model, which supports dynamic-length ECG inputs and multiple ECG inputs. We trained the model using a three-stage curriculum training recipe with the anyECG dataset. A comprehensive evaluation was conducted, demonstrating that anyECG-chat is capable of supporting various practical application scenarios, including not only common report generation tasks but also abnormal waveform localization for long-duration reduced-lead ECGs in home environments and comprehensive comparative analysis of multiple ECGs.
zh

[NLP-143] Aligning VLM Assistants with Personalized Situated Cognition ACL2025

【速读】: 该论文旨在解决如何将视觉-语言模型(Vision-Language Models, VLMs)对齐到个体的个性化情境认知(personalized situated cognition)以实现更有效的现实世界辅助问题。其关键在于通过角色集(Role-Set)概念对个体进行表征,并构建一个基于认知感知和行为的奖励模型,以评估和实现个性化对齐。为此,作者提出了PCogAlign框架,并构建了基准测试PCogAlignBench,以验证方法的有效性。

链接: https://arxiv.org/abs/2506.00930
作者: Yongqi Li,Shen Zhou,Xiaohu Li,Xin Miao,Jintao Wen,Mayi Xu,Jianhao Chen,Birong Pan,Hankun Kang,Yuanyuan Zhu,Ming Zhong,Tieyun Qian
机构: Wuhan University (武汉大学); Zhongguancun Academy (中关村研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2025 (main), camera-ready version

点击查看摘要

Abstract:Vision-language models (VLMs) aligned with general human objectives, such as being harmless and hallucination-free, have become valuable assistants of humans in managing visual tasks. However, people with diversified backgrounds have different cognition even in the same situation. Consequently, they may have personalized expectations for VLM assistants. This highlights the urgent need to align VLM assistants with personalized situated cognition for real-world assistance. To study this problem, we first simplify it by characterizing individuals based on the sociological concept of Role-Set. Then, we propose to evaluate the individuals’ actions to examine whether the personalized alignment is achieved. Further, we construct a benchmark named PCogAlignBench, which includes 18k instances and 20 individuals with different Role-Sets. Finally, we present a framework called PCogAlign, which constructs a cognition-aware and action-based reward model for personalized alignment. Experimental results and human evaluations demonstrate the reliability of the PCogAlignBench and the effectiveness of our proposed PCogAlign. We will open-source the constructed benchmark and code at this https URL.
zh

[NLP-144] Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times

【速读】: 该论文试图解决视频-语言模型(VLMs)在时间推理任务中对动作完成性(perfectivity)和持续性(durativity)理解不足的问题,即模型是否真正理解时间动态而非仅依赖表面线索。解决方案的关键在于构建一个四语种(英语、意大利语、俄语和日语)的多项选择问答基准——\textbf{Perfect Times数据集》,通过将日常活动视频与事件完成标签及针对完成性设计的干扰项进行配对,以评估模型在时间推理上的表现。该数据集旨在推动模型对动作持续时间和完成状态的深层次多模态理解。

链接: https://arxiv.org/abs/2506.00928
作者: Olga Loginova,Sofía Ortega Loguinova
机构: University of Trento (特伦托大学); Maastricht University (马斯特里赫特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human perception of events is intrinsically tied to distinguishing between completed (perfect and telic) and ongoing (durative) actions, a process mediated by both linguistic structure and visual cues. In this work, we introduce the \textbfPerfect Times dataset, a novel, quadrilingual (English, Italian, Russian, and Japanese) multiple-choice question-answering benchmark designed to assess video-language models (VLMs) on temporal reasoning. By pairing everyday activity videos with event completion labels and perfectivity-tailored distractors, our dataset probes whether models truly comprehend temporal dynamics or merely latch onto superficial markers. Experimental results indicate that state-of-the-art models, despite their success on text-based tasks, struggle to mirror human-like temporal and causal reasoning grounded in video. This study underscores the necessity of integrating deep multimodal cues to capture the nuances of action duration and completion within temporal and causal video dynamics, setting a new standard for evaluating and advancing temporal reasoning in VLMs.
zh

[NLP-145] Position as Probability: Self-Supervised Transformers that Think Past Their Training for Length Extrapolation

【速读】: 该论文试图解决深度序列模型在测试序列长度显著超过训练长度时准确率下降的问题,这一问题限制了其在需要强大长度外推能力的任务(如算法推理、多步骤算术和组合泛化)中的应用。解决方案的关键在于提出PRISM(Probabilistic Relative-position Implicit Superposition Model),这是一种新型的位置编码机制,通过可微分直方图滤波更新学习连续相对位置,并利用概率叠加而非传统确定性嵌入来保留位置不确定性,从而实现高达10倍于训练长度的准确外推。

链接: https://arxiv.org/abs/2506.00920
作者: Philip Heejun Lee
机构: Xenon Labs LLC(塞壬实验室有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: Note: v1: working paper; code, additional baselines, ablations, will follow in v2

点击查看摘要

Abstract:Deep sequence models typically degrade in accuracy when test sequences significantly exceed their training lengths, yet many critical tasks–such as algorithmic reasoning, multi-step arithmetic, and compositional generalization–require robust length extrapolation. We introduce PRISM, a Probabilistic Relative-position Implicit Superposition Model, a novel positional encoding mechanism that enables Transformers to extrapolate accurately up to 10x beyond their training length. PRISM learns continuous relative positions through a differentiable histogram-filter update, preserving position uncertainty via a probabilistic superposition rather than conventional deterministic embeddings. Empirically, PRISM achieves state-of-the-art length extrapolation, successfully generalizing to previously intractable sequence lengths across algorithmic benchmarks–including arithmetic (addition, multiplication), SCAN compositionality tasks, and complex copy variants derived from DeepMind’s recent datasets. Our analysis demonstrates that PRISM’s stochastic positional encoding maintains sharp and interpretable internal states, providing a theoretical basis for reliable length generalization. These results advance the goal of neural sequence models that remain algorithmically robust at lengths far exceeding their training horizon.
zh

[NLP-146] How do Transformer Embeddings Represent Compositions? A Functional Analysis

【速读】: 该论文试图解决的问题是:当前基于Transformer的模型在表示复合词时是否具备组合性(compositionality),以及它们的表示方式是否能够支持推理和泛化。论文通过测试Mistral、OpenAI Large和Google嵌入模型,并与BERT进行比较,分析不同模型在组合性方面的表现。解决方案的关键在于采用多种组合性模型(如加法、乘法、扩张、回归等)对嵌入表示进行评估,发现岭回归(ridge regression)虽为线性模型,但最能解释组合性,同时发现经典向量加法模型的表现几乎与其它模型相当,表明大多数嵌入模型具有较高的组合性,而BERT的组合性较差。

链接: https://arxiv.org/abs/2506.00914
作者: Aishik Nagar,Ishaan Singh Rawal,Mansi Dhanania,Cheston Tan
机构: ASUS Intelligent Cloud Services (ASUS智能云服务); Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (ASTAR) (高性能计算研究所(IHPC),科技研究局(ASTAR)); Center for Frontier AI Research (CFAR), Agency for Science, Technology and Research (ASTAR) (前沿人工智能研究中心(CFAR),科技研究局(ASTAR)); Texas A&M University (德克萨斯A&M大学); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compositionality is a key aspect of human intelligence, essential for reasoning and generalization. While transformer-based models have become the de facto standard for many language modeling tasks, little is known about how they represent compound words, and whether these representations are compositional. In this study, we test compositionality in Mistral, OpenAI Large, and Google embedding models, and compare them with BERT. First, we evaluate compositionality in the representations by examining six diverse models of compositionality (addition, multiplication, dilation, regression, etc.). We find that ridge regression, albeit linear, best accounts for compositionality. Surprisingly, we find that the classic vector addition model performs almost as well as any other model. Next, we verify that most embedding models are highly compositional, while BERT shows much poorer compositionality. We verify and visualize our findings with a synthetic dataset consisting of fully transparent adjective-noun compositions. Overall, we present a thorough investigation of compositionality.
zh

[NLP-147] Pi-SQL: Enhancing Text-to-SQL with Fine-Grained Guidance from Pivot Programming Languages

【速读】: 该论文旨在解决自然语言到SQL程序的转换问题,即文本到SQL(Text-to-SQL)任务,该任务使非专家用户能够与复杂数据库进行交互。现有基于提示的方法由于自然语言文本与低资源SQL程序之间的语义鸿沟,导致准确性受限。论文提出的解决方案关键在于引入高资源Python程序作为桥梁,即Pi-SQL,通过生成包含细粒度步骤指南的Python代码或注释,再根据Python程序生成SQL程序,从而提升执行准确性和效率。

链接: https://arxiv.org/abs/2506.00912
作者: Yongdong chi,Hanqing Wang,Zonghan Yang,Jian Yang,Xiao Yan,Yun Chen,Guanhua Chen
机构: Shanghai University of Finance and Economics (上海财经大学); Tsinghua University (清华大学); Beihang University (北京航空航天大学); Wuhan University (武汉大学); Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-SQL transforms the user queries from natural language to executable SQL programs, enabling non-experts to interact with complex databases. Existing prompt-based methods craft meticulous text guidelines and examples to facilitate SQL generation, but their accuracy is hindered by the large semantic gap between the texts and the low-resource SQL programs. In this work, we propose Pi-SQL, which incorporates the high-resource Python program as a pivot to bridge between the natural language query and SQL program. In particular, Pi-SQL first generates Python programs that provide fine-grained step-by-step guidelines in their code blocks or comments, and then produces an SQL program following the guidance of each Python this http URL final SQL program matches the reference Python program’s query results and, through selection from candidates generated by different strategies, achieves superior execution speed, with a reward-based valid efficiency score up to 4.55 higher than the best-performing this http URL experiments demonstrate the effectiveness of Pi-SQL, which improves the execution accuracy of the best-performing baseline by up to 3.20.
zh

[NLP-148] SocialEval: Evaluating Social Intelligence of Large Language Models ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在社会智能(Social Intelligence, SI)方面的评估问题,特别是LLMs与人类在社会互动中的表现差异。现有研究未能全面评估LLMs在目标导向的结果评价和人际能力的过程评价两方面的能力。解决方案的关键在于提出SocialEval,一个基于脚本的双语SI基准,通过人工构建叙事脚本来整合结果导向和过程导向的评估,每个脚本以世界树结构呈现,包含由人际能力驱动的情节线,从而提供LLMs在社会互动中行为的全面视角。

链接: https://arxiv.org/abs/2506.00900
作者: Jinfeng Zhou,Yuxuan Chen,Yihan Shi,Xuanming Zhang,Leqi Lei,Yi Feng,Zexuan Xiong,Miao Yan,Xunzhi Wang,Yaru Cao,Jianing Yin,Shuai Wang,Quanyu Dai,Zhenhua Dong,Hongning Wang,Minlie Huang
机构: The CoAI Group, DCST, Tsinghua University (清华大学计算机科学与技术系); Harvard University (哈佛大学); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Beijing Jiaotong University (北京交通大学); Peking University (北京大学); Nankai University (南开大学); Northwest Minzu University (西北民族大学); University of Pennsylvania (宾夕法尼亚大学); Huawei Noah’ Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注: ACL 2025, Repository: \url{ this https URL }

点击查看摘要

Abstract:LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs’ SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs’ formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.
zh

[NLP-149] CODEMENV: Benchmarking Large Language Models on Code Migration ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码迁移(code migration)任务中的有效性不足问题,即如何将代码适配到不同的运行环境。其解决方案的关键在于提出一个名为CODEMENV的基准测试集,该基准集包含922个示例,覆盖19个Python和Java包,并涵盖三个核心任务:识别与特定版本不兼容的函数、检测函数定义的变化以及适应目标环境的代码。通过该基准集,研究者能够更系统地评估LLMs在代码迁移场景中的性能。

链接: https://arxiv.org/abs/2506.00894
作者: Keyuan Cheng,Xudong Shen,Yihao Yang,Tengyue Wang,Yang Cao,Muhammad Asif Ali,Hanbin Wang,Lijie Hu,Di Wang
机构: Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; Peking University; South China University of Technology
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by ACL 2025 Findings

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs’ abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at this https URL.
zh

[NLP-150] Affordance Benchmark for MLLM s

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在环境 affordance(可及性)感知能力方面的不足,尤其是其在理解物体固有属性和动态、情境化 affordance 方面的局限性。解决方案的关键在于提出 A4Bench 基准测试框架,该框架从两个维度评估 MLLMs 的 affordance 感知能力:1)构成性 affordance,通过 1,282 个跨九个子学科的问答对评估对象固有属性的理解;2)转化性 affordance,通过 718 个具有挑战性的问答对探究动态和情境化因素(如误导性、时间依赖性、文化或个体特定的 affordance)。此基准为评估和提升 MLLMs 的环境理解能力提供了重要基础。

链接: https://arxiv.org/abs/2506.00893
作者: Junying Wang,Wenzhe Li,Yalun Wu,Yingji Liang,Yijin Guo,Chunyi Li,Haodong Duan,Zicheng Zhang,Guangtao Zhai
机构: Fudan University (复旦大学); Shanghai AI Lab (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); Shanghai China (上海市)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Affordance theory posits that environments inherently offer action possibilities that shape perception and behavior. While Multimodal Large Language Models (MLLMs) excel in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. Evaluating 17 MLLMs (nine proprietary and eight open-source) against human performance, we find that proprietary models generally outperform open-source counterparts, but all exhibit limited capabilities, particularly in transformative affordance perception. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions. The dataset is available in this https URL.
zh

[NLP-151] Improve MLLM Benchmark Efficiency through Interview

【速读】: 该论文试图解决在大规模数据上进行全覆盖问答测试时资源消耗大且耗时的问题。解决方案的关键在于提出一种名为MLLM Interview (MITV)的策略,通过较少的问题快速获取MLLM的性能指标,具体包括构建带有难度标签的访谈数据集,并通过少量主题的提问初步评估模型性能,随后持续测试模型的极限。

链接: https://arxiv.org/abs/2506.00883
作者: Farong Wen,Yijin Guo,Junying Wang,Jiaohao Xiao,Yingjie Zhou,Chunyi Li,Zicheng Zhang,Guangtao Zhai
机构: Shanghai Jiao Tong University(上海交通大学); Shanghai AI Lab(上海人工智能实验室); Fudan University(复旦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid development of Multimodal Large Language Models (MLLM) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up in order to assess MLLM abilities. However, full-coverage QA testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by quizzing fewer question. First, First, we constructed the interview dataset, which was built on an existing MLLM assessment dataset, by adding difficulty labels based on the performance of some typical MLLMs in this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial performance situation of the large model by quizzing a small number of topics and then continuously tries to test the model’s limits. Through extensive experiments, the result shows that the MITV strategy proposed in this paper performs well on MLLM benchmark datasets, and it is able to obtain the model evaluation capability faster through a small number of questions and answers.
zh

[NLP-152] Not Every Token Needs Forgetting: Selective Unlearning to Limit Change in Utility in Large Language Model Unlearning

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)在去除不需要信息时存在的过度遗忘问题,即传统方法 indiscriminately 更新模型参数以遗忘目标文档中的所有标记,包括那些承载通用知识的常见标记(如代词、介词、普通名词)。解决方案的关键在于提出选择性遗忘(Selective Unlearning, SU),该方法识别遗忘集中与不需要信息相关的关键标记子集,并仅对这些标记进行遗忘,从而在有效移除目标遗忘数据的同时显著保留模型在保留集中的性能。

链接: https://arxiv.org/abs/2506.00876
作者: Yixin Wan,Anil Ramakrishna,Kai-Wei Chang,Volkan Cevher,Rahul Gupta
机构: University of California, Los Angeles (加州大学洛杉矶分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) unlearning has recently gained significant attention, driven by the need to remove unwanted information, such as private, sensitive, or copyrighted content, from LLMs. However, conventional unlearning approaches indiscriminately update model parameters to forget all tokens in a target document, including common tokens (e.g., pronouns, prepositions, general nouns) that carry general knowledge. In this paper, we highlight that not every token needs forgetting. We propose Selective Unlearning (SU), which identifies a critical subset of tokens within the forgetting set that is relevant to the unwanted information, and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms demonstrate that SU not only achieves effective unlearning on the targeted forget data, but also significantly preserves the model’s utility in the retaining set.
zh

[NLP-153] CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning ACL2025

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在多语言能力上存在的不平衡问题,这一问题主要源于其以英语为中心的训练语料。论文提出的解决方案是CC-Tuning,其关键在于在潜在层面上显式建立跨语言连接机制。通过在训练过程中融合英语和非英语输入的前馈激活,并利用可训练的决策器识别有益激活,从而实现对多种语言资源的协同利用;在推理阶段,则通过变换矩阵在单语设置下模拟跨语言连接,提升了模型的多语言性能。

链接: https://arxiv.org/abs/2506.00875
作者: Yangfan Ye,Xiaocheng Feng,Zekun Yuan,Xiachong Feng,Libo Qin,Lei Huang,Weitao Ma,Yichong Huang,Zhirui Zhang,Yunfei Lu,Xiaohui Yan,Duyu Tang,Dandan Tu,Bing Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL2025 main conference, long paper

点击查看摘要

Abstract:Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data-level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated with a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.
zh

[NLP-154] owards Predicting Any Human Trajectory In Context

【速读】: 该论文旨在解决行人轨迹预测中因环境和领域差异导致的适应性不足问题,特别是在边缘设备上无法进行大规模微调的情况下。其解决方案的关键在于提出一种基于上下文学习(In-Context Learning, ICL)的框架TrajICL,通过时空相似性示例选择(Spatio-Temporal Similarity-based Example Selection, STES)和预测引导的示例选择(Prediction-Guided Example Selection, PG-ES)方法,实现无需微调即可快速适应新场景的轨迹预测。

链接: https://arxiv.org/abs/2506.00871
作者: Ryo Fujii,Hideo Saito,Ryo Hachiuma
机构: Keio University (庆应义塾大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, this process is often impractical on edge devices due to constrained computational resources. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables rapid adaptation without fine-tuning on the scenario-specific data. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. The code will be released at this https URL.
zh

[NLP-155] Whats Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

【速读】: 该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在理解与推理视觉输入中的因果关系方面能力不足的问题,这一问题限制了其在复杂高阶推理任务中的表现。现有基准测试中常混合多种推理问题,导致VLMs可能通过物体识别和活动识别等捷径获得正确答案,难以真实评估其因果推理能力。解决方案的关键在于引入两个新的基准测试——VQA-Causal和VCR-Causal,旨在隔离并严格评估VLMs的因果推理能力。研究发现,尽管VLMs在物体和活动识别上表现优异,但在因果推理任务中表现较差,这主要归因于训练数据集中缺乏明确的因果表达。通过引入硬负样本微调策略,研究显示针对性的微调可以在保持模型泛化能力和下游任务性能的同时提升其因果推理能力。

链接: https://arxiv.org/abs/2506.00869
作者: Zhaotian Weng,Haoxuan Li,Kuan-Hao Huang,Jieyu Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs’ causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further analysis suggests that this limitation stems from a severe lack of causal expressions in widely used training datasets, where causal relationships are rarely explicitly conveyed. We additionally explore fine-tuning strategies with hard negative cases, showing that targeted fine-tuning can improve model’s causal reasoning while maintaining generalization and downstream performance. Our study highlights a key gap in current VLMs and lays the groundwork for future work on causal understanding.
zh

[NLP-156] L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models

【速读】: 该论文旨在解决低资源语言(如马拉地语)中情感识别的挑战,尤其是由于标注数据有限而导致的性能瓶颈。其关键解决方案是构建一个高质量的马拉地语情感识别数据集L3Cube-MahaEmotions,其中训练数据通过大型语言模型(LLMs)进行合成标注,而验证和测试集则由人工标注以作为可靠的基准。此外,研究采用Chain-of-Translation (CoTR)提示技术,将马拉地语句子翻译为英语并通过单提示进行情感标注,从而提升模型在低资源场景下的泛化能力。

链接: https://arxiv.org/abs/2506.00863
作者: Nidhi Kowtal,Raviraj Joshi
机构: Pune Institute of Computer Technology, Pune, Maharashtra India; Indian Institute of Technology Madras, Chennai, Tamil Nadu India; L3Cube Labs, Pune
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and emotion labeled via a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at this https URL
zh

[NLP-157] How Bidirectionality Helps Language Models Learn Better via Dynamic Bottleneck Estimation

【速读】: 该论文试图解决双向语言模型在自然语言理解任务中表现优于单向模型的理论原因不明确的问题,其核心在于揭示双向架构在信息编码与压缩方面的优势。解决方案的关键在于提出FlowNIB方法,这是一种动态且可扩展的互信息估计技术,克服了传统信息瓶颈(Information Bottleneck, IB)方法在计算不可行性和固定权衡调度上的局限性。通过理论分析与实验验证,作者证明了双向模型保留了更多的互信息并具有更高的有效维度,从而为双向架构的有效性提供了原理性解释。

链接: https://arxiv.org/abs/2506.00859
作者: Md Kowsher,Nusrat Jahan Prottasha,Shiyun Xu,Shetu Mohanto,Chen Chen,Niloofar Yousefi,Ozlem Garibay
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Bidirectional language models have better context understanding and perform better than unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational complexity and prove that bidirectional representations are strictly more informative under mild conditions. We further validate our findings through extensive experiments across multiple models and tasks using FlowNIB, revealing how information is encoded and compressed throughout training. Together, our work provides a principled explanation for the effectiveness of bidirectional architectures and introduces a practical tool for analyzing information flow in deep language models.
zh

[NLP-158] EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

【速读】: 该论文试图解决从脑电图(EEG)信号中生成自然语言文本的问题,特别是针对中文的非语音跨模态语言解码。其解决方案的关键在于构建一个基于生物合理EEG编码器(NICE-EEG)和轻量级预训练语言模型(MiniLM)的框架,通过掩码预训练和对比学习将多通道脑信号与自然语言表示对齐,从而实现零样本条件下的逐字嵌入生成和完整句子预测。

链接: https://arxiv.org/abs/2506.00854
作者: Jacky Tai-Yu Lu,Jung Chiang,Chi-Sheng Chen,Anna Nai-Yun Tung,Hsiang Wei Hu,Yuan Chiao Cheng
机构: International Academia of Biomedical Innovation Technology (国际生物医学创新技术学术院); National Taiwan University (台湾大学); Neuro Industry Research (神经产业研究); University of Southern California (南加州大学); Industrial Technology Research Institute (工业技术研究院); Taiwan Artificial Intelligence Association (台湾人工智能协会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.
zh

[NLP-159] Generalizable LLM Learning of Graph Synthetic Data with Reinforcement Learning

【速读】: 该论文试图解决如何将合成图数据中的学习能力泛化到具有隐式图结构的真实世界任务中,而不仅仅是依赖于监督微调提升图算法问题的求解能力。其解决方案的关键在于利用强化学习(Reinforcement Learning, RL)来解锁合成图数据的可泛化学习,通过设计基于解的奖励和基于过程的奖励,使大语言模型(Large Language Models, LLMs)更好地掌握图推理的本质,从而缓解过拟合问题。

链接: https://arxiv.org/abs/2506.00845
作者: Yizhuo Zhang,Heng Wang,Shangbin Feng,Zhaoxuan Tan,Xinyun Liu,Yulia Tsvetkov
机构: Google(谷歌)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 3 tables. Experimental code and results are publicly available at this https URL

点击查看摘要

Abstract:Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While these led to specialized LLMs better at solving graph algorithm problems, we don’t need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable learning of graph synthetic data with reinforcement learning. We first design solution-based and process-based rewards for synthetic graph problems: instead of rigid memorizing response patterns in direct fine-tuning, we posit that RL would help LLMs grasp the essentials underlying graph reasoning and alleviate overfitting. We employ RL algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on synthetic graph data. We then compare them against existing settings on both in-domain synthetic tasks and out-of-domain real-world tasks with implicit graph structures such as multi-hop QA, structured planning, and more. Extensive experiments demonstrate that our RL recipe leads to statistically significant improvement on 5 datasets, with an average gain of 12.9% over baseline settings. Further analysis reveals that process-based rewards consistently outperform solution-based rewards, mixing synthetic and real-world task data yields potential gains, while compositionality and explainable intermediate steps remains a critical challenge even after RL.
zh

[NLP-160] oward Structured Knowledge Reasoning : Contrastive Retrieval-Augmented Generation on Experience ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理结构化数据(如表格和数据库)时表现不佳的问题,主要挑战包括预训练阶段的暴露不足以及文本到结构的刚性转换机制。解决方案的关键在于引入对比检索增强生成框架(Contrastive Retrieval-Augmented Generation on Experience, CoRE),通过构建经验记忆表示并利用对比上下文学习(contrastive In-Context Learning, ICL)来模拟人类知识迁移,从而提升模型对隐式关系的推理能力。

链接: https://arxiv.org/abs/2506.00842
作者: Jiawei Gu,Ziting Xian,Yuanzhen Xie,Ye Liu,Enjie Liu,Ruichao Zhong,Mochi Gao,Yunzhi Tan,Bo Hu,Zang Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.
zh

[NLP-161] COMPKE: Complex Question Answering under Knowledge Editing ACL2025

【速读】: 该论文试图解决现有知识编辑(Knowledge Editing)评估基准在真实场景下无法有效衡量模型应用更新后知识能力的问题,尤其是在涉及复杂推理、一对多关系或多步骤逻辑交叉的情况下。解决方案的关键在于引入一个新的基准测试集COMPKE(Complex Question Answering under Knowledge Editing),该基准包含11,924个反映现实情境的复杂问题,从而更全面地评估知识编辑方法在实际应用中的效果。

链接: https://arxiv.org/abs/2506.00829
作者: Keyuan Cheng,Zijian Kan,Zhixian He,Zhuoran Zhang,Muhammad Asif Ali,Ke Xu,Lijie Hu,Di Wang
机构: Peking University; South China University of Technology; Sun Yat-sen University; Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ACL 2025 Findings

点击查看摘要

Abstract:Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning, involving one-to-many relationships or multi-step logical intersections. To fill in this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at this https URL.
zh

[NLP-162] HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs

【速读】: 该论文旨在解决多模态知识图谱(Multimodal Knowledge Graphs, MMKGs)中事实缺失的问题,即多模态知识图谱补全(MMKGC)问题。现有方法在封闭世界假设下仅利用MMKG内部信息并采用判别性训练目标,限制了其推理能力。论文提出的解决方案关键在于HERGC框架,该框架通过异质专家表示检索器融合多模态信息并生成候选集,再结合微调的生成式大语言模型(LLM)预测器从候选集中准确识别正确答案,从而提升MMKGC的性能。

链接: https://arxiv.org/abs/2506.00826
作者: Yongkang Xiao,Rui Zhang
机构: University of Minnesota (明尼苏达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multi-modal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically leverage only the information contained in the MMKGs under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent generative completion approaches powered by advanced large language models (LLMs) have shown strong reasoning abilities in unimodal knowledge graph completion, but their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor fine-tuned on minimal instruction data to accurately identify the correct answer from these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC’s effectiveness and robustness, achieving state-of-the-art performance.
zh

[NLP-163] Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLM s Across Logical Transformations and Question Answering Tasks ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中“真理方向”(truth direction)的普遍性、探测方法的有效性以及其在不同情境下的泛化能力等问题。其解决方案的关键在于通过实验证明,尽管并非所有LLMs都表现出一致的真理方向,但在更强大的模型中,尤其是涉及逻辑否定的情境下,真理方向的表征更为显著;同时,研究还表明基于陈述性原子语句训练的真理探测器能够有效泛化到逻辑变换、问答任务、上下文学习和外部知识源等场景,从而为提升用户对LLM输出的信任提供了可行的实践路径。

链接: https://arxiv.org/abs/2506.00823
作者: Yuntai Bao,Xuhong Zhang,Tianyu Du,Xinkui Zhao,Zhengwen Feng,Hao Peng,Jianwei Yin
机构: Zhejiang University (浙江大学); Zhejiang Normal University (浙江师范大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 16 figures; accepted to Findings of ACL 2025

点击查看摘要

Abstract:Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the “truth direction”, which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at this https URL
zh

[NLP-164] One for All: Update Parameterized Knowledge Across Multiple Models ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识更新方面的挑战,即模型难以保持最新信息,导致错误和幻觉问题。现有的知识编辑方法主要针对单个模型,难以高效地更新多个模型或适应新模型。论文提出的解决方案是OnceEdit,其关键在于采用基于集成学习的方法,通过引入一个插件模型作为编辑模块,实现跨多个模型的稳定知识更新。OnceEdit通过动态权重机制和集成增强机制,有效区分编辑相关与非编辑相关实例,并减少对中心模型的过度依赖,从而提升多模型场景下的编辑效果与稳定性。

链接: https://arxiv.org/abs/2506.00817
作者: Weitao Ma,Xiyuan Du,Xiaocheng Feng,Lei Huang,Yichong Huang,Huiyi Zhang,Xiaoliang Yang,Baohang Li,Xiachong Feng,Ting Liu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 (Main Conference)

点击查看摘要

Abstract:Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, posing challenges in efficiently updating multiple models and adapting to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a \weight token for distinguishing between edit-related and non-edit-related instances, ensuring the appropriate utilization of knowledge from integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios. Our code will be available.
zh

[NLP-165] From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses

【速读】: 该论文试图解决在低资源、形态丰富的语言(如梵语)中实现结构化诗歌生成的问题。其关键解决方案是构建一个用于将英语散文翻译成符合古典韵律模式(特别是Anushtub韵律)的梵语诗句的数据集,并通过约束解码策略和基于指令的微调方法提升生成模型在韵律准确性和语义风格一致性上的表现。

链接: https://arxiv.org/abs/2506.00815
作者: Manoj Balaji Jagadeeshan,Samarth Bhatia,Pretam Ray,Harshul Raj Surana,Akhil Rajeev P,Priya Mishra,Annarao Kulkarni,Ganesh Ramakrishnan,Prathosh AP,Pawan Goyal
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have significantly improved natural language generation, including creative tasks like poetry composition. However, most progress remains concentrated in high-resource languages. This raises an important question: Can LLMs be adapted for structured poetic generation in a low-resource, morphologically rich language such as Sanskrit? In this work, we introduce a dataset designed for translating English prose into structured Sanskrit verse, with strict adherence to classical metrical patterns, particularly the Anushtub meter. We evaluate a range of generative models-both open-source and proprietary-under multiple settings. Specifically, we explore constrained decoding strategies and instruction-based fine-tuning tailored to metrical and semantic fidelity. Our decoding approach achieves over 99% accuracy in producing syntactically valid poetic forms, substantially outperforming general-purpose models in meter conformity. Meanwhile, instruction-tuned variants show improved alignment with source meaning and poetic style, as supported by human assessments, albeit with marginal trade-offs in metrical precision.
zh

[NLP-166] GuessBench: Sensemaking Multimodal Creativity in the Wild

【速读】: 该论文试图解决视觉语言模型(VLMs)在建模人类普遍性、噪声性和多元性创造力方面的评估问题,其核心挑战在于如何有效衡量VLM在真实场景下对创意内容的理解与推理能力。解决方案的关键是提出GuessBench,一个基于“Guess the Build”多人在线Minecraft小游戏的数据集,该数据集包含1500张实际游戏画面和2000个涵盖静态与动态图像设置、不同完整度自然语言提示的问题,为VLM作为猜测者在现实环境中进行意义建构的创造性任务提供了纯净的测试平台。

链接: https://arxiv.org/abs/2506.00814
作者: Zifeng Zhu,Shangbin Feng,Herun Wan,Ningnan Wang,Minnan Luo,Yulia Tsvetkov
机构: Xi’an Jiaotong University (西安交通大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling the pervasive, noisy, and pluralistic human creativity. GuessBench sources data from “Guess the Build”, an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g. caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1500 images from the actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the start-of-the-art GPT-4o is incorrect on 34% of instances, while we observe a huge performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on the reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while the accuracy drops sharply for concepts in underrepresented cultural contexts and low-resource languages.
zh

[NLP-167] Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉问答(Visual Question Answering, VQA)任务中处理复杂推理任务时的性能瓶颈问题。现有方法通过引入视觉提示来提升性能,但其存在关键缺陷: indiscriminately 为每个视觉问题标注所有检测到的对象,导致过多的视觉标记,从而降低任务表现。该问题的核心在于缺乏对关键视觉元素的关注,因此提出了FOCUS方法,其关键在于基于双过程理论(Dual Process Theory)动态适应问题复杂度,结合快速直觉判断与深度分析推理,以增强视觉-语言推理能力。

链接: https://arxiv.org/abs/2506.00806
作者: Songtao Jiang,Chenyi Zhou,Yan Zhang,Yeying Jin,Zuozhu Liu
机构: Zhejiang University (浙江大学); Byte Dance (字节跳动); National University of Singapore (新加坡国立大学); Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence (浙江省医学影像人工智能重点实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.
zh

[NLP-168] HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models

【速读】: 该论文旨在解决医学视觉-语言模型(Med-VLMs)在临床应用中因模态不对齐导致的不可靠响应问题。其解决方案的关键在于提出一种名为分层自对比奖励(HSCR)的新方法,该方法通过成本有效的高质量偏好数据生成和捕捉细微且上下文感知的偏好来提升对齐效果。HSCR首先利用Med-VLMs生成更可能被排除的响应的能力,并通过分析视觉标记丢弃后的输出logit变化,识别引发不对齐的模态耦合标记,从而推导出隐式对齐奖励函数。该函数在解码过程中指导用幻觉标记替换,生成高质量的非首选数据;此外,HSCR引入多层级偏好优化策略,通过利用非首选数据中的相对质量来捕捉细微对齐线索,实现更精确和上下文感知的优化。

链接: https://arxiv.org/abs/2506.00805
作者: Songtao Jiang,Yan Zhang,Yeying Jin,Zhihang Tang,Yangyang Wu,Yang Feng,Jian Wu,Zuozhu Liu
机构: Zhejiang University(浙江大学); Byte Dance(字节跳动); National University of Singapore(新加坡国立大学); Angelalign Inc(天使对齐公司); Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence(浙江省医学影像人工智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries.
zh

[NLP-169] RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

【速读】: 该论文旨在解决现有Retrieval-Augmented Generation (RAG)系统在面对现实世界噪声、内部与外部检索上下文冲突以及动态变化事实时的鲁棒性不足问题。其解决方案的关键在于提出一种统一的框架和大规模基准测试体系——Retrieval-Aware Robustness Evaluation (RARE),其中包含知识图谱驱动的合成流程(RARE-Get)和检索条件下的鲁棒性度量标准(RARE-Met)。该方法通过自动提取多跳关系并生成多层次问题集,构建了覆盖金融、经济和政策领域的动态数据集,并量化模型在查询、文档或实际检索结果被系统性扰动时保持正确性或恢复能力的能力。

链接: https://arxiv.org/abs/2506.00789
作者: Yixiao Zeng,Tianyu Cao,Danqing Wang,Xinran Zhao,Zimeng Qiu,Morteza Ziyadi,Tongshuang Wu,Lei Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model’s ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems exhibit surprising vulnerability to perturbations, with document robustness consistently being the weakest point regardless of generator size or architecture. RAG systems consistently show lower robustness on multi-hop queries than single-hop queries across all domains.
zh

[NLP-170] Research Borderlands: Analysing Writing Across Research Cultures ACL2025

【速读】: 该论文试图解决语言技术在文化适应性方面的不足问题,即当前大多数研究未能充分参与其所研究的社群,而是依赖于合成设置和不完善的文化代理。解决方案的关键在于采用以人类为中心的方法,通过与跨学科研究人员的访谈,构建结构、风格、修辞和引用规范的框架,从而发现和测量基于语言的文化规范及大语言模型(LLM)的文化适应能力。

链接: https://arxiv.org/abs/2506.00784
作者: Shaily Bhatt,Tal August,Maria Antoniak
机构: Carnegie Mellon University (卡内基梅隆大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Copenhagen (哥本哈根大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 (Main)

点击查看摘要

Abstract:Improving cultural competence of language technologies is important. However most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.
zh

[NLP-171] KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中因可解释性不足和可信度低而导致的性能受限问题,特别是由幻觉或无法追溯的推理过程引发的问题。其解决方案的关键在于提出一种名为知识图谱约束的轨迹推理归因与链式解释监督(KG-TRACES)的新框架,通过显式监督推理路径和过程来增强LLMs的推理能力,具体包括预测符号关系路径、全三元组级推理路径以及生成基于推理路径的归因感知推理过程。

链接: https://arxiv.org/abs/2506.00783
作者: Rong Wu,Pinlong Cai,Jianbiao Mei,Licheng Wen,Tao Hu,Xuemeng Yang,Daocheng Fu,Botian Shi
机构: Zhejiang University (浙江大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of Science and Technology of China (中国科学技术大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 13 figures

点击查看摘要

Abstract:Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to: (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At inference phase, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible or predicting plausible reasoning paths with only intrinsic knowledge when not. This design enables the model to reason in an explainable and source-attributable pattern. Through extensive experiments on complex reasoning tasks, we demonstrate that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and achieves improvements of 4.8% in Hits@1 and 2.1% in F1 on CWQ. Moreover, we show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning processes, aligning closely with correct answers. Code is available at this https URL.
zh

[NLP-172] Improving Automatic Evaluation of Large Language Models (LLM LLM s) in Biomedical Relation Extraction via LLMs-as-the-Judge ACL2025

【速读】: 该论文试图解决在生物医学关系抽取任务中,传统自动评估指标因生成式AI(Generative AI)能够产生同义词或缩写而不可靠,而人工评估又成本高、耗时长的问题。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)作为评判者(LLM-as-the-Judge)进行评估,并通过结构化输出格式和领域适应技术提升LLM-Judge的性能。

链接: https://arxiv.org/abs/2506.00777
作者: Md Tahmid Rahman Laskar,Israt Jahan,Elham Dolatabadi,Chun Peng,Enamul Hoque,Jimmy Huang
机构: York University (约克大学); Vector Institute (向量研究所)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025 (Main Conference)

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: this https URL.
zh

[NLP-173] Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理极长文本时难以准确阅读和理解的问题。现有方法通常依赖于将长上下文分割为固定长度的块,但这种固定截断可能分离语义相关的内容,导致歧义并影响准确理解。解决方案的关键在于提出一种动态分割和选择长上下文块的方法,通过计算相邻句子之间的语义相似性,利用较低的相似性自适应地将长上下文划分为可变长度的块,并进一步训练一个问答感知分类器来选择对回答特定问题至关重要的块。

链接: https://arxiv.org/abs/2506.00773
作者: Boheng Sheng,Jiacheng Yao,Meicong Zhang,Guoxiu He
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, facilitating a more streamlined input for LLMs. In particular, we compute semantic similarities between adjacent sentences, using lower similarities to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select sensitive chunks that are critical for answering specific questions. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: this https URL
zh

[NLP-174] LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning -Focused Supervised Fine-Tuning ICML2025

【速读】: 该论文试图解决在小规模高质量数据集上对大语言模型(Large Language Models, LLMs)进行微调时,全量微调(Full FT)计算成本高、容易过拟合和灾难性遗忘的问题,以及稀疏微调在LLM时代因难以识别真正关键参数而效果受限的问题。其解决方案的关键在于提出了一种基于低秩近似后权重幅度的稀疏微调方法——低秩感知稀疏微调(Low-rank Informed Sparse Fine-tuning, LIFT),通过仅更新主权重(Principal Weights)的前5%来实现高效且有效的模型微调,同时保持与主流参数高效微调方法相当的记忆效率。

链接: https://arxiv.org/abs/2506.00772
作者: Zihang Liu,Tianyu Pang,Oleg Balabanov,Chaoqun Yang,Tianjin Huang,Lu Yin,Yaoqing Yang,Shiwei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICML 2025

点击查看摘要

Abstract:Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we state that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: this https URL.
zh

[NLP-175] Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨语言情境下可能引发的隐私泄露问题,尤其是在训练数据和用户查询语言不一致时,模型仍可能泄露敏感信息。解决方案的关键在于分析跨语言隐私泄露的信息流,发现模型在中间层共享了大量语言无关的隐私表示,并在后期语言特定层中泄露风险达到峰值。基于此,研究者识别出通用隐私神经元和语言特定隐私神经元,并通过禁用这些神经元有效降低了跨语言隐私泄露的风险,降低幅度为23.3%-31.6%。

链接: https://arxiv.org/abs/2506.00759
作者: Wenshuo Dong,Qingsong Yang,Shu Yang,Lijie Hu,Meng Ding,Wanyu Lin,Tianhang Zheng,Di Wang
机构: King Abdullah University of Science and Technology (KAUST); Provable Responsible AI and Data Analytics (PRADA) Lab; University of Copenhagen; University of Science and Technology of China; State University of New York at Buffalo; The Hong Kong Polytechnic University; Zhejiang University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) trained on massive data capture rich information embedded in the training data. However, this also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data and user queries are in English. We show that they cannot defend against the privacy leakage in cross-lingual contexts: even if the training data is exclusively in one language, these (private) models may still reveal private information when queried in another language. In this work, we first investigate the information flow of cross-lingual privacy leakage to give a better understanding. We find that LLMs process private information in the middle layers, where representations are largely shared across languages. The risk of leakage peaks when converted to a language-specific space in later layers. Based on this, we identify privacy-universal neurons and language-specific privacy neurons. Privacy-universal neurons influence privacy leakage across all languages, while language-specific privacy neurons are only related to specific languages. By deactivating these neurons, the cross-lingual privacy leakage risk is reduced by 23.3%-31.6%.
zh

[NLP-176] ranslate With Care: Addressing Gender Bias Neutrality and Reasoning in Large Language Model Translations ACL2025

【速读】: 该论文旨在解决机器翻译中的性别偏见问题以及在翻译过程中保持逻辑连贯性的挑战,尤其是在将具有自然性别语言(如英语)翻译为无性别语言(如波斯语、印度尼西亚语和芬兰语)时。其解决方案的关键在于构建了Translate-with-Care (TWC)数据集,并通过在该数据集上微调mBART-50模型,显著减少了性别刻板印象和推理错误,实现了更好的泛化能力,并在开放源代码的前提下超越了专有大语言模型。

链接: https://arxiv.org/abs/2506.00748
作者: Pardis Sadat Zahraei,Ali Emami
机构: Brock University (布鲁克大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to Findings of ACL 2025

点击查看摘要

Abstract:Addressing gender bias and maintaining logical coherence in machine translation remains challenging, particularly when translating between natural gender languages, like English, and genderless languages, such as Persian, Indonesian, and Finnish. We introduce the Translate-with-Care (TWC) dataset, comprising 3,950 challenging scenarios across six low- to mid-resource languages, to assess translation systems’ performance. Our analysis of diverse technologies, including GPT-4, mBART-50, NLLB-200, and Google Translate, reveals a universal struggle in translating genderless content, resulting in gender stereotyping and reasoning errors. All models preferred masculine pronouns when gender stereotypes could influence choices. Google Translate and GPT-4 showed particularly strong bias, favoring male pronouns 4-6 times more than feminine ones in leadership and professional success contexts. Fine-tuning mBART-50 on TWC substantially resolved these biases and errors, led to strong generalization, and surpassed proprietary LLMs while remaining open-source. This work emphasizes the need for targeted approaches to gender and semantic coherence in machine translation, particularly for genderless languages, contributing to more equitable and accurate translation systems.
zh

[NLP-177] Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection

【速读】: 该论文旨在解决在隐私保护的分布式学习框架(如联邦学习)中应用参数高效微调(Parameter Efficient Fine-Tuning, PEFT)的挑战,特别是在资源受限设备和客户端数据分布多样性方面的问题。其解决方案的关键在于通过头剪枝(head pruning)、一种新颖的头特定加权聚合机制以及客户端选择策略,实现对基于多头注意力(Multi-Head Attention, MHA)的语言模型进行高效的PEFT。头剪枝通过基于注意力头置信度计算的重要性评分来降低客户端的训练复杂度,而加权聚合机制则确保全局模型能够捕捉来自不同客户端的重要更新,从而提升整体性能。

链接: https://arxiv.org/abs/2506.00743
作者: Yeshwanth Venkatesha,Souvik Kundu,Priyadarshini Panda
机构: Yale University (耶鲁大学); Intel Labs (英特尔实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Parameter Efficient Fine-Tuning (PEFT) has become the de-facto approach in adapting Large Language Models (LLMs) for downstream tasks in Natural Language Processing. However, its adoption in privacy-preserving distributed learning frameworks, such as Federated Learning (FL), remains relatively limited. This is mainly due to challenges specific to FL, such as resource-constrained devices and diverse data distributions among clients. In this paper, we propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based language models. We address the challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity within the clients, guided by the importance score computed based on the confidence of the attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients complementing our client selection strategy. We show results on the MultiNLI benchmark along with 20 Newsgroups, XL-Sum, and E2E NLG datasets. We use the MultiNLI dataset and T5-small model with LoRA as our PEFT method, attaining sparsity levels of up to 90%, resulting in a communication advantage of up to 1.8x and a reduction in training OPs of 3.9x while maintaining the accuracy drop under 2%.
zh

[NLP-178] Data Swarms: Optimizable Generation of Synthetic Evaluation Data

【速读】: 该论文旨在解决合成评估数据生成的优化问题,以及如何提升大语言模型(Large Language Model, LLM)评估的定量目标。其核心解决方案是提出Data Swarms算法,通过训练一组初始数据生成器,并利用粒子群优化(Particle Swarm Optimization, PSO)协同搜索模型参数空间,以找到能提升评估目标的新生成器。关键在于通过协同优化多个评估目标,实现数据生成与模型评估的动态改进。

链接: https://arxiv.org/abs/2506.00741
作者: Shangbin Feng,Yike Wang,Weijia Shi,Yulia Tsvetkov
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.
zh

[NLP-179] Length Aware Speech Translation for Video Dubbing INTERSPEECH2025

【速读】: 该论文旨在解决视频配音中翻译音频与源音频对齐的问题,特别是在实时、设备端的视频配音场景下实现高效对齐。其解决方案的关键在于提出了一种基于音素的端到端长度敏感语音翻译(length-sensitive speech translation, LSST)模型,该模型通过预定义标签生成不同长度的翻译(短、正常和长),并引入了长度感知束搜索(length-aware beam search, LABS),在一次解码过程中生成不同长度的翻译,从而显著提升了源音频与目标音频之间的同步质量。

链接: https://arxiv.org/abs/2506.00740
作者: Harveen Singh Chadha,Aswin Shanmugam Subramanian,Vikas Joshi,Shubham Bansal,Jian Xue,Rupeshkumar Mehta,Jinyu Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: This paper was accepted to Interspeech 2025

点击查看摘要

Abstract:In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths short, normal, and long using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained comparable BLEU scores compared to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving a mean opinion score (MOS) gain of 0.34 for Spanish and 0.65 for Korean, respectively.
zh

[NLP-180] DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在网络安全领域潜力尚未被充分探索的问题,提出了一种实用且开源的评估工具DefenderBench,用于衡量语言代理在攻击、防御及网络安全知识相关任务中的性能。其解决方案的关键在于构建了一个包含网络入侵、恶意内容检测、代码漏洞分析和网络安全知识评估等环境的综合性评估平台,该平台设计注重成本效益与可访问性,同时确保评估的公平性和严谨性,并支持自定义LLM和任务的无缝集成,从而促进研究的可重复性和公平比较。

链接: https://arxiv.org/abs/2506.00739
作者: Chiyu Zhang,Marc-Alexandre Cote,Michael Albada,Anush Sankaran,Jack W. Stokes,Tong Wang,Amir Abdi,William Blum,Muhammad Abdul-Mageed
机构: Microsoft(微软); The University of British Columbia(不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench’s modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at this https URL.
zh

[NLP-181] Narrative Media Framing in Political Discourse ACL2025

【速读】: 该论文试图解决自动化框架分析中对叙事框架(narrative frames)这一重要概念忽视的问题,旨在通过将叙事元素与框架的核心方面相结合,提出一个可形式化和操作化的框架。解决方案的关键在于构建并发布了一个气候变化领域的新闻文章数据集,分析不同政治倾向下叙事框架成分的主导性,并利用大语言模型(LLMs)验证其预测叙事框架及其成分的能力,最终在新冠疫情领域进行了无监督应用,验证了方法的通用性。

链接: https://arxiv.org/abs/2506.00737
作者: Yulia Otmakhova,Lea Frermann
机构: The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Narrative frames are a powerful way of conceptualizing and communicating complex, controversial ideas, however automated frame analysis to date has mostly overlooked this framing device. In this paper, we connect elements of narrativity with fundamental aspects of framing, and present a framework which formalizes and operationalizes such aspects. We annotate and release a data set of news articles in the climate change domain, analyze the dominance of narrative frame components across political leanings, and test LLMs in their ability to predict narrative frames and their components. Finally, we apply our framework in an unsupervised way to elicit components of narrative framing in a second domain, the COVID-19 crisis, where our predictions are congruent with prior theoretical work showing the generalizability of our approach.
zh

[NLP-182] Bregman Conditional Random Fields: Sequence Labeling with Parallelizable Inference Algorithms ACL2025

【速读】: 该论文试图解决序列标注任务中传统线性链条件随机场(Conditional Random Fields, CRF)在推理效率和并行化能力上的局限性。其解决方案的关键在于提出一种新型的判别模型——Bregman条件随机场(Bregman Conditional Random Fields, BCRF),该模型基于迭代Bregman投影实现了可并行化的快速推理算法,并通过Fenchel-Young损失函数进行学习,支持从部分标签中进行训练。实验表明,BCRF在保持与CRF相当性能的同时,提升了计算效率,并在高度约束的场景下优于均值场方法。

链接: https://arxiv.org/abs/2506.00732
作者: Caio Corro,Mathieu Lacroix,Joseph Le Roux
机构: INSA Rennes, IRISA, Inria, CNRS, Université de Rennes, France; Université Sorbonne Paris Nord, CNRS, LIPN, France
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:We propose a novel discriminative model for sequence labeling called Bregman conditional random fields (BCRF). Contrary to standard linear-chain conditional random fields, BCRF allows fast parallelizable inference algorithms based on iterative Bregman projections. We show how such models can be learned using Fenchel-Young losses, including extension for learning from partial labels. Experimentally, our approach delivers comparable results to CRF while being faster, and achieves better results in highly constrained settings compared to mean field, another parallelizable alternative.
zh

[NLP-183] Structured Gradient Guidance for Few-Shot Adaptation in Large Language Models

【速读】: 该论文旨在解决在少量样本条件下,大型语言模型(Large Language Models)的任务适应性和训练稳定性问题。其解决方案的关键在于引入一种基于梯度的微调方法,该方法通过构建基础损失函数并添加两个与梯度相关的正则化项:一是强制梯度方向一致性以引导参数更新至任务相关方向并防止偏移,二是控制梯度幅度以避免异常更新。此外,还引入了梯度对齐机制,以提升跨任务和跨领域场景下的泛化能力。

链接: https://arxiv.org/abs/2506.00726
作者: Hongye Zheng,Yichen Wang,Ray Pan,Guiran Liu,Binrong Zhu,Hanlu Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents a gradient-informed fine-tuning method for large language models under few-shot conditions. The goal is to enhance task adaptability and training stability when data is limited. The method builds on a base loss function and introduces two gradient-related regularization terms. The first enforces gradient direction consistency to guide parameter updates along task-relevant directions and prevent drift. The second controls gradient magnitude to avoid abnormal updates. Together, these components support a more efficient and stable optimization path. To further improve cross-task generalization, the method incorporates a gradient alignment mechanism. This mechanism measures the consistency between optimization directions of the source and target tasks. It enhances fine-tuning performance in multi-task and cross-domain scenarios. Across various natural language understanding tasks, the method outperforms existing fine-tuning strategies in average accuracy, gradient stability, and directional alignment. Empirical evaluations under different sample sizes and domain-specific tasks confirm the method’s robustness and broad applicability in low-resource environments. In particular, the method shows clear advantages in controlling parameter update paths. The results demonstrate that a gradient-based fine-tuning framework can effectively leverage the representational power of large language models. It ensures training stability while reducing dependence on large volumes of labeled data.
zh

[NLP-184] Chain-of-Thought Training for Open E2E Spoken Dialogue Systems INTERSPEECH2025

【速读】: 该论文旨在解决传统级联流水线在端到端(E2E)语音对话系统中难以保持完全可微性以及生成响应缺乏语义连贯性的问题。其解决方案的关键在于提出一种基于思维链(CoT)的策略,通过使对话数据训练与多模态语言模型(LM)在语音识别(ASR)、文本到语音合成(TTS)和文本LM任务上的预训练保持紧密对齐,从而提升模型性能。该方法在仅使用300小时公开的人机对话数据的情况下,实现了超过1.5 ROUGE-1的性能提升。

链接: https://arxiv.org/abs/2506.00722
作者: Siddhant Arora,Jinchuan Tian,Hayato Futami,Jee-weon Jung,Jiatong Shi,Yosuke Kashiwagi,Emiru Tsunoo,Shinji Watanabe
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generates responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition~(ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as the Switchboard. We will publicly release our models and training code.
zh

[NLP-185] From Argumentative Text to Argument Knowledge Graph: A New Framework for Structured Argumentation

【速读】: 该论文试图解决如何将论证性文本转化为可理解且可推理的论证知识图谱(Argument Knowledge Graph, AKG)的问题。其解决方案的关键在于通过基本的论证成分(Argumentative Components, ACs)和论证关系(Argumentative Relations, ARs)的标注,构建包含元数据属性的知识库(Knowledge Base, KB)图,并利用前提和推理规则通过假言推理(Modus Ponens)形成论证,进而构建AKG。此外,通过识别标记来发现缺失的推理规则,从而能够检测之前无法识别的削弱攻击(Undercut Attacks),并为未来的推理任务提供支持。

链接: https://arxiv.org/abs/2506.00713
作者: Debarati Bhattacharjee,Ashish Anand
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:This paper presents a framework to convert argumentative texts into argument knowledge graphs (AKG). Starting with basic annotations of argumentative components (ACs) and argumentative relations (ARs), we enrich the information by constructing a knowledge base (KB) graph with metadata attributes for nodes. Next, we use premises and inference rules from the KB to form arguments by applying modus ponens. From these arguments, we create an AKG. The nodes and edges of the AKG have attributes that capture important argumentative features. We also find missing inference rules by identifying markers. This makes it possible to identify undercut attacks that were previously undetectable in existing datasets. The AKG gives a graphical view of the argumentative structure that is easier to understand than theoretical formats. It also prepares the ground for future reasoning tasks, including checking the coherence of arguments and identifying opportunities for revision. For this, it is important to find indirect relations, many of which are implicit. Our proposed AKG format, with annotated inference rules and modus ponens, will help reasoning models learn the implicit indirect relations that require inference over arguments and the relations between them.
zh

[NLP-186] DrKGC: Dynamic Subgraph Retrieval-Augmented LLM s for Knowledge Graph Completion across General and Biomedical Domains

【速读】: 该论文旨在解决知识图谱补全(Knowledge Graph Completion, KGC)中由于传统方法将图上下文编码为文本形式而未能充分挖掘生成式大语言模型(Generative LLMs)在图结构感知与推理方面的潜力的问题。其解决方案的关键在于提出DrKGC框架,该框架通过灵活轻量的模型训练策略学习知识图谱中的结构嵌入和逻辑规则,并结合一种新颖的自底向上的图检索方法,为每个查询提取子图,随后利用图卷积网络(GCN)适配器增强结构嵌入,最终将其整合到提示中以实现有效的LLM微调。

链接: https://arxiv.org/abs/2506.00708
作者: Yongkang Xiao,Sinian Zhang,Yi Dai,Huixue Zhou,Jue Hou,Jie Ding,Rui Zhang
机构: University of Minnesota, Minneapolis, MN, USA (明尼苏达大学); University of Michigan, Ann Arbor, MI, USA (密歇根大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge graph completion (KGC) aims to predict missing triples in knowledge graphs (KGs) by leveraging existing triples and textual information. Recently, generative large language models (LLMs) have been increasingly employed for graph tasks. However, current approaches typically encode graph context in textual form, which fails to fully exploit the potential of LLMs for perceiving and reasoning about graph structures. To address this limitation, we propose DrKGC (Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion). DrKGC employs a flexible lightweight model training strategy to learn structural embeddings and logical rules within the KG. It then leverages a novel bottom-up graph retrieval method to extract a subgraph for each query guided by the learned rules. Finally, a graph convolutional network (GCN) adapter uses the retrieved subgraph to enhance the structural embeddings, which are then integrated into the prompt for effective LLM fine-tuning. Experimental results on two general domain benchmark datasets and two biomedical datasets demonstrate the superior performance of DrKGC. Furthermore, a realistic case study in the biomedical domain highlights its interpretability and practical utility.
zh

[NLP-187] Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM -Generated 3-ply Case-Based Legal Arguments

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在法律任务中生成三段式法律论证时的可靠性问题,特别是其在忠实性(无幻觉)、因素利用和适当回避方面的表现。解决方案的关键在于构建一个自动化评估流程,通过外部LLM从生成的论证中提取因素,并与输入案例三元组中的真实因素进行对比,从而客观评估LLM在不同难度测试中的表现,包括避免幻觉、充分利用相关因素以及在无事实依据时正确回避的能力。

链接: https://arxiv.org/abs/2506.00694
作者: Li Zhang,Morgan Gray,Jaromir Savelka,Kevin D. Ashley
机构: University of Pittsburgh (匹兹堡大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 7th Workshop on Automated Semantic Analysis of Information in Legal Text, 16 June 2025, Chicago, IL

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model’s ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Project page: this https URL.
zh

[NLP-188] Existing Large Language Model Unlearning Evaluations Are Inconclusive

【速读】: 该论文试图解决机器遗忘(machine unlearning)评估方法中存在的局限性,这些问题可能导致对模型遗忘效果的误判。研究指出当前评估方法存在三个关键问题:评估过程中可能引入新的信息从而掩盖真实的遗忘性能、不同任务间的评估结果差异显著导致评估结果难以泛化,以及依赖虚假相关性使得结果不可靠。解决方案的关键在于提出两个原则:最小信息注入和下游任务感知,通过严格的实验验证这些原则的有效性,以提高评估的准确性和可靠性。

链接: https://arxiv.org/abs/2506.00688
作者: Zhili Feng,Yixuan Even Xu,Alexander Robey,Robert Kirk,Xander Davies,Yarin Gal,Avi Schwarzschild,J. Zico Kolter
机构: Carnegie Mellon University (卡内基梅隆大学); UK AI Security Institute (英国人工智能安全研究所); OATML, University of Oxford (OATML,牛津大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine unlearning aims to remove sensitive or undesired data from large language models. However, recent studies suggest that unlearning is often shallow, claiming that removed knowledge can easily be recovered. In this work, we critically examine standard unlearning evaluation practices and uncover key limitations that shake our trust in those findings. First, we show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance by re-teaching the model during testing. Second, we demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. Finally, we find that many evaluations rely on spurious correlations, making their results difficult to trust and interpret. Taken together, these issues suggest that current evaluation protocols may both overstate and understate unlearning success. To address this, we propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness. We validate these principles through a series of targeted experiments, showing how violations of each can lead to misleading conclusions.
zh

[NLP-189] DeepRAG : Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA

【速读】: 该论文旨在解决生物医学问答任务中复杂问题的准确回答问题,特别是针对MedHopQA数据集的挑战。其解决方案的关键在于提出DeepRAG框架,该框架结合了DeepSeek的分层问题分解能力与RAG Gym统一的检索增强生成优化,并通过过程级监督进行训练,同时利用UMLS本体论提供的概念级奖励信号来提升生物医学准确性。

链接: https://arxiv.org/abs/2506.00671
作者: Yuelyu Ji,Hang Zhang,Shiven Verma,Hui Ji,Chun Li,Yushui Han,Yanshan Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept level accuracy.
zh

[NLP-190] SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

【速读】: 该论文试图解决恶意攻击者通过多轮对话利用大语言模型(Large Language Models, LLMs)实现有害目标的安全风险问题。解决方案的关键在于提出一种新的防御机制——SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM),其核心是构建一个由人工标注的安全部分多轮对话数据集,并以此对一个即插即用的安全推理调节器进行微调,使其能够识别多轮对话中的恶意意图并警告目标LLM潜在风险,从而在保持LLM功能能力的同时有效降低攻击成功率。

链接: https://arxiv.org/abs/2506.00668
作者: Martin Kuo,Jianyi Zhang,Aolin Ding,Louis DiValentin,Amin Hass,Benjamin F Morris,Isaac Jacobson,Randolph Linderman,James Kiessling,Nicolas Ramos,Bhavna Gopal,Maziyar Baran Pouyan,Changwei Liu,Hai Li,Yiran Chen
机构: Duke University (杜克大学); Accenture (埃森哲)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.
zh

[NLP-191] Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

【速读】: 该论文旨在解决 sarcasm(讽刺)的分类与生成问题,特别是在大型语言模型中准确理解和生成具有复杂语义的讽刺表达。其关键解决方案是提出一种基于情感的提示技术(emotion-based prompting),通过识别讽刺的核心要素——反差性(incongruity)、冲击力(shock value)和语境依赖性(context dependency),从而提升模型在讽刺分类和生成任务中的性能。实验结果表明,该方法在F1分数上优于其他设置,并且在人类评估中表现出更高的生成成功率。

链接: https://arxiv.org/abs/2506.00658
作者: Lang Xiong,Raina Gao,Alyssa Jeong,Yicheng Fu,Sean O’Brien,Vasu Sharma,Kevin Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sarcasm is a form of humor where expressions convey meanings opposite to their literal interpretations. Classifying and generating sarcasm using large language models is vital for interpreting human communication. Sarcasm poses challenges for computational models, due to its nuanced nature. We introduce Sarc7, a benchmark that classifies 7 types of sarcasm: self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic by annotating entries of the MUStARD dataset. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. We propose an emotion-based generation method developed by identifying key components of sarcasm-incongruity, shock value, and context dependency. Our classification experiments show that Gemini 2.5, using emotion-based prompting, outperforms other setups with an F1 score of 0.3664. Human evaluators preferred our emotion-based prompting, with 38.46% more successful generations than zero-shot prompting.
zh

[NLP-192] Linear Representation Transferability Hypothesis: Leverag ing Small Models to Steer Large Models

【速读】: 该论文试图解决不同规模神经网络在相同数据上训练后,其学习到的表示是否具有可迁移性的问题,即如何理解模型尺度变化下的表示对齐问题。解决方案的关键在于提出线性表示可迁移性(Linear Representation Transferability, LRT)假设,认为不同模型的表示空间之间存在仿射变换关系,并通过学习不同规模模型隐藏状态之间的仿射映射来验证该假设,结果表明此类映射能够保留与特定模型行为相关的语义方向,从而证明小模型学习到的表示可以用于引导大模型的行为。

链接: https://arxiv.org/abs/2506.00653
作者: Femi Bello,Anubrata Das,Fanzhi Zeng,Fangcong Yin,Leqi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:It has been hypothesized that neural networks with similar architectures trained on similar data learn shared representations relevant to the learning task. We build on this idea by extending the conceptual framework where representations learned across models trained on the same data can be expressed as linear combinations of a \emphuniversal set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale. From this framework, we propose the \textbfLinear Representation Transferability (LRT) Hypothesis – that there exists an affine transformation between the representation spaces of different models. To test this hypothesis, we learn affine mappings between the hidden states of models of different sizes and evaluate whether steering vectors – directions in hidden state space associated with specific model behaviors – retain their semantic effect when transferred from small to large language models using the learned mappings. We find strong empirical evidence that such affine mappings can preserve steering behaviors. These findings suggest that representations learned by small models can be used to guide the behavior of large models, and that the LRT hypothesis may be a promising direction on understanding representation alignment across model scales.
zh

[NLP-193] GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction ACL

【速读】: 该论文旨在解决传统信息抽取(Information Extraction, IE)系统在跨领域泛化能力不足的问题,特别是在未见过的领域中,由于标签定义差异导致性能显著下降。其解决方案的关键在于提出GUIDEX方法,该方法能够自动定义领域特定的模式、推断标注指南并生成合成标注实例,从而提升模型在未见领域的泛化能力。通过在Llama 3.1上进行微调,GUIDEX在七个零样本命名实体识别基准上达到了新的最先进水平。

链接: https://arxiv.org/abs/2506.00649
作者: Neil De La Fuente,Oscar Sainz,Iker García-Ferrero,Eneko Agirre
机构: HiTZ Basque Center for Language Technology - Ixa NLP Group (HiTZ巴斯克语言技术中心-IXA自然语言处理组); University of the Basque Country (UPV/EHU) (巴斯克自治区大学(UPV/EHU)); Technical University of Munich (TUM) (慕尼黑工业大学(TUM))
类目: Computation and Language (cs.CL)
备注: ACL Findings 2025

点击查看摘要

Abstract:Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zeroshot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without humanlabeled data, and nearly 2 F1 points higher when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at this http URL
zh

[NLP-194] Clinical Annotations for Automatic Stuttering Severity Assessment INTERSPEECH2025

【速读】: 该论文试图解决的是言语流畅性障碍(stuttering)评估与治疗中缺乏高质量标注数据的问题,其解决方案的关键在于基于临床标准构建新的口吃标注方案,并通过专业临床医生进行高精度标注,以确保标注结果反映真实临床专业知识。此外,该方案还引入了多模态特征,结合音频和视频信息用于口吃时刻、次级行为及紧张程度的检测与分类,并提供了基于专家共识的高可靠性测试集,以支持个体标注者和机器学习模型的评估。

链接: https://arxiv.org/abs/2506.00644
作者: Ana Rita Valente,Rufael Marew,Hawau Olamide Toyin,Hamdan Al-Ali,Anelise Bohnen,Inma Becerra,Elsa Marta Soares,Goncalo Leal,Hanan Aldarmaki
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Stuttering is a complex disorder that requires specialized expertise for effective assessment and treatment. This paper presents an effort to enhance the FluencyBank dataset with a new stuttering annotation scheme based on established clinical standards. To achieve high-quality annotations, we hired expert clinicians to label the data, ensuring that the resulting annotations mirror real-world clinical expertise. The annotations are multi-modal, incorporating audiovisual features for the detection and classification of stuttering moments, secondary behaviors, and tension scores. In addition to individual annotations, we additionally provide a test set with highly reliable annotations based on expert consensus for assessing individual annotators and machine learning models. Our experiments and analysis illustrate the complexity of this task that necessitates extensive clinical expertise for valid training and evaluation of stuttering assessment models.
zh

[NLP-195] SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理Select All That Apply (SATA)问题时存在的能力不足问题,即无法可靠地识别所有正确答案。研究表明,即使最先进的模型在SATA任务上的准确匹配率也仅达到41.8%,主要受限于两个核心挑战:选择偏差(selection bias)和数量偏差(count bias)。解决方案的关键在于提出Choice Funnel,这是一种结合了token去偏和自适应阈值的解码策略,旨在引导模型实现完整且准确的答案选择,从而在提升准确率的同时降低推理成本。

链接: https://arxiv.org/abs/2506.00643
作者: Weijie Xu,Shixian Cui,Xi Fang,Chi Xue,Stephanie Eckman,Chandan Reddy
机构: Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 40 pages, 13 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs’ inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias - models favor certain choices regardless of content, and count bias - models fail to predict the correct number of answers. To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
zh

[NLP-196] Improving the Calibration of Confidence Scores in Text Generation Using the Output Distributions Characteristics ACL2025

【速读】: 该论文试图解决文本生成模型中置信度评分校准不足的问题(well-calibrated model confidence scores),这一问题可能导致模型返回低质量或潜在危险的预测。解决方案的关键在于提出一种与任务无关的置信度度量方法,该方法仅依赖于模型输出的概率分布,而无需进一步微调或启发式规则。通过这种方法,作者成功提升了BART和Flan-T5在摘要、翻译和问答数据集上的校准效果。

链接: https://arxiv.org/abs/2506.00637
作者: Lorenzo Jaime Yu Flores,Ori Ernst,Jackie Chi Kit Cheung
机构: Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); McGill University (麦吉尔大学); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Main Conference

点击查看摘要

Abstract:Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.
zh

[NLP-197] ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech Utterances INTERSPEECH2025

【速读】: 该论文旨在解决在线平台上毒性语音内容检测的问题,特别是针对低资源语言如越南语的音频内容检测研究较为匮乏的现状。解决方案的关键在于构建了首个针对越南语语音中毒性片段检测的数据集ViToSA,并提出了一种结合自动语音识别(ASR)和文本毒性片段检测(TSD)的处理流程,以实现对语音内容中毒性部分的细粒度识别。

链接: https://arxiv.org/abs/2506.00636
作者: Huy Ba Do,Vy Le-Phuong Huynh,Luan Thanh Nguyen
机构: Faculty of Computer Science (计算机科学学院); Faculty of Information Science and Engineering (信息科学与工程学院)
类目: Computation and Language (cs.CL)
备注: Accepted for presentation at INTERSPEECH 2025

点击查看摘要

Abstract:Toxic speech on online platforms is a growing concern, impacting user experience and online safety. While text-based toxicity detection is well-studied, audio-based approaches remain underexplored, especially for low-resource languages like Vietnamese. This paper introduces ViToSA (Vietnamese Toxic Spans Audio), the first dataset for toxic spans detection in Vietnamese speech, comprising 11,000 audio samples (25 hours) with accurate human-annotated transcripts. We propose a pipeline that combines ASR and toxic spans detection for fine-grained identification of toxic content. Our experiments show that fine-tuning ASR models on ViToSA significantly reduces WER when transcribing toxic speech, while the text-based toxic spans detection (TSD) models outperform existing baselines. These findings establish a novel benchmark for Vietnamese audio-based toxic spans detection, paving the way for future research in speech content moderation.
zh

[NLP-198] Social Construction of Urban Space: Understanding Neighborhood Boundaries Using Rental Listings

【速读】: 该论文试图解决城市空间如何通过语言被社会建构的问题,特别是通过租赁广告分析中介对社区的描述,揭示机构边界与社区声明之间的不匹配。其解决方案的关键在于利用自然语言处理技术,结合人工和大语言模型标注,对非结构化 Craigslist 租赁信息进行分类,并通过地理空间分析和主题建模识别出与空间位置相关的模式,从而揭示城市空间定义在传统方法中被忽视的争议性特征。

链接: https://arxiv.org/abs/2506.00634
作者: Adam Visokay,Ruth Bagley,Ian Kennedy,Chris Hess,Kyle Crowder,Rob Voigt,Denis Peskoff
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Rental listings offer a unique window into how urban space is socially constructed through language. We analyze Chicago Craigslist rental advertisements from 2018 to 2024 to examine how listing agents characterize neighborhoods, identifying mismatches between institutional boundaries and neighborhood claims. Through manual and large language model annotation, we classify unstructured listings from Craigslist according to their neighborhood. Geospatial analysis reveals three distinct patterns: properties with conflicting neighborhood designations due to competing spatial definitions, border properties with valid claims to adjacent neighborhoods, and ``reputation laundering" where listings claim association with distant, desirable neighborhoods. Through topic modeling, we identify patterns that correlate with spatial positioning: listings further from neighborhood centers emphasize different amenities than centrally-located units. Our findings demonstrate that natural language processing techniques can reveal how definitions of urban spaces are contested in ways that traditional methods overlook.
zh

[NLP-199] LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech INTERSPEECH2025

【速读】: 该论文旨在解决语言识别(LID)模型在识别带有口音的语音时性能显著下降的问题,特别是针对第二语言(L2)口音语音被误分类为母语或相关语言的现象。其解决方案的关键在于揭示当前最先进的模型对短时语音片段的排列不变性,表明模型依赖于与口音相关的短音系特征进行分类而非语言本身,并通过输入分块(input chunking)提升模型对口音的鲁棒性;此外,还提出了一种无需依赖单语自动语音识别(ASR)系统的序列级信息整合方法,有效减少口音与语言的混淆,从而显著提升对带口音语音的识别性能。

链接: https://arxiv.org/abs/2506.00628
作者: Niyati Bafna,Matthew Wiesner
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Prior research indicates that LID model performance significantly declines on accented speech; however, the specific causes, extent, and characterization of these errors remain under-explored. (i) We identify a common failure mode on accented speech whereby LID systems often misclassify L2 accented speech as the speaker’s native language or a related language. (ii) We present evidence suggesting that state-of-the-art models are invariant to permutations of short spans of speech, implying they classify on the basis of short phonotactic features indicative of accent rather than language. Our analysis reveals a simple method to enhance model robustness to accents through input chunking. (iii) We present an approach that integrates sequence-level information into our model without relying on monolingual ASR systems; this reduces accent-language confusion and significantly enhances performance on accented speech while maintaining comparable results on standard LID.
zh

[NLP-200] Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples

【速读】: 该论文旨在解决对话状态追踪(DST)中基于上下文学习的检索器在构建训练数据时存在的三个关键问题:未考虑示例之间的协同效应、未充分考虑查询的语言特征以及评分未直接优化DST性能。解决方案的关键在于提出CombiSearch方法,该方法通过评估示例对DST性能的组合影响来打分,从而选择更有效的上下文示例。

链接: https://arxiv.org/abs/2506.00622
作者: Haesung Pyun,Yoonah Park,Yohan Jo
机构: Seoul National University (首尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In dialogue state tracking (DST), in-context learning comprises a retriever that selects labeled dialogues as in-context examples and a DST model that uses these examples to infer the dialogue state of the query dialogue. Existing methods for constructing training data for retrievers suffer from three key limitations: (1) the synergistic effect of examples is not considered, (2) the linguistic characteristics of the query are not sufficiently factored in, and (3) scoring is not directly optimized for DST performance. Consequently, the retriever can fail to retrieve examples that would substantially improve DST performance. To address these issues, we present CombiSearch, a method that scores effective in-context examples based on their combinatorial impact on DST performance. Our evaluation on MultiWOZ shows that retrievers trained with CombiSearch surpass state-of-the-art models, achieving a 20x gain in data efficiency and generalizing well to the SGD dataset. Moreover, CombiSearch attains a 12% absolute improvement in the upper bound DST performance over traditional approaches when no retrieval errors are assumed. This significantly increases the headroom for practical DST performance while demonstrating that existing methods rely on suboptimal data for retriever training.
zh

[NLP-201] Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在临床任务如诊断和治疗中的决策能力评估缺乏严谨基准的问题。其解决方案的关键在于提出一种基于知识引导的数据增强框架,通过生成具有误导性的干扰项(distractors)来提升临床多项选择题(MCQ)数据集的难度,从而更准确地评估LLMs的可靠性。该方法利用医学知识图谱进行多步骤、语义导向的路径探索,识别出医学相关但事实错误的关联路径,进而指导生成更具欺骗性的干扰项,有效降低了先进LLMs的准确性。

链接: https://arxiv.org/abs/2506.00612
作者: Running Yang,Wenlong Deng,Minghui Chen,Yuyin Zhou,Xiaoxiao Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths-associations that are medically relevant but factually incorrect-which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipline, to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.
zh

[NLP-202] PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements

【速读】: 该论文旨在解决合同审查过程中存在的复杂性、耗时性以及对专业法律知识的依赖问题,同时应对法律解释中的模糊性和主观性,以及合同的保密性带来的模型应用限制。其解决方案的关键在于提出PAKTON:一个完全开源、端到端的多智能体框架,具备即插即用能力,通过协作式智能体工作流和一种新颖的检索增强生成(Retrieval-Augmented Generation, RAG)组件,实现更高效、可访问且隐私保护的自动化法律文件审查。

链接: https://arxiv.org/abs/2506.00608
作者: Petros Raptopoulos,Giorgos Filandrianos,Maria Lymperaiou,Giorgos Stamou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward-ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.
zh

[NLP-203] Entriever: Energy-based Retriever for Knowledge-Grounded Dialog Systems ACL2025

【速读】: 该论文试图解决知识基础对话系统中检索器模型假设知识片段条件独立性的问题,而实际上在给定上下文时可能存在多个相关且相关的知识片段。解决方案的关键在于提出Entriever,一种基于能量的检索器,它将候选检索结果作为一个整体进行建模,而非分别建模每个知识片段,其相关性得分由能量函数定义。

链接: https://arxiv.org/abs/2506.00585
作者: Yucheng Cai,Ke Li,Yi Huang,Junlan Feng,Zhijian Ou
机构: Tsinghua University (清华大学); China Mobile Research Institute (中国移动研究院)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL2025 Findings

点击查看摘要

Abstract:A retriever, which retrieves relevant knowledge pieces from a knowledge base given a context, is an important component in many natural language processing (NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog systems to improve knowledge acquisition. In knowledge-grounded dialog systems, when conditioning on a given context, there may be multiple relevant and correlated knowledge pieces. However, knowledge pieces are usually assumed to be conditionally independent in current retriever models. To address this issue, we propose Entriever, an energy-based retriever. Entriever directly models the candidate retrieval results as a whole instead of modeling the knowledge pieces separately, with the relevance score defined by an energy function. We explore various architectures of energy functions and different training methods for Entriever, and show that Entriever substantially outperforms the strong cross-encoder baseline in knowledge retrieval tasks. Furthermore, we show that in semi-supervised training of knowledge-grounded dialog systems, Entriever enables effective scoring of retrieved knowledge pieces and significantly improves end-to-end performance of dialog systems.
zh

[NLP-204] he Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation

【速读】: 该论文试图解决社交媒体平台上由表情符号(emoji)引发的攻击性内容问题,特别是表情符号在特定语境下可能被滥用以传达有害含义的现象。解决方案的关键在于提出一种基于大语言模型(LLM)的多步骤内容审核流程,该流程能够选择性地替换具有潜在危害的表情符号,同时保持推文的语义意图。

链接: https://arxiv.org/abs/2506.00583
作者: Yuhang Zhou,Yimin Xiao,Wei Ai,Ge Gao
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet’s semantic intent. Human evaluations confirm our approach effectively reduces perceived offensiveness without sacrificing meaning. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.
zh

[NLP-205] Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLM s

【速读】: 该论文试图解决在多智能体系统(Multi-Agent Systems, MAS)中直接训练大型语言模型(Large Language Models, LLMs)所面临的挑战,包括复杂的奖励建模、动态的智能体交互以及对泛化能力的高要求。其解决方案的关键在于采用后训练技术,特别是监督微调(Supervised Fine-Tuning, SFT)和可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR),通过经济推理任务进行验证,以提升模型的结构化推理能力和经济理性。

链接: https://arxiv.org/abs/2506.00577
作者: Yufa Zhou,Shaobo Wang,Xingyu Dong,Xiangqi Jin,Yifang Chen,Yue Min,Kexin Yang,Xingzhang Ren,Dayiheng Liu,Linfeng Zhang
机构: Duke University (杜克大学); EPIC Lab, Shanghai Jiao Tong University (上海交通大学电子工程与计算机科学实验室); Qwen Team, Alibaba Group (阿里集团通义实验室); University of Pennsylvania (宾夕法尼亚大学); The University of Chicago (芝加哥大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively \textitgeneralize to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce \textbfRecon ( \textbfR easoning like an \textbfECON omist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at this https URL .
zh

[NLP-206] MMedAgent -RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

【速读】: 该论文旨在解决现有单智能体医学大视觉-语言模型(Med-LVLMs)在跨多样化医学专科时泛化能力不足的问题,从而限制了其诊断性能。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的多智能体框架MMedAgent-RL,通过动态优化的协作机制实现不同医学角色之间的灵活互动,其中包含分诊医生和主治医师两个GP智能体,分别负责患者分诊与综合决策,并引入课程学习(Curriculum Learning, CL)引导的强化学习策略以提升主治医师对专科输出不一致性的处理能力。

链接: https://arxiv.org/abs/2506.00555
作者: Peng Xia,Jinglu Wang,Yibo Peng,Kaide Zeng,Xian Wu,Xiangru Tang,Hongtu Zhu,Yun Li,Shujie Liu,Yan Lu,Huaxiu Yao
机构: UNC-Chapel Hill (北卡罗来纳大学教堂山分校); Microsoft Research (微软研究院); CMU (卡内基梅隆大学); Yale University (耶鲁大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 18.4% over supervised fine-tuning baselines.
zh

[NLP-207] AnnaAgent : Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation

【速读】: 该论文试图解决在AI驱动的心理健康领域中,由于涉及真实求助者(seeker)的成本和伦理问题,导致难以构建高度逼真的求助者模拟器的问题。其解决方案的关键在于提出AnnaAgent,这是一个具备三级记忆机制的情感与认知动态代理系统,通过情感调节器和基于真实咨询对话训练的投诉诱发器,实现对模拟器配置的动态控制,并有效整合跨会话的短期和长期记忆,从而提升心理咨询服务中求助者模拟的真实性。

链接: https://arxiv.org/abs/2506.00551
作者: Ming Wang,Peidong Wang,Lin Wu,Xiaocui Yang,Daling Wang,Shi Feng,Yuxin Chen,Bixuan Wang,Yifei Zhang
机构: Northeastern University (东北大学); Central University of Finance and Economics (中央财经大学); Northeast Normal University (东北师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers’ mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator’s configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on this https URL.
zh

[NLP-208] owards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages

【速读】: 该论文试图解决现有文本摘要评估框架在领域覆盖、语言多样性以及人工标注质量方面的不足(current benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning)。其解决方案的关键在于引入MSumBench,该框架提供了多维度、多领域的中英文摘要评估,并为每个领域设计了专门的评估标准,同时采用多智能体辩论系统提升标注质量。

链接: https://arxiv.org/abs/2506.00549
作者: Hyangsuk Min,Yuho Lee,Minjeong Ban,Jiaqi Deng,Nicole Hee-Yeon Kim,Taewon Yun,Hang Su,Jason Cai,Hwanjun Song
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院); AWS AI Labs (AWS AI 实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages, 6 figures

点击查看摘要

Abstract:Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at this https URL.
zh

[NLP-209] Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

【速读】: 该论文旨在解决多模态语言模型(Multimodal Language Models, MLLMs)在面对非文本形式的对抗性指令时的安全性问题,特别是如何通过生成对抗性图像或音频来绕过模型的安全机制。其解决方案的关键在于提出了一种名为Con Instruction的新方法,该方法无需依赖训练数据或对文本指令进行预处理,而是通过优化对抗性示例使其在嵌入空间中与目标指令高度对齐,从而有效触发模型的恶意行为。

链接: https://arxiv.org/abs/2506.00548
作者: Jiahui Geng,Thy Thy Tran,Preslav Nakov,Iryna Gurevych
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs’ sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the model’s response and its relevance to the malicious instructions. Experimental results demonstrate that Con Instruction effectively bypasses safety mechanisms in multiple vision- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, evaluated on two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various countermeasures against our attacks and uncover a substantial performance gap among existing techniques. Our implementation is made publicly available.
zh

[NLP-210] ARIA: Training Language Agents with Intention-Driven Reward Aggregation

【速读】: 该论文试图解决在开放语言动作环境中,由于动作空间为令牌的联合分布而导致的奖励稀疏性和高奖励方差问题,这阻碍了有效强化学习(Reinforcement Learning, RL)的进行。解决方案的关键在于提出ARIA方法,该方法通过将自然语言动作从高维联合令牌分布空间投影到低维意图空间,实现语义相似动作的聚类与共享奖励,从而减少奖励方差并提升策略优化效果。

链接: https://arxiv.org/abs/2506.00539
作者: Ruihan Yang,Yikai Zhang,Aili Chen,Xintao Wang,Siyu Yuan,Jiangjie Chen,Deqing Yang,Yanghua Xiao
机构: Fudan University (复旦大学); Bytedance Seed (字节跳动种子)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
zh

[NLP-211] Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing

【速读】: 该论文旨在解决知识编辑(Knowledge Editing)中因外部更新与模型内部参数化知识冲突而导致的推理一致性与准确性下降问题。现有在上下文编辑(In-Context Editing, ICE)方法未能明确区分新注入知识与模型原有推理过程,导致在多跳任务中性能退化。论文提出的解决方案关键在于DecKER框架,其通过生成掩码推理路径,并结合混合检索与模型验证机制,实现推理过程与知识编辑的解耦,从而有效缓解知识冲突并保持推理一致性。

链接: https://arxiv.org/abs/2506.00536
作者: Changyue Wang,Weihang Su,Qingyao Ai,Yujia Zhou,Yiqun Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model’s original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning this http URL this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model’s reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: this https URL .
zh

[NLP-212] CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing

【速读】: 该论文旨在解决通过视觉数据理解城市社会经济状况的问题,这是实现可持续城市发展和政策规划的重要但具有挑战性的任务。其解决方案的关键在于引入了 \textbfCityLens,这是一个全面的基准,用于评估大型语言-视觉模型(LLVMs)从卫星和街景图像中预测社会经济指标的能力。该基准构建了一个多模态数据集,覆盖全球17个城市,涵盖经济、教育、犯罪、交通、健康和环境等6个关键领域,并定义了11项预测任务,结合三种评估范式进行模型性能评估。

链接: https://arxiv.org/abs/2506.00530
作者: Tianhui Liu,Jie Feng,Hetian Pang,Xin Zhang,Tianjian Ouyang,Zhiyuan Zhang,Yong Li
机构: Beijing Jiaotong University (北京交通大学); Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce \textbfCityLens , a comprehensive benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LLVMs to understand and predict urban socioeconomic patterns. Our codes and datasets are open-sourced via this https URL.
zh

[NLP-213] Retrieval-Augmented Generation Systems for Intellectual Property via Synthetic Multi-Angle Fine-tuning

【速读】: 该论文旨在解决知识产权(Intellectual Property, IP)领域中检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理多样化用户查询时的挑战,包括口语化表达、拼写错误和模糊术语等问题,这些问题导致检索不准确和响应质量不佳。其解决方案的关键在于提出一种名为多角度问题生成与检索微调方法(Multi-Angle Question Generation and Retrieval Fine-Tuning Method, MQG-RFM)的框架,该框架利用大语言模型(Large Language Models, LLMs)生成多样化的用户查询,并通过语义对齐但语言形式多样的问题对检索模型进行微调,从而提升检索鲁棒性。该方法采用轻量级的数据到微调范式,结合提示工程的问题生成与硬负样本挖掘,无需昂贵的基础设施改动即可显著提升检索和生成性能。

链接: https://arxiv.org/abs/2506.00527
作者: Runtao Ren,Jian Ma,Jianxi Luo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems in the Intellectual Property (IP) field often struggle with diverse user queries, including colloquial expressions, spelling errors, and ambiguous terminology, leading to inaccurate retrieval and suboptimal responses. To address this challenge, we propose Multi-Angle Question Generation and Retrieval Fine-Tuning Method (MQG-RFM), a novel framework that leverages large language models (LLMs) to simulate varied user inquiries and fine-tunes retrieval models to align semantically equivalent but linguistically diverse questions. Unlike complex architectural modifications, MQG-RFM adopts a lightweight Data-to-Tune paradigm, combining prompt-engineered query generation with hard negative mining to enhance retrieval robustness without costly infrastructure changes. Experimental results on a Taiwan patent QA dataset show 185.62% improvement in retrieval accuracy on the Patent Consultation dataset and 262.26% improvement on the Novel Patent Technology Report dataset, with 14.22% and 53.58% improvements in generation quality over the baselines, respectively. By bridging the gap between user intent and system comprehension through semantic-aware retrieval optimization, MQG-RFM offers a practical, scalable approach for rapid, cost-effective deployment among small and medium-sized agencies seeking reliable patent intelligence solutions. Additionally, our proposed method has already been adopted by ScholarMate, the largest professional research social networking platform in China, to support real-world development and deployment. A demo version of the instantiated is available at this https URL.
zh

[NLP-214] CausalAbstain: Enhancing Multilingual LLM s with Causal Reasoning for Trustworthy Abstention ACL

【速读】: 该论文旨在解决多语言场景下大型语言模型(Large Language Models, LLMs)因语言间知识差异而产生的幻觉问题,特别是通过鼓励模型在知识空白时进行自我回避(abstention)来减少错误输出。其解决方案的关键在于提出一种基于因果关系的回避方法——\textitCausalAbstain,该方法帮助LLMs判断是否应利用多个生成的反馈响应,并识别最有用的反馈,从而提升回避决策的准确性和可解释性。

链接: https://arxiv.org/abs/2506.00519
作者: Yuxi Sun,Aoqi Zuo,Wei Gao,Jing Ma
机构: Hong Kong Baptist University (香港浸会大学); The University of Melbourne (墨尔本大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Association for Computational Linguistics Findings (ACL) 2025

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit knowledge disparities across languages. Encouraging LLMs to \textitabstain when faced with knowledge gaps is a promising strategy to reduce hallucinations in multilingual settings. Current abstention strategies for multilingual scenarios primarily rely on generating feedback in various languages using LLMs and performing self-reflection. However, these methods can be adversely impacted by inaccuracies and biases in the generated feedback. To address this, from a causal perspective, we introduce \textitCausalAbstain, a method that helps LLMs determine whether to utilize multiple generated feedback responses and how to identify the most useful ones. Extensive experiments demonstrate that \textitCausalAbstain effectively selects helpful feedback and enhances abstention decisions with interpretability in both native language (\textscCasual-native) and multilingual (\textscCausal-multi) settings, outperforming strong baselines on two benchmark datasets covering encyclopedic and commonsense knowledge QA tasks. Our code and data are open-sourced at this https URL.
zh

[NLP-215] Evaluating the Evaluation of Diversity in Commonsense Generation ACL2025

【速读】: 该论文试图解决在常识生成任务中,如何准确评估生成内容多样性的问题。现有基于形式和内容层面重叠的评价指标在评估常识生成模型的多样性时存在不足,尤其形式类指标容易高估句子集的多样性。论文的关键解决方案是通过使用大型语言模型(LLM)构建一个标注了句子多样性的新数据集,并基于此数据集对现有多样性评估指标进行元评估,结果表明内容类指标在与LLM评分的高相关性方面优于形式类指标。

链接: https://arxiv.org/abs/2506.00514
作者: Tianhui Zhang,Bei Peng,Danushka Bollegala
机构: University of Liverpool(利物浦大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Main

点击查看摘要

Abstract:In commonsense generation, given a set of input concepts, a model must generate a response that is not only commonsense bearing, but also capturing multiple diverse viewpoints. Numerous evaluation metrics based on form- and content-level overlap have been proposed in prior work for evaluating the diversity of a commonsense generation model. However, it remains unclear as to which metrics are best suited for evaluating the diversity in commonsense generation. To address this gap, we conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets, where even randomly generated sentences are assigned overly high diversity scores. We then use an Large Language Model (LLM) to create a novel dataset annotated for the diversity of sentences generated for a commonsense generation task, and use it to conduct a meta-evaluation of the existing diversity evaluation metrics. Our experimental results show that content-based diversity evaluation metrics consistently outperform the form-based counterparts, showing high correlations with the LLM-based ratings. We recommend that future work on commonsense generation should use content-based metrics for evaluating the diversity of their outputs.
zh

[NLP-216] Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems

【速读】: 该论文旨在解决基于大型语言模型的多智能体系统(MASs)在面对虚假信息注入时的脆弱性问题。其关键解决方案是提出ARGUS,一个无需训练的两阶段防御框架,通过目标感知推理实现信息流中虚假信息的精准修正。

链接: https://arxiv.org/abs/2506.00509
作者: Zherui Li,Yan Mi,Zhenhong Zhou,Houcheng Jiang,Guibin Zhang,Kun Wang,Junfeng Fang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Nanyang Technological University (南洋理工大学); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model-based Multi-Agent Systems (MASs) have demonstrated strong advantages in addressing complex real-world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce MisinfoTask, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose ARGUS, a two-stage, training-free defense framework leveraging goal-aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%. Our code and dataset is available at: this https URL.
zh

[NLP-217] Exploring In-context Example Generation for Machine Translation ACL2025

【速读】: 该论文试图解决在机器翻译中,由于缺乏人工标注的示例对而导致的低资源语言场景下上下文示例选择效果受限的问题。其解决方案的关键在于提出一种无需依赖外部资源的示例生成方法——翻译示范增强(Demonstration Augmentation for Translation, DAT),该方法基于相关性和多样性两个先验标准生成示例对,从而有效提升低资源语言的翻译质量。

链接: https://arxiv.org/abs/2506.00507
作者: Dohyun Lee,Seungil Chad Lee,Chanwoo Yang,Yujin Baek,Jaegul Choo
机构: KAIST AI(KAIST人工智能); Jeonbuk National University(全北国立大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at this https URL.
zh

[NLP-218] FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts

【速读】: 该论文旨在解决现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在适配预训练大型语言模型(Large Language Models, LLMs)时存在的冗余参数分配和适应效率不足的问题。现有方法通常在所有层中统一部署低秩适配器(LoRA adapters),忽略了各层对任务的贡献差异及任务特定的秩需求。论文提出的解决方案FLoE的关键在于引入两项创新:一是基于Fisher信息的重要性评分机制,用于动态识别任务关键的Transformer层并实现稀疏适配器部署;二是基于贝叶斯优化的秩分配器,可自动确定特定数据集上的最优LoRA秩,无需穷举网格搜索。

链接: https://arxiv.org/abs/2506.00495
作者: Xinyi Wang,Lirong Gao,Haobo Wang,Yiming Zhang,Junbo Zhao
机构: Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a widely adopted strategy for adapting pre-trained Large Language Models (LLMs) to downstream tasks, significantly reducing memory and computational costs. However, most existing PEFT techniques uniformly deploy LoRA adapters across all layers, disregarding the intrinsic heterogeneity of layer contributions and task-specific rank requirements. This uniform paradigm leads to redundant parameter allocation and suboptimal adaptation efficiency. To address these limitations, we propose FLoE, a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal LoRA ranks on specific datasets without exhaustive grid search. Extensive experiments across diverse LLMs and benchmarks reveal that FLoE achieves impressive efficiency-accuracy trade-offs, making FLoE particularly advantageous in resource-constrained environments that necessitate rapid adaptation.
zh

[NLP-219] Synergizing LLM s with Global Label Propagation for Multimodal Fake News Detection ACL2025

【速读】: 该论文试图解决生成式 AI (Generative AI) 生成的伪标签在多模态虚假新闻检测中表现不佳,难以有效集成的问题。解决方案的关键在于提出基于 LLM 的全局标签传播网络(GLPN-LLM),通过标签传播技术整合 LLM 的能力,利用全局标签传播机制增强预测准确性,并设计基于掩码的机制以防止训练过程中标签泄露。

链接: https://arxiv.org/abs/2506.00488
作者: Shuguo Hu,Jun Hu,Huaiwen Zhang
机构: Inner Mongolia University (内蒙古大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 Main Conference

点击查看摘要

Abstract:Large Language Models (LLMs) can assist multimodal fake news detection by predicting pseudo labels. However, LLM-generated pseudo labels alone demonstrate poor performance compared to traditional detection methods, making their effective integration non-trivial. In this paper, we propose Global Label Propagation Network with LLM-based Pseudo Labeling (GLPN-LLM) for multimodal fake news detection, which integrates LLM capabilities via label propagation techniques. The global label propagation can utilize LLM-generated pseudo labels, enhancing prediction accuracy by propagating label information among all samples. For label propagation, a mask-based mechanism is designed to prevent label leakage during training by ensuring that training nodes do not propagate their own labels back to themselves. Experimental results on benchmark datasets show that by synergizing LLMs with label propagation, our model achieves superior performance over state-of-the-art baselines.
zh

[NLP-220] Auto-Patching: Enhancing Multi-Hop Reasoning in Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理多跳问题(multi-hop questions)时的困难,这类问题需要模型在多个推理步骤之间建立信息联系。论文提出的解决方案是Auto-Patch,其关键在于在推理过程中动态地对隐藏状态进行修补,通过一个学习得到的分类器选择性地修改内部表示,从而增强模型的多跳推理能力。

链接: https://arxiv.org/abs/2506.00483
作者: Aviv Jan,Dean Tahory,Omer Talmi,Omar Abo Mokh
机构: Tel Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Multi-hop questions still stump large language models (LLMs), which struggle to link information across multiple reasoning steps. We introduce Auto-Patch, a novel method that dynamically patches hidden states during inference to enhance multi-hop reasoning in LLMs. Building on the PatchScopes framework, Auto-Patch selectively modifies internal representations using a learned classifier. Evaluated on the MuSiQue dataset, Auto-Patch improves the solve rate from 18.45% (baseline) to 23.63~ \pm ~0.7% (3 runs), narrowing the gap to Chain-of-Thought prompting (27.44%). Our results highlight the potential of dynamic hidden state interventions for advancing complex reasoning in LLMs.
zh

[NLP-221] BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

【速读】: 该论文试图解决现有基准数据集分散、难以管理以及难以针对特定需求或领域进行评估的问题(domain-specific evaluation challenge)。其解决方案的关键在于构建BenchHub,一个动态基准仓库,它能够聚合并自动分类来自不同领域的基准数据集,集成38个基准中的303,000个问题,并支持持续更新和可扩展的数据管理,从而实现针对不同领域或使用场景的灵活定制化评估。

链接: https://arxiv.org/abs/2506.00482
作者: Eunsu Kim,Haneul Yoo,Guijin Son,Hitesh Patel,Amit Agarwal,Alice Oh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
zh

[NLP-222] PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies Viewer Characteristics and Persuasiveness Ratings ACL2025

【速读】: 该论文试图解决个性化视觉说服中缺乏综合性数据集的问题,即现有研究缺少将图像的说服力与评估者个人特征(如人口统计信息、人格特质和价值观)相联系的数据。解决方案的关键在于发布Personalized Visual Persuasion (PVP)数据集,该数据集包含28,454张说服性图像、596条信息和9种说服策略,并附有2,521名人类标注者的说服力评分及其心理特征数据,从而为个性化视觉说服技术的发展提供了基础支持。

链接: https://arxiv.org/abs/2506.00481
作者: Junseo Kim,Jongwook Han,Dongmin Choi,Jongwook Yoon,Eun-Ju Lee,Yohan Jo
机构: Seoul National University (首尔大学); Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Main. Code and dataset are released at: this https URL

点击查看摘要

Abstract:Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
zh

[NLP-223] EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models ACL2025

【速读】: 该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在实际部署中因计算需求高而面临的效率问题。其解决方案的关键在于系统性评估主流的加速技术,特别是分为令牌压缩和参数压缩两类,并引入EffiVLM-Bench这一统一框架,以全面评估模型的绝对性能、泛化能力和忠实度,同时探索帕累托最优的权衡策略。

链接: https://arxiv.org/abs/2506.00479
作者: Zekun Wang,Minghua Ma,Zexin Wang,Rongchuan Mu,Liping Shan,Ming Liu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室); Du Xiaoman Science Technology Co., Ltd (杜小满科技有限公司)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ACL 2025

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success, yet their significant computational demands hinder practical deployment. While efforts to improve LVLM efficiency are growing, existing methods lack comprehensive evaluation across diverse backbones, benchmarks, and metrics. In this work, we systematically evaluate mainstream acceleration techniques for LVLMs, categorized into token and parameter compression. We introduce EffiVLM-Bench, a unified framework for assessing not only absolute performance but also generalization and loyalty, while exploring Pareto-optimal trade-offs. Our extensive experiments and in-depth analyses offer insights into optimal strategies for accelerating LVLMs. We open-source code and recipes for EffiVLM-Bench to foster future research.
zh

[NLP-224] Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

【速读】: 该论文旨在解决大规模多语言持续预训练中的关键设计问题,即是否包含平行数据(parallel data)对模型多语言适应能力的影响。其解决方案的关键在于构建大规模的双语翻译语料库MaLA,并基于此开发出EMMA-500 Llama 3系列模型,通过在大量多样化数据混合(最多达671B token)上进行持续预训练,评估包含或不包含双语翻译数据对模型性能的影响。实验结果表明,双语数据有助于提升语言迁移能力和性能,尤其是在低资源语言上表现更为显著。

链接: https://arxiv.org/abs/2506.00469
作者: Shaoxiong Ji,Zihao Li,Jaakko Paavola,Indraneil Paul,Hengyu Luo,Jörg Tiedemann
机构: 未知
类目: Computation and Language (cs.CL)
备注: EMMA-500 Gen 2; refer to Gen 1 in arXiv:2409.17892

点击查看摘要

Abstract:This paper investigates a critical design decision in the practice of massively multilingual continual pre-training – the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models – continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens – and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
zh

[NLP-225] XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark

【速读】: 该论文试图解决音频深度伪造(audio deepfake)检测模型在跨领域场景下性能显著下降的问题,即当前检测方法在域内(in-domain)测试中表现出接近99%的高准确率,但在跨领域(cross-domain)场景下的表现却可能接近随机猜测。解决方案的关键在于构建一个大规模、多语言、跨领域的音频深度伪造基准数据集XMAD-Bench,该数据集包含668.8小时的真实与伪造语音,并确保训练集和测试集在说话人、生成方法和真实音频来源方面具有显著差异,从而模拟实际应用中的“野外”环境,推动鲁棒性更强的音频深度伪造检测技术的发展。

链接: https://arxiv.org/abs/2506.00462
作者: Ioan-Paul Ciobanu,Andrei-Iulian Hiji,Nicolae-Catalin Ristea,Paul Irofti,Cristian Rusu,Radu Tudor Ionescu
机构: University of Bucharest(布加勒斯特大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent advances in audio generation led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested ``in the wild’'. Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at this https URL.
zh

[NLP-226] Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

【速读】: 该论文试图解决在患者-临床医生对话摘要过程中大型语言模型(Large Language Models, LLMs)产生的幻觉(hallucinations)问题,这些问题可能对患者护理和临床决策构成重大风险。现有通用领域的幻觉检测方法在临床领域中的适用性尚不明确,且幻觉的稀有性和随机性增加了研究难度。论文的关键解决方案是构建两个数据集——一个为事实控制的Leave-N-out数据集,通过系统性地从源对话中移除事实以诱导摘要中的幻觉内容;另一个为自然幻觉数据集,来源于基于LLM的医学摘要过程中的自发幻觉。研究发现,通用领域检测器难以有效识别临床幻觉,且在事实控制幻觉上的表现不能可靠预测其在自然幻觉上的效果。随后,论文提出基于事实的检测方法,能够计数幻觉并提供现有方法无法实现的可解释性,其中基于事实控制幻觉训练的LLM检测器在现实临床幻觉检测中表现出良好的泛化能力。

链接: https://arxiv.org/abs/2506.00448
作者: Suhas BN,Han-Chin Shing,Lei Xu,Mitch Strong,Jon Burnsky,Jessica Ofor,Jordan R. Mason,Susan Chen,Sundararajan Srinivasan,Chaitanya Shivade,Jack Moriarty,Joseph Paul Cohen
机构: 未知
类目: Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for the purpose: A fact-controlled Leave-N-out dataset – generated by systematically removing facts from source dialogues to induce hallucinated content in summaries; and a natural hallucination dataset – arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, which we developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.
zh

[NLP-227] G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models ACL2025

【速读】: 该论文旨在解决在时间知识图谱(Temporal Knowledge Graph, TKG)预测任务中,大型语言模型(Large Language Models, LLMs)因同时学习通用模式与场景信息而导致的相互干扰问题,从而影响模型的泛化能力。解决方案的关键在于提出一种从通用到具体的学习框架(General-to-Specific learning framework, G2S),通过分阶段学习策略分离通用模式与场景信息的学习过程:在通用学习阶段,通过掩码场景信息并转换为匿名时间结构来捕捉跨不同TKG的通用模式;在具体学习阶段,通过上下文学习或微调方式注入场景信息,以提升模型对特定场景的适应能力。

链接: https://arxiv.org/abs/2506.00445
作者: Long Bai,Zixuan Li,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng,Tat-Seng Chua
机构: Chinese Academy of Sciences (中国科学院); Institute of Computing Technology (计算技术研究所); State Key Laboratory of AI Safety (人工智能安全国家重点实验室); School of Computer Science, University of Chinese Academy of Sciences (中国科学院大学计算机科学学院); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2025

点击查看摘要

Abstract:Forecasting over Temporal Knowledge Graphs (TKGs) which predicts future facts based on historical ones has received much attention. Recent studies have introduced Large Language Models (LLMs) for this task to enhance the models’ generalization abilities. However, these models perform forecasting via simultaneously learning two kinds of entangled knowledge in the TKG: (1) general patterns, i.e., invariant temporal structures shared across different scenarios; and (2) scenario information, i.e., factual knowledge engaged in specific scenario, such as entities and relations. As a result, the learning processes of these two kinds of knowledge may interfere with each other, which potentially impact the generalization abilities of the models. To enhance the generalization ability of LLMs on this task, in this paper, we propose a General-to-Specific learning framework (G2S) that disentangles the learning processes of the above two kinds of knowledge. In the general learning stage, we mask the scenario information in different TKGs and convert it into anonymous temporal structures. After training on these structures, the model is able to capture the general patterns across different TKGs. In the specific learning stage, we inject the scenario information into the structures via either in-context learning or fine-tuning modes. Experimental results show that G2S effectively improves the generalization abilities of LLMs.
zh

[NLP-228] Inter-Passage Verification for Multi-evidence Multi-answer QA ACL2025

【速读】: 该论文旨在解决多答案问答(multi-answer QA)问题,即针对可能具有多个有效答案的问题,现有基于检索增强生成的问答系统在检索和综合大量证据段落方面存在困难。解决方案的关键在于提出一种新的多答案问答框架——检索增强的独立阅读与跨段验证(RI²VER),该框架首先检索大量段落并独立处理以生成初始高召回但噪声较多的答案集,随后通过跨段验证流程对每个候选答案进行验证,包括验证问题生成、额外证据收集以及跨段合成验证,从而有效提升答案的准确性和多证据合成能力。

链接: https://arxiv.org/abs/2506.00425
作者: Bingsen Chen,Shengjie Wang,Xi Ye,Chen Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures, to appear in ACL 2025 Findings

点击查看摘要

Abstract:Multi-answer question answering (QA), where questions can have many valid answers, presents a significant challenge for existing retrieval-augmented generation-based QA systems, as these systems struggle to retrieve and then synthesize a large number of evidence passages. To tackle these challenges, we propose a new multi-answer QA framework – Retrieval-augmented Independent Reading with Inter-passage Verification (RI ^2 VER). Our framework retrieves a large set of passages and processes each passage individually to generate an initial high-recall but noisy answer set. Then we propose a new inter-passage verification pipeline that validates every candidate answer through (1) Verification Question Generation, (2) Gathering Additional Evidence, and (3) Verification with inter-passage synthesis. Evaluations on the QAMPARI and RoMQA datasets demonstrate that our framework significantly outperforms existing baselines across various model sizes, achieving an average F1 score improvement of 11.17%. Further analysis validates that our inter-passage verification pipeline enables our framework to be particularly beneficial for questions requiring multi-evidence synthesis.
zh

[NLP-229] DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition INTERSPEECH2025

【速读】: 该论文试图解决在自动语音识别中,针对罕见和未见过短语的上下文偏差(Contextual Biasing, CB)问题,同时提升模型的推理速度。传统方法通过动态词汇(Dynamic Vocabulary)在自回归(AR)模型中表示上下文短语,虽然提高了CB的准确性,但存在推理速度慢的问题。而将动态词汇应用于非自回归(NAR)模型(如连接时序分类,CTC)时,由于条件独立性假设无法捕捉静态与动态标记之间的依赖关系。该论文提出的DYNAC(Dynamic Vocabulary-based NAR Contextualization)方法,通过在中间层集成动态词汇,并让编码器基于动态词汇进行自条件化,有效捕捉了静态与动态标记间的依赖关系,同时显著降低了实时因子(RTF)。实验结果表明,DYNAC在LibriSpeech 960 test-clean数据集上将RTF降低了81%,仅导致词错误率(WER)0.1个百分点的下降。

链接: https://arxiv.org/abs/2506.00422
作者: Yui Sudo,Yosuke Fukumoto,Muhammad Shakeel,Yifan Peng,Chyi-Jiunn Lin,Shinji Watanabe
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but with slow inference speed. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. Conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
zh

[NLP-230] Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions ACL2025

【速读】: 该论文试图解决当前聊天机器人在多模态交互中存在的一系列问题,包括过度侧重视觉模态而忽视听觉模态、静态交互方式限制了动态对话的丰富性,以及任务特定约束阻碍了多轮次、多方参与对话中的多模态无缝融合。其解决方案的关键在于引入一个名为Multimodal Multi-Session Multi-Party Conversation (M^3C)的新多模态对话数据集,并提出一种具有多模态记忆检索能力的新型多模态对话模型,从而实现更自然、沉浸式的多模态人机交互。

链接: https://arxiv.org/abs/2506.00421
作者: Jihyoung Jang,Minwook Bae,Minji Kim,Dilek Hakkani-Tur,Hyounghun Kim
机构: POSTECH(浦项科技大学); UNIST(蔚山国立科学技术院); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2025 (32 pages); Project website: this https URL

点击查看摘要

Abstract:As chatbots continue to evolve toward human-like, real-world, interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the “eyes” of human perception while neglecting the “ears”, namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with “eyes and ears” capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation ( M^3C ), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on the M^3C , demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model’s strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.
zh

[NLP-231] Dual Debiasing for Noisy In-Context Learning for Text Generation ACL

【速读】: 该论文试图解决在上下文学习(ICL)中,由于标注数据存在噪声而导致的样本清洁度评估不准确问题。现有方法通过局部困惑度排序来检测噪声标注,但在噪声比例较高时该假设失效。论文的关键解决方案是引入一种双去偏框架,利用合成邻居显式校正困惑度估计,从而生成鲁棒的样本清洁度评分(Sample Cleanliness Score),该评分能够独立于整体语料库的噪声水平准确反映样本的清洁程度。

链接: https://arxiv.org/abs/2506.00418
作者: Siqi Liang,Sumyeong Ahn,Paramveer S. Dhillon,Jiayu Zhou
机构: University of Michigan (密歇根大学); KENTECH (KENTECH)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by 2025 ACL Findings

点击查看摘要

Abstract:In context learning (ICL) relies heavily on high quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method’s superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.
zh

[NLP-232] Accelerating Diffusion LLM s via Adaptive Parallel Decoding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自回归解码(autoregressive decoding)过程中生成速度缓慢的问题,该过程逐个预测token,导致效率受限。尽管扩散大语言模型(diffusion large language models, dLLMs)理论上支持并行生成token,但实际应用中难以在不显著降低质量的情况下达到自回归模型的速度。论文提出的解决方案是自适应并行解码(Adaptive Parallel Decoding, APD),其关键在于通过定义dLLM边缘概率与一个小辅助自回归模型下序列联合概率之间的乘积混合,动态调整并行采样的token数量,从而实现吞吐量与质量之间的灵活权衡。

链接: https://arxiv.org/abs/2506.00413
作者: Daniel Israel,Guy Van den Broeck,Aditya Grover
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
zh

[NLP-233] Causal Structure Discovery for Error Diagnostics of Childrens ASR INTERSPEECH2025

【速读】: 该论文试图解决儿童自动语音识别(Automatic Speech Recognition, ASR)性能低于成人的问题,该问题由生理、认知和外部因素的相互依赖关系共同导致。现有分析方法通常孤立地研究这些因素,忽略了它们之间的复杂关联。论文的关键在于引入因果结构发现(causal structure discovery)以揭示生理、认知及外部因素与ASR错误之间的相互依赖关系,并通过因果量化(causal quantification)评估各因素对儿童ASR的影响。此外,研究还扩展至微调模型,以确定哪些因素可通过微调缓解,哪些仍保持不变。

链接: https://arxiv.org/abs/2506.00402
作者: Vishwanath Pratap Singh,Md. Sahidullah,Tomi Kinnunen
机构: School of Computing (计算学院); Institute for Advancing Intelligence (推进智能研究所)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Interspeech 2025

点击查看摘要

Abstract:Children’s automatic speech recognition (ASR) often underperforms compared to that of adults due to a confluence of interdependent factors: physiological (e.g., smaller vocal tracts), cognitive (e.g., underdeveloped pronunciation), and extrinsic (e.g., vocabulary limitations, background noise). Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies-such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce a causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. Then, we employ causal quantification to measure each factor’s impact on children’s ASR. We extend the analysis to fine-tuned models to identify which factors are mitigated by fine-tuning and which remain largely unaffected. Experiments on Whisper and Wav2Vec2.0 demonstrate the generalizability of our findings across different ASR systems.
zh

[NLP-234] Scaling Textual Gradients via Sampling-Based Momentum

【速读】: 该论文试图解决在大规模语言模型(Large Language Models, LLMs)中,通过优化文本提示(textual prompts)以提升下游自然语言处理(NLP)任务性能的问题。其关键解决方案是提出一种名为Textual Stochastic Gradient Descent with Momentum (TSGD-M) 的方法,该方法通过基于历史批次分布重新加权提示采样,实现可扩展的上下文学习,从而在多个NLP任务中显著优于未采用重加权采样的Textual Gradient Descent (TGD) 基线,并降低大多数任务的方差。

链接: https://arxiv.org/abs/2506.00400
作者: Zixin Ding,Junyuan Hong,Jiachen T. Wang,Zinan Lin,Zhangyang Wang,Yuxin Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As prompts play an increasingly critical role in large language models (LLMs), optimizing textual prompts has become a crucial challenge. The Textual Gradient Descent (TGD) framework has emerged as a promising data-driven approach that iteratively refines textual prompts using LLM - suggested updates (or textual gradients) over minibatches of training samples. In this paper, we empirically demonstrate that scaling the number of training examples initially improves but later degrades TGD’s performance across multiple downstream NLP tasks. However, while data scaling improves results for most tasks, it also significantly increases the computational cost when leveraging LLMs. To address this, we draw inspiration from numerical gradient descent and propose Textual Stochastic Gradient Descent with Momentum (TSGD-M) - a method that facilitates scalable in-context learning by reweighting prompt sampling based on past batch distributions. Across nine NLP tasks spanning three domains - including BIG-Bench Hard (BBH), natural language understanding tasks, and reasoning tasks - TSGD-M significantly outperforms TGD baselines that do not incorporate reweighted sampling, while also reducing variance in most tasks.
zh

[NLP-235] Speculative Reward Model Boosts Decision Making Ability of LLM s Cost-Effectively ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策过程中效率与效果之间失衡的问题,即现有方法过于注重性能而忽视了计算成本的优化。其解决方案的关键在于提出一种名为推测奖励模型(Speculative Reward Model, SRM)的即插即用框架,该框架通过引入外部奖励分配器预测最优动作,减少对模型内部自评估的依赖,并结合推测验证机制剪枝次优选择,从而在保持效果的同时显著降低计算成本。

链接: https://arxiv.org/abs/2506.00396
作者: Jiawei Gu,Shangsong Liang
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL2025 Oral (Industry Track)

点击查看摘要

Abstract:Effective decision-making in Large Language Models (LLMs) is essential for handling intricate tasks. However, existing approaches prioritize performance but often overlook the balance between effectiveness and computational cost. To address this, we first introduce the 3E Criteria to systematically assess the cost-effectiveness of search strategies, revealing that existing methods often trade significant efficiency for marginal performance gains. To improve LLM decision-making while maintaining efficiency, we propose the Speculative Reward Model (SRM), a plug-and-play framework that seamlessly integrates with existing search strategies. Specifically, SRM employs an external reward assigner to predict optimal actions, reducing reliance on LLMs’ internal self-evaluation. And a speculative verification mechanism is used to prune suboptimal choices and guide the search toward more promising steps. We evaluate SRM on several complex decision-making tasks including mathematical reasoning, planning and numerical reasoning in specialized domains. Experimental results show that SRM reduces costs to 1/10 of the original search framework on average while maintaining effectiveness.
zh

[NLP-236] SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL ACL2025

【速读】: 该论文旨在解决文本到SQL(text-to-SQL)任务中自校正方法的两个关键问题:一是传统自校正方法依赖于大语言模型(LLM)的递归调用,导致计算开销呈指数级增长;二是LLM在处理声明式SQL查询时难以实现有效的错误检测与修正,因为其无法展示底层推理过程。解决方案的关键在于提出SHARE,一个基于小语言模型(SLM)的分层动作校正助手,通过顺序流水线协调三个专业SLM,将声明式SQL查询转换为揭示底层推理的逐步动作轨迹,并进行两阶段细粒度优化,同时引入一种新的分层自进化策略以实现数据高效的训练。

链接: https://arxiv.org/abs/2506.00391
作者: Ge Qu,Jinyang Li,Bowen Qin,Xiaolong Li,Nan Huo,Chenhao Ma,Reynold Cheng
机构: The University of Hong Kong(香港大学); BAAI(百度研究院); The Chinese University of Hong Kong, Shenzhen(香港中文大学,深圳)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Main

点击查看摘要

Abstract:Current self-correction approaches in text-to-SQL face two critical limitations: 1) Conventional self-correction methods rely on recursive self-calls of LLMs, resulting in multiplicative computational overhead, and 2) LLMs struggle to implement effective error detection and correction for declarative SQL queries, as they fail to demonstrate the underlying reasoning path. In this work, we propose SHARE, an SLM-based Hierarchical Action corREction assistant that enables LLMs to perform more precise error localization and efficient correction. SHARE orchestrates three specialized Small Language Models (SLMs) in a sequential pipeline, where it first transforms declarative SQL queries into stepwise action trajectories that reveal underlying reasoning, followed by a two-phase granular refinement. We further propose a novel hierarchical self-evolution strategy for data-efficient training. Experimental results demonstrate that SHARE effectively enhances self-correction capabilities while proving robust across various LLMs. Furthermore, our comprehensive analysis shows that SHARE maintains strong performance even in low-resource training settings, which is particularly valuable for text-to-SQL applications with data privacy constraints.
zh

[NLP-237] Adaptive-VP: A Framework for LLM -Based Virtual Patients that Adapts to Trainees Dialogue to Facilitate Nurse Communication Training ACL2025

【速读】: 该论文旨在解决传统标准化患者(Standardized Patient, SP)模拟在成本和灵活性方面的局限性,以及现有虚拟患者(Virtual Patient, VP)系统在适应学员沟通技能差异方面的不足。其关键解决方案是提出一种基于大型语言模型(Large Language Model, LLM)的自适应VP对话生成框架——Adaptive-VP,该框架能够根据学员输入动态调整VP行为,实现更具真实感和适应性的交互体验。

链接: https://arxiv.org/abs/2506.00386
作者: Keyeun Lee,Seolhee Lee,Esther Hehsun Kim,Yena Ko,Jinsu Eun,Dahee Kim,Hyewon Cho,Haiyi Zhu,Robert E. Kraut,Eunyoung Suh,Eun-mee Kim,Hajin Lim
机构: Seoul National University (首尔国立大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: ACL 2025 Findings, 34 pages, 9 figures

点击查看摘要

Abstract:Effective communication training is essential to preparing nurses for high-quality patient care. While standardized patient (SP) simulations provide valuable experiential learning, they are often costly and inflexible. Virtual patient (VP) systems offer a scalable alternative, but most fail to adapt to the varying communication skills of trainees. In particular, when trainees respond ineffectively, VPs should escalate in hostility or become uncooperative–yet this level of adaptive interaction remains largely unsupported. To address this gap, we introduce Adaptive-VP, a VP dialogue generation framework that leverages large language models (LLMs) to dynamically adapt VP behavior based on trainee input. The framework features a pipeline for constructing clinically grounded yet flexible VP scenarios and a modular system for assessing trainee communication and adjusting VP responses in real time, while ensuring learner safety. We validated Adaptive-VP by simulating challenging patient conversations. Automated evaluation using a corpus from practicing nurses showed that our communication skill evaluation mechanism reflected real-world proficiency levels. Expert nurses further confirmed that Adaptive-VP produced more natural and realistic interactions than existing approaches, demonstrating its potential as a scalable and effective tool for nursing communication training.
zh

[NLP-238] Spectral Insights into Data-Oblivious Critical Layers in Large Language Models ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中特征表示在不同层之间的演化机制问题,旨在提升模型的可解释性和鲁棒性。其解决方案的关键在于提出一种数据无关(data-oblivious)的方法,通过分析表示动态性(representation dynamics)来识别预微调LLMs中的内在关键层,具体采用中心核对齐(Centered Kernel Alignment, CKA)进行分析。研究发现,这些关键层在微调过程中表现出显著的表示空间变化,并且这种变化由顶层主成分的变化驱动,这些主成分编码了从前提到结论的语义转换。

链接: https://arxiv.org/abs/2506.00382
作者: Xuyuan Liu,Lei Hsiung,Yaoqing Yang,Yujun Yan
机构: Dartmouth College (达特茅斯学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted by Findings of ACL2025

点击查看摘要

Abstract:Understanding how feature representations evolve across layers in large language models (LLMs) is key to improving their interpretability and robustness. While recent studies have identified critical layers linked to specific functions or behaviors, these efforts typically rely on data-dependent analyses of fine-tuned models, limiting their use to post-hoc settings. In contrast, we introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned LLMs by analyzing representation dynamics via Centered Kernel Alignment(CKA). We show that layers with significant shifts in representation space are also those most affected during fine-tuning–a pattern that holds consistently across tasks for a given model. Our spectral analysis further reveals that these shifts are driven by changes in the top principal components, which encode semantic transitions from rationales to conclusions. We further apply these findings to two practical scenarios: efficient domain adaptation, where fine-tuning critical layers leads to greater loss reduction compared to non-critical layers; and backdoor defense, where freezing them reduces attack success rates by up to 40%.
zh

[NLP-239] Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG INTERSPEECH2025

【速读】: 该论文试图解决从神经信号中解码连续语言的问题(Decoding continuous language from neural signals),这是神经科学与人工智能交叉领域的一个重大挑战。其解决方案的关键在于提出了一种名为Neuro2Semantic的新框架,该框架包含两个阶段:首先使用基于LSTM的适配器将神经信号与预训练的文本嵌入对齐;其次通过校正模块直接从对齐后的嵌入生成连续且自然的文本。这一灵活方法克服了以往解码方法的局限性,实现了无约束的文本生成。

链接: https://arxiv.org/abs/2506.00381
作者: Siavash Shams,Richard Antonello,Gavin Mischler,Stephan Bickel,Ashesh Mehta,Nima Mesgarani
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注: Accepted at Interspeech 2025 Code at this https URL

点击查看摘要

Abstract:Decoding continuous language from neural signals remains a significant challenge in the intersection of neuroscience and artificial intelligence. We introduce Neuro2Semantic, a novel framework that reconstructs the semantic content of perceived speech from intracranial EEG (iEEG) recordings. Our approach consists of two phases: first, an LSTM-based adapter aligns neural signals with pre-trained text embeddings; second, a corrector module generates continuous, natural text directly from these aligned embeddings. This flexible method overcomes the limitations of previous decoding approaches and enables unconstrained text generation. Neuro2Semantic achieves strong performance with as little as 30 minutes of neural data, outperforming a recent state-of-the-art method in low-data settings. These results highlight the potential for practical applications in brain-computer interfaces and neural decoding technologies.
zh

[NLP-240] Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval

【速读】: 该论文旨在解决通用文本嵌入模型在应用于私有数据集(如公司特定的专有数据)时效果下降的问题,这些问题通常包含专业术语和行话。解决方案的关键在于引入BMEmbed方法,通过利用基于关键词的检索技术(BM25)从关键词检索结果的排名中构建监督信号,从而促进模型适应私有数据集。该方法通过增强嵌入表示的一致性和对齐性,提升了检索性能。

链接: https://arxiv.org/abs/2506.00363
作者: Yubai Wei,Jiale Han,Yi Yang
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Link: this https URL

点击查看摘要

Abstract:Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code available at this https URL for the research community.
zh

[NLP-241] Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLM s

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在测试阶段计算扩展(test-time computation)中因冗余推理路径导致的计算效率低下和语义理解不足的问题。其解决方案的关键在于提出一种轻量且具有上下文敏感性的潜在语义聚类(Latent Semantic Clustering, LSC)方法,该方法利用生成模型内部的隐藏状态进行聚类,无需依赖外部模型,从而显著提升计算效率并保持或超越现有方法的性能。

链接: https://arxiv.org/abs/2506.00344
作者: Sungjae Lee,Hoyoung Kim,Jeongyeon Hwang,Eunhyeok Park,Jungseul Ok
机构: POSTECH(浦项科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling test-time computation–generating and analyzing multiple or sequential outputs for a single input–has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances in uncertainty quantification and multi-step reasoning. A key shared component is semantic clustering, which groups outputs that differ in form but convey the same meaning. Semantic clustering enables estimation of the distribution over the semantics of outputs and helps avoid redundant exploration of reasoning paths. However, existing approaches typically rely on external models, which introduce substantial computational overhead and often fail to capture context-aware semantics. We propose Latent Semantic Clustering (LSC), a lightweight and context-sensitive method that leverages the generator LLM’s internal hidden states for clustering, eliminating the need for external models. Our extensive experiment across various LLMs and datasets shows that LSC significantly improves the computational efficiency of test-time scaling while maintaining or exceeding the performance of existing methods.
zh

[NLP-242] OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning INTERSPEECH2025

【速读】: 该论文旨在解决开放语音基础模型(Open Whisper-style Speech Models, OWSM)训练数据不足的问题,尤其是针对多语言场景下的性能提升。其关键解决方案是整合大规模网络爬取的YODAS数据集,并通过可扩展的数据清洗流程处理其复杂性,如错误的语言标签和音频-文本对齐问题,最终构建了一个包含75种语言、166,000小时语音的高质量数据集。基于此数据集训练的OWSM v4模型在多语言基准测试中显著优于之前版本,并在多个场景中达到或超过前沿工业模型如Whisper和MMS的性能。

链接: https://arxiv.org/abs/2506.00338
作者: Yifan Peng,Shakeel Muhammad,Yui Sudo,William Chen,Jinchuan Tian,Chyi-Jiunn Lin,Shinji Watanabe
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at INTERSPEECH 2025

点击查看摘要

Abstract:The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.
zh

[NLP-243] Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models

【速读】: 该论文试图解决文本情感识别任务中,如何利用隐含的语境线索进行更高层次的情绪推理问题,而不仅仅是依赖显性的情感表达。解决方案的关键在于基于心智理论(Theory-of-Mind, ToM)框架,利用大规模语言模型(Large Language Models, LLMs)分析上下文信息以推断他人情绪状态,并通过认知评价理论构建专门的ToM评估数据集,以同时评估从语境到情绪的正向推理和从情绪到推断语境的反向推理能力。

链接: https://arxiv.org/abs/2506.00334
作者: Gerard Christopher Yeo,Kokil Jaidka
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Datasets used for emotion recognition tasks typically contain overt cues that can be used in predicting the emotions expressed in a text. However, one challenge is that texts sometimes contain covert contextual cues that are rich in affective semantics, which warrant higher-order reasoning abilities to infer emotional states, not simply the emotions conveyed. This study advances beyond surface-level perceptual features to investigate how large language models (LLMs) reason about others’ emotional states using contextual information, within a Theory-of-Mind (ToM) framework. Grounded in Cognitive Appraisal Theory, we curate a specialized ToM evaluation dataset1 to assess both forward reasoning - from context to emotion- and backward reasoning - from emotion to inferred context. We showed that LLMs can reason to a certain extent, although they are poor at associating situational outcomes and appraisals with specific emotions. Our work highlights the need for psychological theories in the training and evaluation of LLMs in the context of emotion reasoning.
zh

[NLP-244] Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

【速读】: 该论文试图解决缺乏公开可用的、经过作者标注且适用于建模人类对话和关系的语码混用(code-mixing)语料库的问题。解决方案的关键在于构建一个首个经过标注且通用的语料库,通过持续收集、验证和整合语码混用消息,形成结构化的JSON格式数据集,并附有详细的元数据和语言统计信息,从而为计算语言学、社会语言学及自然语言处理(NLP)应用提供基础数据支持。

链接: https://arxiv.org/abs/2506.00332
作者: Svetlana Churina,Akshat Gupta,Insyirah Mujtahid,Kokil Jaidka
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.
zh

[NLP-245] reeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering

【速读】: 该论文旨在解决复杂且知识密集型问题在问答任务中的处理难题,这类问题通常需要模型识别问题的多维特性并跨多个信息源进行推理。现有方法在迭代和自适应检索框架中受限于推理错误的累积和检索结果的不匹配。论文提出的解决方案关键在于TreeRare框架,该框架利用语法树(syntax tree)引导信息检索与推理过程,通过自底向上的遍历方式,在每个节点生成基于子组件的查询并检索相关段落以解决局部不确定性,最终整合树结构中的证据形成最终答案。

链接: https://arxiv.org/abs/2506.00331
作者: Boyi Zhang,Zhuo Liu,Hangfeng He
机构: University of Rochester (罗切斯特大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In real practice, questions are typically complex and knowledge-intensive, requiring Large Language Models (LLMs) to recognize the multifaceted nature of the question and reason across multiple information sources. Iterative and adaptive retrieval, where LLMs decide when and what to retrieve based on their reasoning, has been shown to be a promising approach to resolve complex, knowledge-intensive questions. However, the performance of such retrieval frameworks is limited by the accumulation of reasoning errors and misaligned retrieval results. To overcome these limitations, we propose TreeRare (Syntax Tree-Guided Retrieval and Reasoning), a framework that utilizes syntax trees to guide information retrieval and reasoning for question answering. Following the principle of compositionality, TreeRare traverses the syntax tree in a bottom-up fashion, and in each node, it generates subcomponent-based queries and retrieves relevant passages to resolve localized uncertainty. A subcomponent question answering module then synthesizes these passages into concise, context-aware evidence. Finally, TreeRare aggregates the evidence across the tree to form a final answer. Experiments across five question answering datasets involving ambiguous or multi-hop reasoning demonstrate that TreeRare achieves substantial improvements over existing state-of-the-art methods.
zh

[NLP-246] Dyna-Think: Synergizing Reasoning Acting and World Model Simulation in AI Agents

【速读】: 该论文试图解决长周期AI代理任务中有效行为与缺失行为不明确的问题,以及如何提升AI代理在复杂任务中的性能。其解决方案的关键在于提出Dyna-Think框架,该框架通过将规划、内部世界模型与推理和行动相结合,以增强AI代理的性能。Dyna-Think的核心技术包括Dyna-Think Imitation Learning (DIT) 和 Dyna-Think Dyna Training (DDT),其中DIT通过重建R1的思考过程来优化世界模型模拟,而DDT则通过两阶段训练流程分别提升代理的世界建模能力和行动策略。

链接: https://arxiv.org/abs/2506.00320
作者: Xiao Yu,Baolin Peng,Ruize Xu,Michel Galley,Hao Cheng,Suman Nath,Jianfeng Gao,Zhou Yu
机构: Columbia University (哥伦比亚大学); Microsoft Research (微软研究院); Arklex.ai (Arklex.ai)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agents tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent’s world modeling ability via objectives such as state prediction or critique generation, and then improve the agent’s action via policy training. We evaluate our methods on OSWorld, and demonstrate that Dyna-Think improves the agent’s in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x less tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective to improve policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.
zh

[NLP-247] SkillVerse : Assessing and Enhancing LLM s with Tree Evaluation ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂、多方面任务中评估不足的问题,通过引入一种细粒度的技能特定评估方法来提升模型能力的理解。其解决方案的关键在于提出SkillVerse,这是一个基于无监督树状结构诊断框架,利用LLM作为评判者对模型输出进行批判性分析,并将其组织为称为树状图(dendrogram)的分层结构,从而灵活地揭示模型在不同粒度层级上的行为特征。

链接: https://arxiv.org/abs/2506.00319
作者: Yufei Tian,Jiao Sun,Nanyun Peng,Zizhao Zhang
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校); Google DeepMind (谷歌深度思维); Google Cloud AI (谷歌云人工智能)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025

点击查看摘要

Abstract:As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed dendrogram. Given proficiency at arbitrary levels of granularity, SkillVerse is flexible to produce insights of behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.
zh

[NLP-248] An evaluation of LLM s for generating movie reviews: GPT -4o Gemini-2.0 and DeepSeek -V3

【速读】: 该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)生成高质量的电影评论问题,特别是评估不同LLMs在生成电影评论时的表现及其与IMDb用户评论的差异。解决方案的关键在于构建一个基于三个LLMs(GPT-4o、DeepSeek-V3和Gemini-2.0)的框架,并通过对比分析生成的评论与真实用户评论在词汇、情感极性、相似性和主题一致性等方面的差异,以评估其生成质量。研究还通过用户调查验证了生成评论的可辨识性,揭示了各模型在情感表达和风格一致性上的优劣。

链接: https://arxiv.org/abs/2506.00312
作者: Brendan Sands,Yining Wang,Chenhao Xu,Yuxuan Zhou,Lai Wei,Rohitash Chandra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been prominent in various tasks, including text generation and summarisation. The applicability of LLMs to the generation of product reviews is gaining momentum, paving the way for the generation of movie reviews. In this study, we propose a framework that generates movie reviews using three LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0), and evaluate their performance by comparing the generated outputs with IMDb user reviews. We use movie subtitles and screenplays as input to the LLMs and investigate how they affect the quality of reviews generated. We review the LLM-based movie reviews in terms of vocabulary, sentiment polarity, similarity, and thematic consistency in comparison to IMDB user reviews. The results demonstrate that LLMs are capable of generating syntactically fluent and structurally complete movie reviews. Nevertheless, there is still a noticeable gap in emotional richness and stylistic coherence between LLM-generated and IMDb reviews, suggesting that further refinement is needed to improve the overall quality of movie review generation. We provided a survey-based analysis where participants were told to distinguish between LLM and IMDb user reviews. The results show that LLM-generated reviews are difficult to distinguish from IMDB user reviews. We found that DeepSeek-V3 produced the most balanced reviews, closely matching IMDb reviews. GPT-4o overemphasised positive emotions, while Gemini-2.0 captured negative emotions better but showed excessive emotional intensity.
zh

[NLP-249] MythTriage: Scalable Detection of Opioid Use Disorder Myths on a Video-Sharing Platform

【速读】: 该论文旨在解决在线健康信息中虚假信息(misinformation)的广泛存在问题,特别是针对阿片类药物使用障碍(OUD)这一高风险但研究不足的主题。其关键解决方案是提出一种高效的分类流程——MythTriage,该流程结合了轻量级模型处理常规案例,并将复杂案例转交给高性能但成本较高的大语言模型(LLM),从而在保持较高准确率(最高达0.86宏F1分数)的同时,显著降低标注时间和成本。

链接: https://arxiv.org/abs/2506.00308
作者: Hayoung Jung,Shravika Mittal,Ananya Aatreya,Navreet Kaur,Munmun De Choudhury,Tanushree Mitra
机构: University of Washington (华盛顿大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 34 pages, 14 figures, 21 tables. In submission

点击查看摘要

Abstract:Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD)–a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely-used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
zh

[NLP-250] Lossless Token Sequence Compression via Meta-Tokens

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLM)中提示压缩的问题,现有方法多采用有损压缩技术,旨在保留与下游任务相关的语义信息的同时显著缩短序列长度。本文提出了一种与任务无关的无损压缩技术,其灵感来源于LZ77算法,能够在不丢失语义信息的前提下,平均将输入标记序列长度减少27%和18%。由于使用了基于Transformer的LLM,这相当于分别减少了47%和33%的编码计算量。该方法的关键在于实现了可逆的标记序列转换,且在需要严格保持语义/语法的任务中表现出优于现有有损压缩方法的性能。

链接: https://arxiv.org/abs/2506.00307
作者: John Harvill,Ziwei Fan,Hao Wang,Yizhou Sun,Hao Ding,Luke Huan,Anoop Deoras
机构: AWS AI Labs (AWS人工智能实验室); Amazon (亚马逊); Rutgers University (罗格斯大学); University of California Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:Existing work on prompt compression for Large Language Models (LLM) focuses on lossy methods that try to maximize the retention of semantic information that is relevant to downstream tasks while significantly reducing the sequence length. In this paper, we introduce a task-agnostic lossless compression technique similar to LZ77 that makes it possible to reduce the input token sequence length on average by 27% and 18% for the two evaluation tasks explored here. Given that we use transformer-based LLMs, this equates to 47% and 33% less encoding computation, respectively, due to the quadratic nature of attention. The token sequence transformation is trivial to reverse and highlights that no semantic information is lost in the process. We evaluate our proposed approach on two tasks that require strict preservation of semantics/syntax and demonstrate that existing lossy compression methods perform poorly in this setting. We find that our lossless compression technique produces only a small gap in performance compared to using the uncompressed input and posit that larger models and an expanded computing budget would likely erase the gap entirely.
zh

[NLP-251] Can LLM s Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLM s ACL2025

【速读】: 该论文试图解决无声肌电图(unvoiced electromyography, EMG)到文本的转换问题,特别是在缺乏配对有声和无声EMG信号及语音数据的情况下,如何实现有效的通信。解决方案的关键在于提出一种新颖的EMG适配模块,该模块能够将EMG特征映射到大型语言模型(large language models, LLM)的输入空间,从而使得LLM能够理解无声EMG信号,实现了在封闭词汇任务中平均词错误率(WER)为0.49的性能。

链接: https://arxiv.org/abs/2506.00304
作者: Payal Mohapatra,Akash Pandey,Xiaoyuan Zhang,Qi Zhu
机构: Northwestern University (西北大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 main conference

点击查看摘要

Abstract:Unvoiced electromyography (EMG) is an effective communication tool for individuals unable to produce vocal speech. However, most prior methods rely on paired voiced and unvoiced EMG signals, along with speech data, for EMG-to-text conversion, which is not practical for such individuals. Given the rise of large language models (LLMs) in speech recognition, we explore their potential to understand unvoiced speech. To this end, we address the challenge of learning from unvoiced EMG alone and propose a novel EMG adaptor module that maps EMG features into an LLM’s input space, achieving an average word error rate (WER) of 0.49 on a closed-vocabulary unvoiced EMG-to-text task. Even with a conservative data availability of just six minutes, our approach improves performance over specialized models by nearly 20%. While LLMs have been shown to be extendable to new language modalities – such as audio – understanding articulatory biosignals like unvoiced EMG remains more challenging. This work takes a crucial first step toward enabling LLMs to comprehend unvoiced speech using surface EMG.
zh

[NLP-252] DLM-One: Diffusion Language Models for One-Step Sequence Generation

【速读】: 该论文试图解决传统序列生成方法中需要迭代优化以提高生成质量所带来的计算效率低的问题,特别是针对基于连续扩散语言模型(DLM)的单步序列生成任务。解决方案的关键在于提出DLM-One框架,通过将学生模型输出的得分在连续token嵌入空间中与预训练教师DLM的得分函数对齐,从而实现无需迭代精调的高效生成。这一方法显著提升了采样效率,实验表明在保持竞争性性能的同时,推理时间可提升约500倍。

链接: https://arxiv.org/abs/2506.00290
作者: Tianqi Chen,Shujian Zhang,Mingyuan Zhou
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper introduces DLM-One, a score-distillation-based framework for one-step sequence generation with continuous diffusion language models (DLMs). DLM-One eliminates the need for iterative refinement by aligning the scores of a student model’s outputs in the continuous token embedding space with the score function of a pretrained teacher DLM. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling. Through comprehensive experiments on DiffuSeq – a representative continuous DLM – we show that DLM-One achieves up to ~500x speedup in inference time while maintaining competitive performance on benchmark text generation tasks used to evaluate the teacher models. We further analyze the method’s empirical behavior across multiple datasets, providing initial insights into its generality and practical applicability. Our findings position one-step diffusion as a promising direction for efficient, high-quality language generation and broader adoption of continuous diffusion models operating in embedding space for natural language processing.
zh

[NLP-253] Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation ACL2025

【速读】: 该论文试图解决在持续预训练(Continued Pretraining, CPT)过程中,将现有大语言模型(Large Language Models, LLMs)适配到新语言时,英语数据的作用及其对模型下游能力的影响问题。研究发现,虽然英语数据的加入不影响验证困惑度(perplexity),但对目标语言下游能力的出现至关重要。解决方案的关键在于引入一种与语言无关的上下文学习(In-Context Learning, ICL)基准,以揭示在未包含英语数据时CPT过程中出现的灾难性遗忘现象,并通过课程学习和权重的指数移动平均(Exponential Moving Average, EMA)作为有效替代方法,减少对英语数据的依赖。

链接: https://arxiv.org/abs/2506.00288
作者: Ahmed Elhady,Eneko Agirre,Mikel Artetxe
机构: HiTZ Center, University of the Basque Country (UPV/EHU); Reka AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in ACL 2025 Main

点击查看摘要

Abstract:Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early on CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light into the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.
zh

[NLP-254] Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings ACL2025

【速读】: 该论文旨在解决现有基于上下文的大语言模型嵌入在主题建模和聚类任务中存在扩展性差、依赖不透明的相似性度量以及在多语言环境下表现不佳的问题。其解决方案的关键在于提出一种新颖的可扩展、可解释、分层且支持多语言的聚类方法,通过训练多语言Matryoshka嵌入(Matryoshka embeddings)来在不同粒度层次上确定故事的相似性,并利用该嵌入的分层特性开发高效的分层聚类算法,从而有效识别新闻文章和社交媒体数据中的独特新闻故事、叙述和主题。

链接: https://arxiv.org/abs/2506.00277
作者: Hans W. A. Hanley,Zakir Durumeric
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Accepted to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

点击查看摘要

Abstract:Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson \rho = 0.816). Once trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
zh

[NLP-255] RoboMoRe: LLM -based Robot Co-design via Joint Optimization of Morphology and Reward

【速读】: 该论文试图解决机器人共设计(robot co-design)中由于使用固定奖励函数而导致的次优设计收敛问题,该问题限制了对适用于不同形态的多样化运动模式的探索。解决方案的关键在于提出一种基于大语言模型(LLM)的框架RoboMoRe,该框架通过整合形态与奖励塑造,在机器人共设计循环中实现联合优化。RoboMoRe采用双阶段优化策略:在粗粒度优化阶段,利用LLM驱动的多样性反射机制生成多样且高质量的形态-奖励配对;在细粒度优化阶段,通过交替的LLM引导奖励和形态梯度更新迭代优化候选方案,从而有效提升机器人形态及其适配运动行为的性能。

链接: https://arxiv.org/abs/2506.00276
作者: Jiawei Fang,Yuxuan Sun,Chengtian Ma,Qiuyu Lu,Lining Yao
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注: 30 pages, 13 figures

点击查看摘要

Abstract:Robot co-design, jointly optimizing morphology and control policy, remains a longstanding challenge in the robotics community, where many promising robots have been developed. However, a key limitation lies in its tendency to converge to sub-optimal designs due to the use of fixed reward functions, which fail to explore the diverse motion modes suitable for different morphologies. Here we propose RoboMoRe, a large language model (LLM)-driven framework that integrates morphology and reward shaping for co-optimization within the robot co-design loop. RoboMoRe performs a dual-stage optimization: in the coarse optimization stage, an LLM-based diversity reflection mechanism generates both diverse and high-quality morphology-reward pairs and efficiently explores their distribution. In the fine optimization stage, top candidates are iteratively refined through alternating LLM-guided reward and morphology gradient updates. RoboMoRe can optimize both efficient robot morphologies and their suited motion behaviors through reward shaping. Results demonstrate that without any task-specific prompting or predefined reward/morphology templates, RoboMoRe significantly outperforms human-engineered designs and competing methods across eight different tasks.
zh

[NLP-256] CASPER: A Large Scale Spontaneous Speech Dataset

【速读】: 该论文试图解决高质量自然口语数据稀缺的问题(spontaneous speech data scarcity),因为现有大多数数据集包含的是脚本化对话。解决方案的关键在于提出了一种新颖的流程,用于诱发和记录自然对话,并发布了包含200+小时自发语音的Stage 1数据集。该方法促进了流畅、自然的对话,鼓励多样化的话题和互动交流,与传统方法不同,它能够实现真实的交互,为未来数据收集提供可复现的框架。

链接: https://arxiv.org/abs/2506.00267
作者: Cihan Xiao,Ruixing Liang,Xiangyu Zhang,Mehmet Emre Tiryaki,Veronica Bae,Lavanya Shankar,Rong Yang,Ethan Poon,Emmanuel Dupoux,Sanjeev Khudanpur,Leibny Paola Garcia Perera
机构: Johns Hopkins University (约翰霍普金斯大学); Meta (元); Edison Academy Magnet School (爱迪生学院磁校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The success of large language models has driven interest in developing similar speech processing capabilities. However, a key challenge is the scarcity of high-quality spontaneous speech data, as most existing datasets contain scripted dialogues. To address this, we present a novel pipeline for eliciting and recording natural dialogues and release our Stage 1 dataset with 200+ hours of spontaneous speech. Our approach fosters fluid, natural conversations while encouraging a diverse range of topics and interactive exchanges. Unlike traditional methods, it facilitates genuine interactions, providing a reproducible framework for future data collection. This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data. We plan to expand this dataset in future stages, offering a growing resource for the research community.
zh

[NLP-257] MultiHoax: A Dataset of Multi-hop False-Premise Questions

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对包含错误前提的复杂多步骤推理任务时,难以准确检测虚假前提并进行可靠推理的问题。解决方案的关键在于引入MultiHoax基准,该基准通过涵盖七个不同国家和十个多样化知识类别的数据集,利用维基百科作为主要知识来源,评估LLMs在跨区域、跨知识领域以及多跳推理中的虚假前提检测能力,从而揭示当前先进LLMs在这一方面存在的不足,并推动其在多跳推理和虚假前提检测方面的改进。

链接: https://arxiv.org/abs/2506.00264
作者: Mohammadamin Shafiei,Hamidreza Saffari,Nafise Sadat Moosavi
机构: University of Milan (米兰大学); Politecnico di Milano (米兰理工学院); University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs’ ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.
zh

[NLP-258] GPR: Empowering Generation with Graph-Pretrained Retriever EMNLP’25

【速读】: 该论文旨在解决图检索增强生成(Graph Retrieval-Augmented Generation, GRAG)中对图专用检索器的高要求问题,现有检索器通常依赖于在纯文本上预训练的语言模型,由于领域不匹配和结构忽视导致效果受限。解决方案的关键在于提出GPR,一种直接在知识图谱上预训练的基于图的检索器,通过大语言模型(Large Language Model, LLM)引导的图增强对齐自然语言问题与相关子图,并采用结构感知的目标学习细粒度的检索策略。

链接: https://arxiv.org/abs/2506.00261
作者: Xiaochen Wang,Zongyu Wu,Yuan Zhong,Xiang Zhang,Suhang Wang,Fenglong Ma
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Short paper submitted to EMNLP’25

点击查看摘要

Abstract:Graph retrieval-augmented generation (GRAG) places high demands on graph-specific retrievers. However, existing retrievers often rely on language models pretrained on plain text, limiting their effectiveness due to domain misalignment and structure ignorance. To address these challenges, we propose GPR, a graph-based retriever pretrained directly on knowledge graphs. GPR aligns natural language questions with relevant subgraphs through LLM-guided graph augmentation and employs a structure-aware objective to learn fine-grained retrieval strategies. Experiments on two datasets, three LLM backbones, and five baselines show that GPR consistently improves both retrieval quality and downstream generation, demonstrating its effectiveness as a robust retrieval solution for GRAG.
zh

[NLP-259] he Impact of Disability Disclosure on Fairness and Bias in LLM -Driven Candidate Selection

【速读】: 该论文试图解决在使用生成式 AI (Generative AI) 进行招聘时,因候选人自愿披露的残疾信息而导致的潜在偏见问题。研究发现,当候选人的其他背景信息完全相同时,LLMs 更倾向于选择未披露残疾信息的候选人,而非明确声明自己无残疾的候选人,这表明模型可能存在隐性偏见。解决方案的关键在于识别并减少 LLM 在处理敏感个人信息时产生的不公平倾向,特别是在涉及残疾等敏感属性的情况下。

链接: https://arxiv.org/abs/2506.00256
作者: Mahammed Kamruzzaman,Gene Louis Kim
机构: University of South Florida (南佛罗里达大学)
类目: Computation and Language (cs.CL)
备注: Accepted at The 38th International FLAIRS Conference (FLAIRS 2025)(main)

点击查看摘要

Abstract:As large language models (LLMs) become increasingly integrated into hiring processes, concerns about fairness have gained prominence. When applying for jobs, companies often request/require demographic information, including gender, race, and disability or veteran status. This data is collected to support diversity and inclusion initiatives, but when provided to LLMs, especially disability-related information, it raises concerns about potential biases in candidate selection outcomes. Many studies have highlighted how disability can impact CV screening, yet little research has explored the specific effect of voluntarily disclosed information on LLM-driven candidate selection. This study seeks to bridge that gap. When candidates shared identical gender, race, qualifications, experience, and backgrounds, and sought jobs with minimal employment rate gaps between individuals with and without disabilities (e.g., Cashier, Software Developer), LLMs consistently favored candidates who disclosed that they had no disability. Even in cases where candidates chose not to disclose their disability status, the LLMs were less likely to select them compared to those who explicitly stated they did not have a disability.
zh

[NLP-260] Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race ACL2025

【速读】: 该论文试图解决价值对齐的语言模型(Value-Aligned Language Models, LMs)在显式偏见评估中表现公正,但在隐式词联想任务中仍表现出刻板印象的问题,这引发了对其公平使用的担忧。解决方案的关键在于发现对齐过程反而会增强模型输出中的隐式偏见,并提出一种新的偏见缓解策略,该策略通过激励模型在早期层中表征种族概念来提高对种族概念的敏感性,从而有效缓解隐式偏见。

链接: https://arxiv.org/abs/2506.00253
作者: Lihao Sun,Chengzhi Mao,Valentin Hofmann,Xuechunzi Bai
机构: University of Chicago (芝加哥大学); Rutgers University (罗格斯大学); Allen Institute for AI (艾伦人工智能研究所); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accpeted to ACL 2025 Main Conferencce

点击查看摘要

Abstract:Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.
zh

[NLP-261] PersianMedQA: Language-Centric Evaluation of LLM s in the Persian Medical Domain

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在高风险领域如医学中的可靠性问题,特别是在低资源语言环境下的表现不足。其解决方案的关键在于构建了一个大规模、专家验证的波斯语医学选择题数据集——PersianMedQA,用于评估LLMs在波斯语和英语中的表现,并通过对比不同类型的模型(通用模型、波斯语微调模型和医学专用模型)在零样本和思维链(Chain-of-Thought, CoT)设置下的性能,揭示模型规模与领域适应性之间的关系。

链接: https://arxiv.org/abs/2506.00250
作者: Mohammad Javad Ranjbar Kalahroodi,Amirhossein Sheikholselami,Sepehr Karimi,Sepideh Ranjbar Kalahroodi,Heshaam Faili,Azadeh Shakery
机构: 未知
类目: Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions, designed to evaluate LLMs across both Persian and English. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.3% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 35.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, Persian responses are sometimes more accurate due to cultural and clinical contextual cues. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating multilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset can be accessed at: this https URL](this https URL
zh

[NLP-262] MIR: Methodology Inspiration Retrieval for Scientific Research Problems ACL2025

【速读】: 该论文试图解决在科学发现过程中如何有效检索能够为当前研究问题提供方法灵感的先前工作的问题,即Methodology Inspiration Retrieval (MIR)。解决方案的关键在于构建Methodology Adjacency Graph (MAG),通过引用关系捕捉方法学的传承脉络,并将其作为“直观先验”嵌入到密集检索器中,从而识别超越表面语义相似性的方法学启发模式,显著提升了检索效果。

链接: https://arxiv.org/abs/2506.00249
作者: Aniketh Garikaparthi,Manasi Patwardhan,Aditya Sanjiv Kanade,Aman Hassan,Lovekesh Vig,Arman Cohan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:There has been a surge of interest in harnessing the reasoning capabilities of Large Language Models (LLMs) to accelerate scientific discovery. While existing approaches rely on grounding the discovery process within the relevant literature, effectiveness varies significantly with the quality and nature of the retrieved literature. We address the challenge of retrieving prior work whose concepts can inspire solutions for a given research problem, a task we define as Methodology Inspiration Retrieval (MIR). We construct a novel dataset tailored for training and evaluating retrievers on MIR, and establish baselines. To address MIR, we build the Methodology Adjacency Graph (MAG); capturing methodological lineage through citation relationships. We leverage MAG to embed an “intuitive prior” into dense retrievers for identifying patterns of methodological inspiration beyond superficial semantic similarity. This achieves significant gains of +5.4 in Recall@3 and +7.8 in Mean Average Precision (mAP) over strong baselines. Further, we adapt LLM-based re-ranking strategies to MIR, yielding additional improvements of +4.5 in Recall@3 and +4.8 in mAP. Through extensive ablation studies and qualitative analyses, we exhibit the promise of MIR in enhancing automated scientific discovery and outline avenues for advancing inspiration-driven retrieval.
zh

[NLP-263] Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中幻觉问题的检测难题,特别是传统语义熵(Semantic Entropy, SE)在处理现代LLMs生成的长句回复时效果下降的问题。其关键解决方案是提出一种受最近邻熵估计启发的简单黑盒不确定性量化方法,该方法通过考虑簇内相似性(intra-cluster similarity)和簇间相似性(inter-cluster similarity)来改进不确定性评估,并且可以轻松扩展到白盒场景中。

链接: https://arxiv.org/abs/2506.00245
作者: Dang Nguyen,Ali Payani,Baharan Mirzasoleiman
机构: UCLA CS (UCLA 计算机科学系); Cisco Systems Inc. (思科系统公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 4 figures, 6 tables, link: this https URL

点击查看摘要

Abstract:Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address these limitations, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at this https URL.
zh

[NLP-264] Whispers of Many Shores: Cultural Alignment through Collaborative Cultural Expertise

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨文化应用场景中缺乏细腻文化理解与适应性的问题,传统方法通常需要高昂成本的全量微调。其解决方案的关键在于提出一种新颖的软提示微调框架,通过向量化的提示微调动态地将查询路由至由优化软提示嵌入生成的、具有文化专长的“专家”LLM配置,而无需修改基础模型参数,从而实现高效且模块化的文化对齐。

链接: https://arxiv.org/abs/2506.00242
作者: Shuai Feng,Wei-Chuang Chan,Srishti Chouhan,Junior Francisco Garcia Ayala,Srujananjali Medicherla,Kyle Clark,Mingwei Shi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 main pages;8 page appendix

点击查看摘要

Abstract:The integration of large language models (LLMs) into global applications necessitates effective cultural alignment for meaningful and culturally-sensitive interactions. Current LLMs often lack the nuanced understanding required for diverse cultural contexts, and adapting them typically involves costly full fine-tuning. To address this, we introduce a novel soft prompt fine-tuning framework that enables efficient and modular cultural alignment. Our method utilizes vectorized prompt tuning to dynamically route queries to a committee of culturally specialized ‘expert’ LLM configurations, created by optimizing soft prompt embeddings without altering the base model’s parameters. Extensive experiments demonstrate that our framework significantly enhances cultural sensitivity and adaptability, improving alignment scores from 0.208 to 0.820, offering a robust solution for culturally-aware LLM deployment. This research paves the way for subsequent investigations into enhanced cultural coverage and dynamic expert adaptation, crucial for realizing autonomous AI with deeply nuanced understanding in a globally interconnected world.
zh

[NLP-265] ZeShot-VQA: Zero-Shot Visual Question Answering Framework with Answer Mapping for Natural Disaster Damage Assessment

【速读】: 该论文试图解决传统视觉问答(VQA)模型在应对自然灾害后损毁评估时存在的局限性,即模型无法回答开放性问题,仅能从预定义答案列表中选择最佳答案,且当需要处理新问题或新答案时,需进行微调或重新训练,耗时耗力。解决方案的关键在于提出一种基于视觉-语言模型(VLM)的零样本视觉问答(ZeShot-VQA)方法,该方法利用零样本学习能力,无需微调即可应用于新数据集,并能够生成训练过程中未见过的答案,从而提升模型的灵活性和适用性。

链接: https://arxiv.org/abs/2506.00238
作者: Ehsan Karimi,Maryam Rahnemoonfar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted by the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)

点击查看摘要

Abstract:Natural disasters usually affect vast areas and devastate infrastructures. Performing a timely and efficient response is crucial to minimize the impact on affected communities, and data-driven approaches are the best choice. Visual question answering (VQA) models help management teams to achieve in-depth understanding of damages. However, recently published models do not possess the ability to answer open-ended questions and only select the best answer among a predefined list of answers. If we want to ask questions with new additional possible answers that do not exist in the predefined list, the model needs to be fin-tuned/retrained on a new collected and annotated dataset, which is a time-consuming procedure. In recent years, large-scale Vision-Language Models (VLMs) have earned significant attention. These models are trained on extensive datasets and demonstrate strong performance on both unimodal and multimodal vision/language downstream tasks, often without the need for fine-tuning. In this paper, we propose a VLM-based zero-shot VQA (ZeShot-VQA) method, and investigate the performance of on post-disaster FloodNet dataset. Since the proposed method takes advantage of zero-shot learning, it can be applied on new datasets without fine-tuning. In addition, ZeShot-VQA is able to process and generate answers that has been not seen during the training procedure, which demonstrates its flexibility.
zh

[NLP-266] Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

【速读】: 该论文试图解决传统参数高效微调(Parameter-efficient fine-tuning, PEFT)方法,如LoRA,在处理预训练权重更新时依赖全局低秩结构而导致的空间模式信息丢失问题。解决方案的关键在于提出一种名为Localized LoRA的框架,该框架将权重更新建模为作用于权重矩阵结构化块上的低秩矩阵的组合,从而在不增加可训练参数总数的前提下实现密集且局部化的参数更新。

链接: https://arxiv.org/abs/2506.00236
作者: Babak Barazandeh
机构: University of Southern California (南加州大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, offer compact and effective alternatives to full model fine-tuning by introducing low-rank updates to pretrained weights. However, most existing approaches rely on global low-rank structures, which can overlook spatial patterns spread across the parameter space. In this work, we propose Localized LoRA, a generalized framework that models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix. This formulation enables dense, localized updates throughout the parameter space-without increasing the total number of trainable parameters. We provide a formal comparison between global, diagonal-local, and fully localized low-rank approximations, and show that our method consistently achieves lower approximation error under matched parameter budgets. Experiments on both synthetic and practical settings demonstrate that Localized LoRA offers a more expressive and adaptable alternative to existing methods, enabling efficient fine-tuning with improved performance.
zh

[NLP-267] MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility

【速读】: 该论文旨在解决医疗决策中人工智能系统适应性差与缺乏专业知识支撑的问题,现有系统要么是任务特定模型适应性有限,要么是通用语言模型缺乏与专业领域知识和工具的结合。其解决方案的关键在于提出MedOrch框架,该框架通过模块化、基于代理的架构灵活集成多个专业工具和推理代理,实现对多模态医疗数据的推理驱动处理,并确保推理过程的透明性和可追溯性,从而提升临床决策支持的准确性和可信度。

链接: https://arxiv.org/abs/2506.00235
作者: Yexiao He,Ang Li,Boyi Liu,Zhewei Yao,Yuxiong He
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Healthcare decision-making represents one of the most challenging domains for Artificial Intelligence (AI), requiring the integration of diverse knowledge sources, complex reasoning, and various external analytical tools. Current AI systems often rely on either task-specific models, which offer limited adaptability, or general language models without grounding with specialized external knowledge and tools. We introduce MedOrch, a novel framework that orchestrates multiple specialized tools and reasoning agents to provide comprehensive medical decision support. MedOrch employs a modular, agent-based architecture that facilitates the flexible integration of domain-specific tools without altering the core system. Furthermore, it ensures transparent and traceable reasoning processes, enabling clinicians to meticulously verify each intermediate step underlying the system’s recommendations. We evaluate MedOrch across three distinct medical applications: Alzheimer’s disease diagnosis, chest X-ray interpretation, and medical visual question answering, using authentic clinical datasets. The results demonstrate MedOrch’s competitive performance across these diverse medical tasks. Notably, in Alzheimer’s disease diagnosis, MedOrch achieves an accuracy of 93.26%, surpassing the state-of-the-art baseline by over four percentage points. For predicting Alzheimer’s disease progression, it attains a 50.35% accuracy, marking a significant improvement. In chest X-ray analysis, MedOrch exhibits superior performance with a Macro AUC of 61.2% and a Macro F1-score of 25.5%. Moreover, in complex multimodal visual question answering (Image+Table), MedOrch achieves an accuracy of 54.47%. These findings underscore MedOrch’s potential to advance healthcare AI by enabling reasoning-driven tool utilization for multimodal medical data processing and supporting intricate cognitive tasks in clinical decision-making.
zh

[NLP-268] ComposeRAG : A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering

【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统中因模块耦合导致的可解释性差、系统评估困难及针对性优化受限的问题,尤其是在复杂多跳问答任务中的表现不足。其解决方案的关键在于提出ComposeRAG,一种模块化抽象框架,将RAG流程分解为原子且可组合的模块,如问题分解、查询重写、检索决策和答案验证等,每个模块作为结构化输入/输出的参数化转换,支持独立实现、升级与分析,并引入自反思机制以提升多步推理的鲁棒性。

链接: https://arxiv.org/abs/2506.00232
作者: Ruofan Wu,Youngwon Lee,Fan Shu,Danmei Xu,Seung-won Hwang,Zhewei Yao,Yuxiong He,Feng Yan
机构: University of Houston (休斯顿大学); Seoul National University (首尔国立大学); Snowflake AI Research (雪flake AI 研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi-step reasoning, ComposeRAG incorporates a self-reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi-hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine-tuning-based methods and up to a 5% gain over reasoning-specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG’s capacity to deliver flexible, transparent, scalable, and high-performing multi-hop reasoning with improved grounding and interpretability.
zh

[NLP-269] REIC: RAG -Enhanced Intent Classification at Scale

【速读】: 该论文旨在解决客户服务中心中意图分类(intent classification)在面对产品线扩展时所面临的可扩展性问题,这些问题包括意图数量的增加以及不同垂直领域间分类体系的差异。论文提出的解决方案的关键在于REIC(Retrieval-augmented generation Enhanced Intent Classification),该方法利用检索增强生成(RAG)技术动态整合相关知识,从而实现精确的意图分类,而无需频繁重新训练模型。

链接: https://arxiv.org/abs/2506.00210
作者: Ziji Zhang,Michael Yang,Zhiyu Chen,Yingying Zhuang,Shu-Ting Pi,Qun Liu,Rajashekar Maragoud,Vy Nguyen,Anurag Beniwal
机构: Amazon.com, Inc.(亚马逊公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.
zh

[NLP-270] Structure-Aware Fill-in-the-Middle Pretraining for Code

【速读】: 该论文试图解决现有代码大语言模型(Code LLM)在预训练过程中将代码视为普通文本并随机遮蔽字符片段导致的上下文不连贯问题,从而影响模型对代码结构和常见代码编辑模式(如代码块、表达式或函数)的理解与生成。解决方案的关键在于提出AST-FIM,一种利用抽象语法树(Abstract Syntax Tree, AST)来大规模遮蔽完整语法结构的预训练策略,以生成更符合通用代码结构和实际代码编辑习惯的训练样本。

链接: https://arxiv.org/abs/2506.00204
作者: Linyuan Gong,Alvin Cheung,Mostafa Elhoushi,Sida Wang
机构: UC Berkeley (加州大学伯克利分校); FAIR at Meta (Meta人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 14 pages

点击查看摘要

Abstract:Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM benchmarks. Our code is publicly available at this https URL.
zh

[NLP-271] Structuring Radiology Reports: Challenging LLM s with Lightweight Models

【速读】: 该论文试图解决放射科报告在临床决策中的标准化格式缺失问题,这限制了人类可读性和机器学习应用。其解决方案的关键在于探索轻量级编码器-解码器模型(如T5和BERT2BERT),这些模型参数量为300M,相较于大型语言模型(LLMs)具有更低的计算需求、更高的透明度和更好的数据隐私保护,从而在资源受限的医疗环境中实现可持续且隐私友好的临床文本结构化处理。

链接: https://arxiv.org/abs/2506.00200
作者: Johannes Moll,Louisa Fay,Asfandyar Azhar,Sophie Ostmeier,Tim Lueth,Sergios Gatidis,Curtis Langlotz,Jean-Benoit Delbrouck
机构: Stanford University (斯坦福大学); Technical University of Munich (慕尼黑工业大学); Carnegie Mellon University (卡内基梅隆大学); HOPPR (未知)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (300M parameters)-specifically T5 and BERT2BERT-for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B-70B), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.
zh

[NLP-272] Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在处理潜在有害输入时过于保守,导致安全性和用户体验之间的权衡问题。其解决方案的关键在于采用部分合规(partial compliance)策略,即在不提供具体操作细节的情况下提供一般性信息,从而显著降低用户的负面感知,相较于直接拒绝回答,可减少超过50%的负面反馈。研究还指出,现有模型较少自然采用该策略,且奖励模型对部分合规的评分较低,因此有效防护机制应侧重于设计合理的拒绝回应,而非依赖意图检测。

链接: https://arxiv.org/abs/2506.00195
作者: Mingqian Zheng,Wenjia Hu,Patrick Zhao,Motahhare Eslami,Jena D. Hwang,Faeze Brahman,Carolyn Rose,Maarten Sap
机构: Carnegie Mellon University (卡内基梅隆大学); Pareto.ai (Pareto.ai); Simon Fraser University (西蒙弗雷泽大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance – providing general information without actionable details – emerges as the optimal strategy, reducing negative user perceptions by over 50% to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
zh

[NLP-273] Control-R: Towards controllable test-time scaling

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在长链式思维(Long Chain-of-Thought, CoT)推理过程中存在的“思考不足”和“过度思考”问题。其解决方案的关键在于引入一种名为推理控制场(Reasoning Control Fields, RCF)的新颖测试时方法,该方法从树搜索的角度注入结构化的控制信号,以引导推理过程。RCF使模型能够在解决复杂任务时根据给定的控制条件调整推理努力程度,从而实现对长CoT推理过程的有效控制。

链接: https://arxiv.org/abs/2506.00189
作者: Di Zhang,Weida Wang,Junxian Li,Xunzhi Wang,Jiatong Li,Jianbo Wu,Jingdi Lei,Haonan He,Peng Ye,Shufei Zhang,Wanli Ouyang,Yuqiang Li,Dongzhan Zhou
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tongji University (同济大学); Shanghai Jiaotong University (上海交通大学); Nankai University (南开大学); Hong Kong Polytechnic University (香港理工大学); University of California, Merced (加州大学默塞德分校); University of Science and Technology of China (中国科学技术大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper target in addressing the challenges of underthinking and overthinking in long chain-of-thought (CoT) reasoning for Large Reasoning Models (LRMs) by introducing Reasoning Control Fields (RCF)–a novel test-time approach that injects structured control signals to guide reasoning from a tree search perspective. RCF enables models to adjust reasoning effort according to given control conditions when solving complex tasks. Additionally, we present the Control-R-4K dataset, which consists of challenging problems annotated with detailed reasoning processes and corresponding control fields. To further enhance reasoning control, we propose a Conditional Distillation Finetuning (CDF) method, which trains model–particularly Control-R-32B–to effectively adjust reasoning effort during test time. Experimental results on benchmarks such as AIME2024 and MATH500 demonstrate that our approach achieves state-of-the-art performance at the 32B scale while enabling a controllable Long CoT reasoning process (L-CoT). Overall, this work introduces an effective paradigm for controllable test-time scaling reasoning.
zh

[NLP-274] Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

【速读】: 该论文旨在解决现有AI安全机制(如安全阈值模型和对齐训练)在推理效率与开发灵活性之间的权衡问题。其解决方案的关键在于提出解耦安全适配器(Disentangled Safety Adapters, DSA),通过将安全相关的计算从任务优化的基础模型中解耦,实现安全功能的多样化与灵活性,同时对推理成本影响最小。DSA利用轻量级适配器,借助基础模型的内部表示,实现了高效且可动态调整的安全对齐与防护机制。

链接: https://arxiv.org/abs/2506.00166
作者: Kundan Krishna,Joseph Y Cheng,Charles Maalouf,Leon A Gatys
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 2 figures, including references and appendix

点击查看摘要

Abstract:Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework addressing these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model’s internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost. Empirically, DSA-based safety guardrails substantially outperform comparably sized standalone models, notably improving hallucination detection (0.88 vs. 0.61 AUC on Summedits) and also excelling at classifying hate speech (0.98 vs. 0.92 on ToxiGen) and unsafe model inputs and responses (0.93 vs. 0.90 on AEGIS2.0 BeaverTails). Furthermore, DSA-based safety alignment allows dynamic, inference-time adjustment of alignment strength and a fine-grained trade-off between instruction following performance and model safety. Importantly, combining the DSA safety guardrail with DSA safety alignment facilitates context-dependent alignment strength, boosting safety on StrongReject by 93% while maintaining 98% performance on MTBench – a total reduction in alignment tax of 8 percentage points compared to standard safety alignment fine-tuning. Overall, DSA presents a promising path towards more modular, efficient, and adaptable AI safety and alignment.
zh

[NLP-275] Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement

【速读】: 该论文试图解决在基于生成式 AI (Generative AI) 的社交推理游戏(如狼人杀)中,如何提升人类玩家的参与度与交互体验的问题。现有研究通过微调、高级提示工程或额外的经验池来实现更具吸引力的文本格式游戏体验,而本文提出了一种新颖且简洁的基于大型语言模型(Large Language Models, LLMs)的狼人杀游戏系统,其关键在于优化文本转语音(Text-to-Speech, TTS)模型以增强与不同LLM模型的兼容性,并提升用户参与度,同时认为随着LLM推理能力的增强,额外组件将不再是必要。

链接: https://arxiv.org/abs/2506.00160
作者: Qihui Fan,Enfu Nan,Wenbo Li,Lei Lu,Pu Zhao,Yanzhi Wang
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing popularity of social deduction game systems for both business applications and AI research has greatly benefited from the rapid advancements in Large Language Models (LLMs), which now demonstrate stronger reasoning and persuasion capabilities. Especially with the raise of DeepSeek R1 and V3 models, LLMs should enable a more engaging experience for human players in LLM-agent-based social deduction games like Werewolf. Previous works either fine-tuning, advanced prompting engineering, or additional experience pool to achieve engaging text-format Werewolf game experience. We propose a novel yet straightforward LLM-based Werewolf game system with tuned Text-to-Speech(TTS) models designed for enhanced compatibility with various LLM models, and improved user engagement. We argue with ever enhancing LLM reasoning, extra components will be unnecessary in the case of Werewolf.
zh

[NLP-276] Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry

【速读】: 该论文试图解决梵语(Sanskrit)自动语音识别(ASR)系统开发中的挑战,特别是在其诗歌形式中捕捉复杂的韵律和节奏特征的问题。解决方案的关键在于构建了一个名为Vedavani的首个专注于吠陀诗歌的全面ASR研究框架,其中包括一个包含54小时标注音频样本的梵语ASR数据集,该数据集来源于《梨俱吠陀》和《阿闼婆吠陀》,能够准确反映语言的韵律和节奏特征,并在多种先进的多语言语音模型上进行了基准测试,最终发现IndicWhisper在这些模型中表现最佳。

链接: https://arxiv.org/abs/2506.00145
作者: Sujeet Kumar,Pretam Ray,Abhinay Beerukuri,Shrey Kamoji,Manoj Balaji Jagadeeshan,Pawan Goyal
机构: Indian Institute of Technology, Kharagpur (印度理工学院,卡哈格尔普尔分校); B P Mandal College of Engineering, Madhepura (B·P·曼达尔工程学院,马德赫普拉)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Sanskrit, an ancient language with a rich linguistic heritage, presents unique challenges for automatic speech recognition (ASR) due to its phonemic complexity and the phonetic transformations that occur at word junctures, similar to the connected speech found in natural conversations. Due to these complexities, there has been limited exploration of ASR in Sanskrit, particularly in the context of its poetic verses, which are characterized by intricate prosodic and rhythmic patterns. This gap in research raises the question: How can we develop an effective ASR system for Sanskrit, particularly one that captures the nuanced features of its poetic form? In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. We present a 54-hour Sanskrit ASR dataset, consisting of 30,779 labelled audio samples from the Rig Veda and Atharva Veda. This dataset captures the precise prosodic and rhythmic features that define the language. We also benchmark the dataset on various state-of-the-art multilingual speech models. ^1 Experimentation revealed that IndicWhisper performed the best among the SOTA models.
zh

[NLP-277] LaMP-QA: A Benchmark for Personalized Long-form Question Answering

【速读】: 该论文试图解决用户导向的问答系统中个性化回答生成研究相对不足的问题,主要原因是缺乏用于训练和评估个性化问答系统的资源。解决方案的关键是引入LaMP-QA——一个用于评估个性化长文本回答生成的基准数据集,该数据集覆盖了艺术与娱乐、生活方式与个人发展、社会与文化三大类别,共计45个子类别,并通过全面的人工和自动评估,验证了该基准的有效性及个性化上下文对模型性能的提升作用,实验结果显示引入个性化上下文可使性能提升高达39%。

链接: https://arxiv.org/abs/2506.00137
作者: Alireza Salemi,Hamed Zamani
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Personalization is essential for question answering systems that are user-centric. Despite its importance, personalization in answer generation has been relatively underexplored. This is mainly due to lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA – a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts Entertainment, (2) Lifestyle Personal Development, and (3) Society Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of the LaMP-QA benchmark for personalized question answering, we conduct comprehensive human and automatic evaluations, to compare multiple evaluation strategies for evaluating generated personalized responses and measure their alignment with human preferences. Furthermore, we benchmark a number of non-personalized and personalized approaches based on open-source and proprietary large language models (LLMs). Our results show that incorporating the personalized context provided leads to performance improvements of up to 39%. The benchmark is publicly released to support future research in this area.
zh

[NLP-278] Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models

【速读】: 该论文试图解决从临床文本中提取社会决定因素(Social Determinants of Health, SDOH)时,大型语言模型(Large Language Models, LLMs)可能依赖表面线索导致错误预测的问题。解决方案的关键在于通过评估如提示工程和思维链推理等缓解策略,以减少虚假阳性结果,并提升LLMs在医疗领域中的可靠性。

链接: https://arxiv.org/abs/2506.00134
作者: Fardin Ahsan Sakib,Ziwei Zhu,Karen Trister Grace,Meliha Yetisgen,Ozlem Uzuner
机构: George Mason University (乔治梅森大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Social determinants of health (SDOH) extraction from clinical text is critical for downstream healthcare analytics. Although large language models (LLMs) have shown promise, they may rely on superficial cues leading to spurious predictions. Using the MIMIC portion of the SHAC (Social History Annotation Corpus) dataset and focusing on drug status extraction as a case study, we demonstrate that mentions of alcohol or smoking can falsely induce models to predict current/past drug use where none is present, while also uncovering concerning gender disparities in model performance. We further evaluate mitigation strategies - such as prompt engineering and chain-of-thought reasoning - to reduce these false positives, providing insights into enhancing LLM reliability in health domains.
zh

[NLP-279] Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards

【速读】: 该论文试图解决在非可验证任务(如创意写作和开放性对话)中,由于质量评估主观且缺乏明确参考,现有基于标量奖励模型的方法存在泛化能力有限和容易出现奖励黑客(reward hacking)的问题。其解决方案的关键在于提出一种统一的基于可验证奖励的强化学习(RLVR)训练范式,核心包括基于写作原则的成对生成奖励模型(GenRM)和一种新的自举相对策略优化(BRPO)算法。GenRM通过自我原则批判将主观评估转化为可验证奖励,而BRPO则通过在强化学习训练过程中利用组内滚动中的自举响应作为临时参考,实现动态、无参考的成对比较。

链接: https://arxiv.org/abs/2506.00103
作者: Xun Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.
zh

[NLP-280] Childrens Voice Privacy: First Steps And Emerging Challenges INTERSPEECH2025

【速读】: 该论文试图解决儿童在语音技术中代表性不足以及隐私保护不足的问题,特别是针对儿童语音的匿名化技术研究较少。其解决方案的关键在于评估适用于成人语音的匿名化技术在儿童语音上的适用性,并建立相关基准。研究通过三个儿童数据集、六种匿名化方法以及客观和主观的效用指标进行分析,揭示了现有成人语音匿名化系统在保护儿童语音隐私方面的有效性,但同时也指出了其在语音质量上的显著退化问题。

链接: https://arxiv.org/abs/2506.00100
作者: Ajinkya Kulkarni,Francisco Teixeira,Enno Hermann,Thomas Rolland,Isabel Trancoso,Mathew Magimai Doss
机构: Idiap Research Institute (Idiap 研究所); INESC-ID, Lisbon, Portugal (INESC-ID,里斯本,葡萄牙); Instituto Superior Técnico (高等技术研究所)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted at Interspeech 2025, Netherlands

点击查看摘要

Abstract:Children are one of the most under-represented groups in speech technologies, as well as one of the most vulnerable in terms of privacy. Despite this, anonymization techniques targeting this population have received little attention. In this study, we seek to bridge this gap, and establish a baseline for the use of voice anonymization techniques designed for adult speech when applied to children’s voices. Such an evaluation is essential, as children’s speech presents a distinct set of challenges when compared to that of adults. This study comprises three children’s datasets, six anonymization methods, and objective and subjective utility metrics for evaluation. Our results show that existing systems for adults are still able to protect children’s voice privacy, but suffer from much higher utility degradation. In addition, our subjective study displays the challenges of automatic evaluation methods for speech quality in children’s speech, highlighting the need for further research.
zh

[NLP-281] ClinBench-HPB: A Clinical Benchmark for Evaluating LLM s in Hepato-Pancreato-Biliary Diseases

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在肝胆胰脾(Hepato-pancreato-biliary, HPB)疾病领域诊断能力不足的问题,特别是现有评估基准缺乏HPB疾病覆盖和临床实际案例。其解决方案的关键在于系统性地构建了一个涵盖3,535道封闭式多选题和337个开放式真实诊断案例的HPB疾病评估基准(ClinBench-HBP),该基准覆盖了国际疾病分类第十版(ICD-10)中定义的所有33个主要类别和465个子类别,从而为评估LLMs在复杂临床场景下的诊断能力提供了更贴近实际的测试环境。

链接: https://arxiv.org/abs/2506.00095
作者: Yuchong Li,Xiaojun Zeng,Chihua Fang,Jian Yang,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); Shenzhen Institutes of Advanced Technology, CAS (中国科学院深圳先进技术研究院); Zhujiang Hospital, Southern Medical University (南方医科大学珠江医院); The Key Laboratory of Biomedical Imaging Science and System, CAS (中国科学院生物医学成像科学与系统重点实验室)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hepato-pancreato-biliary (HPB) disorders represent a global public health challenge due to their high morbidity and mortality. Although large language models (LLMs) have shown promising performance in general medical question-answering tasks, the current evaluation benchmarks are mostly derived from standardized examinations or manually designed questions, lacking HPB coverage and clinical cases. To address these issues, we systematically eatablish an HPB disease evaluation benchmark comprising 3,535 closed-ended multiple-choice questions and 337 open-ended real diagnosis cases, which encompasses all the 33 main categories and 465 subcategories of HPB diseases defined in the International Statistical Classification of Diseases, 10th Revision (ICD-10). The multiple-choice questions are curated from public datasets and synthesized data, and the clinical cases are collected from prestigious medical journals, case-sharing platforms, and collaborating hospitals. By evalauting commercial and open-source general and medical LLMs on our established benchmark, namely ClinBench-HBP, we find that while commercial LLMs perform competently on medical exam questions, they exhibit substantial performance degradation on HPB diagnosis tasks, especially on complex, inpatient clinical cases. Those medical LLMs also show limited generalizability to HPB diseases. Our results reveal the critical limitations of current LLMs in the domain of HPB diseases, underscoring the imperative need for future medical LLMs to handle real, complex clinical diagnostics rather than simple medical exam questions. The benchmark will be released at the homepage.
zh

[NLP-282] HD-NDEs: Neural Differential Equations for Hallucination Detection in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中出现的幻觉(hallucination)问题,即模型生成不准确或非事实性陈述的现象。现有基于分类的方法虽然在减轻幻觉方面效率较高,但在处理输出早期或中期出现的非事实信息时表现不佳,导致可靠性下降。论文提出的解决方案是Hallucination Detection-Neural Differential Equations (HD-NDEs),其关键在于利用神经微分方程(Neural DEs)建模LLMs隐空间中的动态系统,并将隐空间中的序列映射到分类空间以进行真实性评估,从而系统地捕捉LLMs的完整动态特性,提升幻觉检测的准确性。

链接: https://arxiv.org/abs/2506.00088
作者: Qing Li,Jiahui Geng,Zongxiong Chen,Derui Zhu,Yuxia Wang,Congbo Ma,Chenyang Lyu,Fakhri Karray
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Fraunhofer Institute for Open Communication Systems (FOKUS); Technical University of Munich; New York University Abu Dhabi; Alibaba International Digital Commerce
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have made remarkable advancements, yet hallucination, where models produce inaccurate or non-factual statements, remains a significant challenge for real-world deployment. Although current classification-based methods, such as SAPLMA, are highly efficient in mitigating hallucinations, they struggle when non-factual information arises in the early or mid-sequence of outputs, reducing their reliability. To address these issues, we propose Hallucination Detection-Neural Differential Equations (HD-NDEs), a novel method that systematically assesses the truthfulness of statements by capturing the full dynamics of LLMs within their latent space. Our approaches apply neural differential equations (Neural DEs) to model the dynamic system in the latent space of LLMs. Then, the sequence in the latent space is mapped to the classification space for truth assessment. The extensive experiments across five datasets and six widely used LLMs demonstrate the effectiveness of HD-NDEs, especially, achieving over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques.
zh

[NLP-283] SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

【速读】: 该论文旨在解决自动语音识别(ASR)系统在处理代码切换(Code-Switching, CS)场景时的性能不足问题,尤其是现有单语数据集无法满足多语言、多文化应用场景的需求。其解决方案的关键在于提出了一种名为LinguaMaster的多智能体协作框架,用于高效且可扩展地合成多语言数据,并基于此构建了SwitchLingua数据集,该数据集包含大规模的多语言、多民族的代码切换文本和音频样本,同时引入了语义感知错误率(Semantic-Aware Error Rate, SAER)作为新的评估指标,以更准确地衡量系统在代码切换场景下的表现。

链接: https://arxiv.org/abs/2506.00087
作者: Peng Xie,Xingyuan Liu,Tsz Wai Chan,Yequan Bie,Yangqiu Song,Yang Wang,Hao Chen,Kani Chen
机构: The Hong Kong Unviersity of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbfLinguaMaster, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbfSwitchLingua, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbfSemantic-Aware Error Rate (SAER), a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.00087 [cs.CL] (or arXiv:2506.00087v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.00087 Focus to learn more arXiv-issued DOI via DataCite
zh

[NLP-284] COSMIC: Generalized Refusal Direction Identification in LLM Activations ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中拒绝行为的识别问题,尤其是如何在不依赖预定义拒绝模板或模型输出的情况下自动检测和操控这些行为。解决方案的关键在于提出一种名为COSMIC(Cosine Similarity Metrics for Inversion of Concepts)的自动化框架,该框架通过余弦相似度来选择可行的引导方向并确定目标层,从而实现对模型行为的有效控制。

链接: https://arxiv.org/abs/2506.00085
作者: Vincent Siu,Nicholas Crispino,Zihao Yu,Sam Pan,Zhun Wang,Yang Liu,Dawn Song,Chenguang Wang
机构: Washington University in St. Louis (圣路易斯华盛顿大学); University of California, Berkeley (加州大学伯克利分校); University of California, Santa Cruz (加州大学圣塔克鲁兹分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) encode behaviors such as refusal within their activation space, yet identifying these behaviors remains a significant challenge. Existing methods often rely on predefined refusal templates detectable in output tokens or require manual analysis. We introduce \textbfCOSMIC (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that identifies viable steering directions and target layers using cosine similarity - entirely independent of model outputs. COSMIC achieves steering performance comparable to prior methods without requiring assumptions about a model’s refusal behavior, such as the presence of specific refusal tokens. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals, demonstrating robustness across a wide range of alignment conditions.
zh

[NLP-285] Bottom-Up Perspectives on AI Governance: Insights from User Reviews of AI Products

【速读】: 该论文试图解决当前AI治理框架在实际应用中与用户实践之间存在的脱节问题,即现有高阶治理框架虽提供有价值的规范导向,但未能充分反映组织和操作环境中与AI系统互动者的实际关切。解决方案的关键在于采用自下而上的方法,通过分析用户对AI产品的评论,利用BERTopic模型提取与AI治理相关的潜在主题,从而揭示治理相关议题在技术与非技术领域的广泛分布,并强调用户视角对完善AI治理和数字政策的重要性。

链接: https://arxiv.org/abs/2506.00080
作者: Stefan Pasch
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:With the growing importance of AI governance, numerous high-level frameworks and principles have been articulated by policymakers, institutions, and expert communities to guide the development and application of AI. While such frameworks offer valuable normative orientation, they may not fully capture the practical concerns of those who interact with AI systems in organizational and operational contexts. To address this gap, this study adopts a bottom-up approach to explore how governance-relevant themes are expressed in user discourse. Drawing on over 100,000 user reviews of AI products from this http URL, we apply BERTopic to extract latent themes and identify those most semantically related to AI governance. The analysis reveals a diverse set of governance-relevant topics spanning both technical and non-technical domains. These include concerns across organizational processes-such as planning, coordination, and communication-as well as stages of the AI value chain, including deployment infrastructure, data handling, and analytics. The findings show considerable overlap with institutional AI governance and ethics frameworks on issues like privacy and transparency, but also surface overlooked areas such as project management, strategy development, and customer interaction. This highlights the need for more empirically grounded, user-centered approaches to AI governance-approaches that complement normative models by capturing how governance unfolds in applied settings. By foregrounding how governance is enacted in practice, this study contributes to more inclusive and operationally grounded approaches to AI governance and digital policy.
zh

[NLP-286] Gaussian mixture models as a proxy for interacting language models

【速读】: 该论文试图解决在社会科学研究中,由于大规模实验不可行,利用大型语言模型(Large Language Models, LLMs)模拟人类行为时所面临的计算复杂性和高成本问题。其解决方案的关键在于引入交互式高斯混合模型(Interacting Gaussian Mixture Models, GMMs),作为LLMs的替代框架,通过简化模型结构来捕捉LLMs在交互过程中的动态特性,并比较两者在行为生成上的异同。

链接: https://arxiv.org/abs/2506.00077
作者: Edward Wang,Tianyu Wang,Avanti Athreya,Vince Lyzinski,Carey E. Priebe
机构: Johns Hopkins University (约翰霍普金斯大学); University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are a powerful tool with the ability to match human capabilities and behavior in many settings. Retrieval-augmented generation (RAG) further allows LLMs to generate diverse output depending on the contents of their RAG database. This motivates their use in the social sciences to study human behavior between individuals when large-scale experiments are infeasible. However, LLMs depend on complex, computationally expensive algorithms. In this paper, we introduce interacting Gaussian mixture models (GMMs) as an alternative to similar frameworks using LLMs. We compare a simplified model of GMMs to select experimental simulations of LLMs whose updating and response depend on feedback from other LLMs. We find that interacting GMMs capture important features of the dynamics in interacting LLMs, and we investigate key similarities and differences between interacting LLMs and GMMs. We conclude by discussing the benefits of Gaussian mixture models, potential modifications, and future research directions.
zh

[NLP-287] Optimizing Storytelling Improving Audience Retention and Reducing Waste in the Entertainment Industry

【速读】: 该论文试图解决电视网络在节目制作决策中面临的高财务风险问题,特别是由于依赖有限的历史数据来预测单集收视率所带来的挑战。解决方案的关键在于引入一种机器学习框架,该框架结合了来自25000多个电视节目剧集的自然语言处理(NLP)特征与传统收视率数据,通过提取对话中的情感基调、认知复杂性和叙事结构,提升预测准确性。此外,该框架还利用基于对话向量之间欧几里得距离的相似性评分方法,以内容为基础比较节目,并在不同类型节目中测试,揭示了类型特定的性能表现,为编剧、高管和营销人员提供了可解释的指标以获得观众行为的数据驱动洞察。

链接: https://arxiv.org/abs/2506.00076
作者: Andrew Cornfeld,Ashley Miller,Mercedes Mora-Figueroa,Kurt Samuels,Anthony Palomba
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Television networks face high financial risk when making programming decisions, often relying on limited historical data to forecast episodic viewership. This study introduces a machine learning framework that integrates natural language processing (NLP) features from over 25000 television episodes with traditional viewership data to enhance predictive accuracy. By extracting emotional tone, cognitive complexity, and narrative structure from episode dialogue, we evaluate forecasting performance using SARIMAX, rolling XGBoost, and feature selection models. While prior viewership remains a strong baseline predictor, NLP features contribute meaningful improvements for some series. We also introduce a similarity scoring method based on Euclidean distance between aggregate dialogue vectors to compare shows by content. Tested across diverse genres, including Better Call Saul and Abbott Elementary, our framework reveals genre-specific performance and offers interpretable metrics for writers, executives, and marketers seeking data-driven insight into audience behavior.
zh

[NLP-288] he Automated but Risky Game: Modeling Agent -to-Agent Negotiations and Transactions in Consumer Markets

【速读】: 该论文试图解决在消费者市场中,通过授权AI代理完全自动化谈判和交易所带来的效率与风险问题,具体包括不同大语言模型(Large Language Model, LLM)代理在为用户争取有利交易方面的能力差异,以及完全自动化交易可能引发的潜在风险。解决方案的关键在于构建一个实验框架,用于评估多种LLM代理在现实世界中的谈判和交易表现,从而揭示AI中介交易的内在不平衡性及行为异常可能导致的财务损失问题。

链接: https://arxiv.org/abs/2506.00073
作者: Shenzhe Zhu,Jiao Sun,Yi Nian,Tobin South,Alex Pentland,Jiaxin Pei
机构: University of Toronto(多伦多大学); Google DeepMind(谷歌深度思维); University of Southern California(南加州大学); Massachusetts Institute of Technology(麻省理工学院); Stanford University(斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we explore a future scenario where both consumers and merchants authorize AI agents to fully automate negotiations and transactions. We aim to answer two key questions: (1) Do different LLM agents vary in their ability to secure favorable deals for users? (2) What risks arise from fully automating deal-making with AI agents in consumer markets? To address these questions, we develop an experimental framework that evaluates the performance of various LLM agents in real-world negotiation and transaction settings. Our findings reveal that AI-mediated deal-making is an inherently imbalanced game – different agents achieve significantly different outcomes for their users. Moreover, behavioral anomalies in LLMs can result in financial losses for both consumers and merchants, such as overspending or accepting unreasonable deals. These results underscore that while automation can improve efficiency, it also introduces substantial risks. Users should exercise caution when delegating business decisions to AI agents.
zh

[NLP-289] Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLM s AAMAS AAMAS2025 ALT

【速读】: 该论文试图解决在医疗场景中,提示工程(prompt engineering)技术如何影响大型语言模型(Large Language Models, LLMs)的准确性与置信度获取的问题。其解决方案的关键在于通过调整温度设置、提示风格(如链式思维、少样本、情感化、专家模仿)以及置信度量表,优化模型输出的准确性与置信度之间的对齐程度,并采用AUC-ROC、Brier Score和期望校准误差(Expected Calibration Error, ECE)等指标进行评估,以实现高风险医疗任务中对模型性能与不确定性的双重控制。

链接: https://arxiv.org/abs/2506.00072
作者: Nariman Naderi,Zahra Atf,Peter R Lewis,Aref Mahjoub far,Seyed Amir Ahmad Safavi-Naini,Ali Soroush
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This paper was accepted for presentation at the 7th International Workshop on EXplainable, Trustworthy, and Responsible AI and Multi-Agent Systems (EXTRAAMAS 2025). Workshop website: this https URL

点击查看摘要

Abstract:This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Expert Mimicry), and confidence scales (1-10, 1-100). We used AUC-ROC, Brier Score, and Expected Calibration Error (ECE) to evaluate alignment between confidence and actual performance. Chain-of-Thought prompts improved accuracy but also led to overconfidence, highlighting the need for calibration. Emotional prompting further inflated confidence, risking poor decisions. Smaller models like Llama-3.1-8b underperformed across all metrics, while proprietary models showed higher accuracy but still lacked calibrated confidence. These results suggest prompt engineering must address both accuracy and uncertainty to be effective in high-stakes medical tasks.
zh

[NLP-290] Evaluating the Sensitivity of LLM s to Prior Context

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在多轮对话等持续交互场景中,长期上下文对其性能影响的问题。现有主流基准测试主要针对单轮问答任务,未能有效捕捉多轮交互的影响。为弥补这一不足,研究者提出了一套新的基准测试,系统地变化先验上下文的量和性质,并评估了包括GPT、Claude和Gemini在内的多种传统LLMs对上下文变化的敏感性。研究发现,多轮交互可能导致多项选择题性能显著下降,某些模型甚至下降达73%,而GPT-4o也出现最高32%的准确率下降。关键解决方案在于通过合理安排任务描述在上下文中的位置,可显著缓解性能下降,提升准确率最多达3.5倍。这表明需要更稳健的策略来设计、评估和减轻LLMs的上下文相关敏感性。

链接: https://arxiv.org/abs/2506.00069
作者: Robert Hankache,Kingsley Nketia Acheampong,Liang Song,Marek Brynda,Raad Khraishi,Greig A. Cowan
机构: NatWest AI Research (NatWest人工智能研究); University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios, it is essential to understand how extended context affects their performance. Popular benchmarks, focusing primarily on single-turn question answering (QA) tasks, fail to capture the effects of multi-turn exchanges. To address this gap, we introduce a novel set of benchmarks that systematically vary the volume and nature of prior context. We evaluate multiple conventional LLMs, including GPT, Claude, and Gemini, across these benchmarks to measure their sensitivity to contextual variations. Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions, with performance drops as large as 73% for certain models. Even highly capable models such as GPT-4o exhibit up to a 32% decrease in accuracy. Notably, the relative performance of larger versus smaller models is not always predictable. Moreover, the strategic placement of the task description within the context can substantially mitigate performance drops, improving the accuracy by as much as a factor of 3.5. These findings underscore the need for robust strategies to design, evaluate, and mitigate context-related sensitivity in LLMs.
zh

[NLP-291] Probing Politico-Economic Bias in Multilingual Large Language Models : A Cultural Analysis of Low-Resource Pakistani Languages

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在非西方和低资源多语言语境下的政治经济偏见问题,特别是针对巴基斯坦的五种低资源语言(乌尔都语、旁遮普语、信德语、俾路支语和普什图语)进行系统性分析。其解决方案的关键在于提出一种整合了改进版政治光谱测试(Political Compass Test, PCT)与多层次框架分析的新框架,结合经济(左-右)和社会(自由-专制)轴线的定量评估以及内容、风格和重点的定性分析,同时将提示与11个与巴基斯坦社会相关的核心社会政治主题对齐,从而揭示模型在不同语言和文化背景下的意识形态表达差异及潜在偏见。

链接: https://arxiv.org/abs/2506.00068
作者: Afrozah Nadeem,Mark Dras,Usman Naseem
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly shaping public discourse, yet their politico-economic biases remain underexamined in non-Western and low-resource multilingual contexts. This paper presents a systematic analysis of political bias in 13 state-of-the-art LLMs across five low-resource languages spoken in Pakistan: Urdu, Punjabi, Sindhi, Balochi, and Pashto. We propose a novel framework that integrates an adapted Political Compass Test (PCT) with a multi-level framing analysis. Our method combines quantitative assessment of political orientation across economic (left-right) and social (libertarian-authoritarian) axes with qualitative analysis of framing through content, style, and emphasis. We further contextualize this analysis by aligning prompts with 11 key socio-political themes relevant to Pakistani society. Our results reveal that LLMs predominantly align with liberal-left values, echoing Western training data influences, but exhibit notable shifts toward authoritarian framing in regional languages, suggesting strong cultural modulation effects. We also identify consistent model-specific bias signatures and language-conditioned variations in ideological expression. These findings show the urgent need for culturally grounded, multilingual bias auditing frameworks.
zh

[NLP-292] You Prefer This One I Prefer Yours: Using Reference Words is Harder Than Vocabulary Words for Humans and Multimodal Language Models

【速读】: 该论文试图解决多模态语言模型(Multimodal Language Models, MLMs)在使用指代词方面能力不足的问题,特别是其在处理具有更高认知需求的词汇类别(如物主代词和指示代词)时的表现。研究的关键在于通过对比人类与MLMs在三种词类上的使用情况,揭示当前NLP系统在需要语用和社交认知的语法形式生成上仍存在显著挑战,其核心问题源于视角转换和空间推理能力的局限。

链接: https://arxiv.org/abs/2506.00065
作者: Dota Tianai Dong,Yifan Luo,Po-Ya Angela Wang,Asli Ozyurek,Paula Rubio-Fernandez
机构: Max Planck Institute for Psycholinguistics (马克斯·普朗克语言学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Multimodal language models (MLMs) increasingly communicate in human-like ways, yet their ability to use reference words remains largely overlooked despite their ubiquity in everyday communication. Our study addresses this gap by comparing human and MLM use of three word classes with increasing cognitive demands: vocabulary words, possessive pronouns (mine' vs yours’), and demonstrative pronouns (this one' vs that one’). Evaluating seven state-of-the-art MLMs against human participants, we observe a clear difficulty hierarchy: while MLMs approach human-level performance on the vocabulary task, they show substantial deficits with possessives and demonstratives. Our analysis reveals these difficulties stem from limitations in perspective-taking and spatial reasoning. Although prompt engineering improved model performance on possessive use, demonstrative use remained well below human-level competence. These findings provide theoretical and empirical evidence that producing grammatical forms requiring pragmatics and social cognition remains a clear challenge in current NLP systems.
zh

[NLP-293] Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

【速读】: 该论文试图解决在缺乏显式错误处理指令的情况下,如何实现主动错误处理(proactive error handling)的问题。解决方案的关键在于引入一个名为Mis-prompt的新基准,该基准包含四个评估任务、一个错误类别分类体系以及一个新的评估数据集,以促进对主动错误处理能力的研究,并通过错误处理实例的监督微调(SFT)提升大语言模型(LLMs)的主动错误处理能力。

链接: https://arxiv.org/abs/2506.00064
作者: Jiayi Zeng,Yizhe Feng,Mengliang He,Wenhui Lei,Wei Zhang,Zeming Liu,Xiaoming Shi,Aimin Zhou
机构: East China Normal University (华东师范大学); Beihang University (北京航空航天大学); Shanghai Jiaotong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling works are performed in a passive manner, with explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. In this paper, our work identifies this challenge as how to conduct proactive error handling without explicit error handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs’ performance on the benchmark, and the experimental results reveal that current LLMs show poor performance on proactive error handling, and SFT on error handling instances improves LLMs’ proactive error handling capabilities. The dataset will be publicly available.
zh

[NLP-294] SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models ?

【速读】: 该论文试图解决在电信领域微调大型语言模型(Large Language Models, LLMs)过程中可能出现的模型安全性退化问题,即微调可能导致模型对有害或不道德用户查询的响应能力下降。解决方案的关键在于通过引入三种安全再对齐防御机制(SafeInstruct、SafeLoRA 和 SafeMERGE),在恢复模型安全性的同时保持其下游任务性能,从而构建安全的电信通信(SafeCOMM)模型。

链接: https://arxiv.org/abs/2506.00062
作者: Aladin Djuhera,Swanand Ravindra Kadhe,Farhan Ahmed,Syed Zawad,Holger Boche,Walid Saad
机构: Technical University of Munich (慕尼黑工业大学); IBM Research (IBM研究院); Virginia Tech (弗吉尼亚理工大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) for telecom tasks and datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this process may compromise model safety. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries. In this paper, we investigate this issue for telecom-tuned LLMs using three representative datasets featured by the GenAINet initiative. We show that safety degradation persists even for structured and seemingly harmless datasets such as 3GPP standards and tabular records, indicating that telecom-specific data is not immune to safety erosion during fine-tuning. We further extend our analysis to publicly available Telecom LLMs trained via continual pre-training, revealing that safety alignment is often severely lacking, primarily due to the omission of safety-focused instruction tuning. To address these issues in both fine-tuned and pre-trained models, we conduct extensive experiments and evaluate three safety realignment defenses (SafeInstruct, SafeLoRA, and SafeMERGE) using established red-teaming benchmarks. The results show that, across all settings, the proposed defenses can effectively restore safety after harmful degradation without compromising downstream task performance, leading to Safe teleCOMMunication (SafeCOMM) models. In a nutshell, our work serves as a diagnostic study and practical guide for safety realignment in telecom-tuned LLMs, and emphasizes the importance of safety-aware instruction and fine-tuning for real-world deployments of Telecom LLMs.
zh

[NLP-295] Unraveling SITT: Social Influence Technique Taxonomy and Detection with LLM s

【速读】: 该论文试图解决如何检测文本内容中微妙的社会影响策略的问题,其解决方案的关键在于构建了一个包含58种实证基础技术的综合框架——社会影响技术分类法(Social Influence Technique Taxonomy, SITT),并基于此构建了用于评估大语言模型(LLM)识别能力的数据集。该数据集由746条对话组成,由11名专家在波兰语中进行标注并翻译为英语,通过分层多标签分类设置对五种LLM进行了基准测试,结果表明当前模型在识别上下文敏感的技术方面仍存在局限性,凸显了领域特定微调的重要性。

链接: https://arxiv.org/abs/2506.00061
作者: Wiktoria Mieleszczenko-Kowszewicz,Beata Bajcar,Aleksander Szczęsny,Maciej Markiewicz,Jolanta Babiak,Berenika Dyczek,Przemysław Kazienko
机构: Warsaw University of Technology (华沙理工大学); Wrocław University of Science and Technology (弗罗茨瓦夫科技大学); University of Wrocław (弗罗茨瓦夫大学); Lincoln University College, Petaling Jaya, Malaysian (林肯大学学院,吉隆坡,马来西亚)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work we present the Social Influence Technique Taxonomy (SITT), a comprehensive framework of 58 empirically grounded techniques organized into nine categories, designed to detect subtle forms of social influence in textual content. We also investigate the LLMs ability to identify various forms of social influence. Building on interdisciplinary foundations, we construct the SITT dataset – a 746-dialogue corpus annotated by 11 experts in Polish and translated into English – to evaluate the ability of LLMs to identify these techniques. Using a hierarchical multi-label classification setup, we benchmark five LLMs, including GPT-4o, Claude 3.5, Llama-3.1, Mixtral, and PLLuM. Our results show that while some models, notably Claude 3.5, achieved moderate success (F1 score = 0.45 for categories), overall performance of models remains limited, particularly for context-sensitive techniques. The findings demonstrate key limitations in current LLMs’ sensitivity to nuanced linguistic cues and underscore the importance of domain-specific fine-tuning. This work contributes a novel resource and evaluation example for understanding how LLMs detect, classify, and potentially replicate strategies of social influence in natural dialogues.
zh

[NLP-296] Comparative analysis of privacy-preserving open-source LLM s regarding extraction of diagnostic information from clinical CMR imaging reports

【速读】: 该论文旨在解决如何在保护患者隐私的前提下,利用本地部署的开源大型语言模型(LLMs)从自由文本的心血管磁共振(CMR)报告中提取诊断信息的问题。解决方案的关键在于采用隐私保护、本地部署的开源LLMs,通过其强大的自然语言处理能力实现对临床CMR报告的自动诊断分类,从而提供一种准确、快速且资源高效的分析方法。

链接: https://arxiv.org/abs/2506.00060
作者: Sina Amirrajab,Volker Vehof,Michael Bietenbeck,Ali Yilmaz
机构: Maastricht University (马斯特里赫特大学); University Hospital Münster (明斯特大学医院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: under review for Scientific Reports

点击查看摘要

Abstract:Purpose: We investigated the utilization of privacy-preserving, locally-deployed, open-source Large Language Models (LLMs) to extract diagnostic information from free-text cardiovascular magnetic resonance (CMR) reports. Materials and Methods: We evaluated nine open-source LLMs on their ability to identify diagnoses and classify patients into various cardiac diagnostic categories based on descriptive findings in 109 clinical CMR reports. Performance was quantified using standard classification metrics including accuracy, precision, recall, and F1 score. We also employed confusion matrices to examine patterns of misclassification across models. Results: Most open-source LLMs demonstrated exceptional performance in classifying reports into different diagnostic categories. Google’s Gemma2 model achieved the highest average F1 score of 0.98, followed by Qwen2.5:32B and DeepseekR1-32B with F1 scores of 0.96 and 0.95, respectively. All other evaluated models attained average scores above 0.93, with Mistral and DeepseekR1-7B being the only exceptions. The top four LLMs outperformed our board-certified cardiologist (F1 score of 0.94) across all evaluation metrics in analyzing CMR reports. Conclusion: Our findings demonstrate the feasibility of implementing open-source, privacy-preserving LLMs in clinical settings for automated analysis of imaging reports, enabling accurate, fast and resource-efficient diagnostic categorization.
zh

[NLP-297] Retrieval-Augmented Generation: A Comprehensive Survey of Architectures Enhancements and Robustness Frontiers

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在事实一致性、领域灵活性等方面的局限性,通过引入检索增强生成(Retrieval-Augmented Generation, RAG)范式来提升模型的性能。其解决方案的关键在于利用外部证据在推理过程中进行条件生成,从而增强模型的准确性和适应性。论文系统分析了RAG系统的多个优化方向,包括检索优化、上下文过滤、解码控制和效率提升,并探讨了检索精度与生成灵活性、效率与忠实性之间的权衡问题。

链接: https://arxiv.org/abs/2506.00054
作者: Chaitanya Sharma
机构: Independent Researcher(独立研究员)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance large language models (LLMs) by conditioning generation on external evidence retrieved at inference time. While RAG addresses critical limitations of parametric knowledge storage-such as factual inconsistency and domain inflexibility-it introduces new challenges in retrieval quality, grounding fidelity, pipeline efficiency, and robustness against noisy or adversarial inputs. This survey provides a comprehensive synthesis of recent advances in RAG systems, offering a taxonomy that categorizes architectures into retriever-centric, generator-centric, hybrid, and robustness-oriented designs. We systematically analyze enhancements across retrieval optimization, context filtering, decoding control, and efficiency improvements, supported by comparative performance analyses on short-form and multi-hop question answering tasks. Furthermore, we review state-of-the-art evaluation frameworks and benchmarks, highlighting trends in retrieval-aware evaluation, robustness testing, and federated retrieval settings. Our analysis reveals recurring trade-offs between retrieval precision and generation flexibility, efficiency and faithfulness, and modularity and coordination. We conclude by identifying open challenges and future research directions, including adaptive retrieval architectures, real-time retrieval integration, structured reasoning over multi-hop evidence, and privacy-preserving retrieval mechanisms. This survey aims to consolidate current knowledge in RAG research and serve as a foundation for the next generation of retrieval-augmented language modeling systems.
zh

[NLP-298] Enhancing Tool Learning in Large Language Models with Hierarchical Error Checklists

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在调用外部工具和API时因参数填写错误而导致的效果下降问题。其解决方案的关键在于提出一种分层的工具错误检查清单(Hierarchical Tool Error Checklist, HiTEC)框架,该框架通过全局错误检查清单识别跨工具的通用问题,并结合局部错误检查清单定位特定工具和上下文中的故障,从而系统性地诊断和缓解工具调用错误。

链接: https://arxiv.org/abs/2506.00042
作者: Yue Cui,Liuyi Yao,Shuchang Tao,Weijie Shi,Yaliang Li,Bolin Ding,Xiaofang Zhou
机构: Alibaba Group (阿里巴巴集团); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced natural language processing, particularly through the integration of external tools and APIs. However, their effectiveness is frequently hampered by parameter mis-filling during tool calling. In this paper, we propose the Hierarchical Tool Error Checklist (HiTEC) framework to systematically diagnose and mitigate tool-calling errors without relying on extensive real-world interactions. HiTEC introduces a two-tiered approach: a global error checklist that identifies common, cross-tool issues, and a local error checklist that targets tool-specific and contextual failures. Building on this structure, we propose two deployments: HiTEC-In Context Learning (HiTEC-ICL) and HiTEC-Kahneman-Tversky Optimization (HiTEC-KTO). HiTEC-ICL embeds the global checklist in the initial prompts and leverages a two-round conversational interaction to dynamically refine parameter handling, while HiTEC-KTO generates high-quality negative examples to drive fine-tuning via preference-based optimization. Extensive experiments across five public datasets demonstrate that our framework significantly improves parameter-filling accuracy and tool-calling success rates compared to baseline methods.
zh

[NLP-299] From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling

【速读】: 该论文旨在解决大型语言模型在复杂推理任务中因中间错误导致性能下降的问题,其解决方案的关键在于利用过程奖励模型(Process Reward Model, PRM)通过结构化反馈机制提升模型的推理能力。研究重点分析了PRM的训练方法、可扩展性及泛化能力,并揭示了预训练与奖励模型训练计算量(FLOPs)之间的相互作用,强调了模型规模与计算成本之间的平衡对PRM效率和准确性的重要性。此外,研究还探讨了不同测试时缩放策略的有效性,如蒙特卡洛树搜索在资源充足时的表现优于最佳N采样方法,在资源受限情况下则更具实用性。

链接: https://arxiv.org/abs/2506.00027
作者: Zhengyu Chen,Yudong Wang,Teng Xiao,Ruochen Zhou,Xuesheng Yang,Wei Wang,Zhifang Sui,Jingang Wang
机构: Meituan Inc. (美团公司); National Key Laboratory for Multimedia Information Processing, Peking University (北京大学多媒体信息处理国家重点实验室); Pennsylvania State University (宾夕法尼亚州立大学); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy in complex reasoning tasks. Our analysis reveals a pattern of diminishing returns in performance with increasing PRM scale, highlighting the importance of balancing model size and computational cost. Furthermore, the diversity of training datasets significantly impacts PRM performance, emphasizing the importance of diverse data to enhance both accuracy and efficiency. We further examine test-time scaling strategies, identifying Monte Carlo Tree Search as the most effective method when computational resources are abundant, while Best-of-N Sampling serves as a practical alternative under resource-limited conditions. Notably, our findings indicate that PRMs trained on mathematical datasets exhibit performance comparable to those tailored for code generation, suggesting robust cross-domain generalization. Employing a gradient-based metric, we observe that PRMs exhibit a preference for selecting responses with similar underlying patterns, further informing their optimization.
zh

[NLP-300] Scaling Physical Reasoning with the PHYSICS Dataset

【速读】: 该论文试图解决物理领域在大型语言模型(Large Language Models, LLMs)研究中受到关注不足的问题,尤其是物理学科因其高度推理性和对现实世界理解的重要性,却未获得相应的学术和工业关注。解决方案的关键在于构建一个高质量的物理问题数据集PHYSICS,该数据集包含16,568道涵盖多个物理领域和难度级别的问题,并通过精心设计的质量控制流程从超过100本教材中筛选内容。此外,为提升模型的物理推理能力,数据集被划分为训练集和测试集,并为训练数据提供了由强大推理模型生成的推理路径;同时,引入了针对物理问题定制的Rule+Model评估框架,以平衡效率与准确性。

链接: https://arxiv.org/abs/2506.00022
作者: Shenghe Zheng,Qianjia Cheng,Junchi Yao,Mengsong Wu,haonan he,Ning Ding,Yu Cheng,Shuyue Hu,Lei Bai,Dongzhan Zhou,Ganqu Cui,Peng Ye
机构: Shanghai AI Laboratory; Harbin Institute of Technology; Tsinghua University; The Chinese University of Hong Kong; Beijing University of Aeronautics and Astronautics; University of Electronic Science and Technology of China; Suzhou University; University of Science and Technology of China
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Physics Education (physics.ed-ph)
备注: Work on physical datasets

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to facilitate this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model’s physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
zh

[NLP-301] Amadeus-Verbo Technical Report: The powerful Qwen 2.5 family models trained in Portuguese

【速读】: 该论文试图解决如何利用可用的数据和资源,通过微调基础模型来促进巴西葡萄牙语大型语言模型(Large Language Models, LLMs)的开源开发问题。解决方案的关键在于构建一个包含基础微调、合并和指令微调模型的系列,覆盖多种参数规模(0.5B, 1.5B, 3B, 7B, 14B, 32B, 和 72B),从而展示微调过程的简便性与有效性。

链接: https://arxiv.org/abs/2506.00019
作者: William Alberto Cruz-Castañeda,Marcellus Amadeus
机构: Amadeus AI (Amadeus AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This report introduces the experience of developing Amadeus Verbo, a family of large language models for Brazilian Portuguese. To handle diverse use cases, Amadeus Verbo includes base-tuned, merged, and instruction-tuned models in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Thus, the main objective is to show how easy it is to fine-tune foundation models to democratize the open-source development of Brazilian Portuguese LLMs when data and resources are available. Amadeus-Verbo family models are all available at HuggingFace at this https URL.
zh

[NLP-302] Probing Audio-Generation Capabilities of Text-Based Language Models NAACL

【速读】: 该论文试图解决的问题是:文本形式的音频表示如何与大型语言模型(Large Language Model, LLM)对音频世界的理解相关联,以及LLMs在仅接受文本数据训练的情况下,能否被提示生成音频。论文提出的解决方案的关键在于采用一种三层级方法,逐步增加音频生成的复杂度,并通过代码作为文本与音频之间的中介,促使LLMs生成可执行代码以产生目标音频输出。

链接: https://arxiv.org/abs/2506.00003
作者: Arjun Prasaath Anbazhagan,Parteek Kumar,Ujjwal Kaur,Aslihan Akalin,Kevin Zhu,Sean O’Brien
机构: Algoverse AI Research (Algoverse AI 研究)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at Conference of the North American Chapter of the Association for Computational Linguistics 2025, Student Research Workshop (NAACL SRW)

点击查看摘要

Abstract:How does textual representation of audio relate to the Large Language Model’s (LLMs) learning about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite their primary training in textual data. We employ a three-tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we leverage code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ FAD and CLAP scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world, their ability to translate this understanding into tangible audio output remains rudimentary. Further research into techniques that can enhance the quality and diversity of LLM-generated audio can lead to an improvement in the performance of text-based LLMs in generating audio.
zh

[NLP-303] Enhancing Finite State Machine Design Automation with Large Language Models and Prompt Engineering Techniques

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在有限状态机(Finite State Machine, FSM)设计中的性能问题,特别是其稳定性、局限性以及提升成功率的潜在方法。解决方案的关键在于采用系统化的格式化提示方法和创新的提示优化技术——以任务为导向的提示(To-do-Oriented Prompting, TOP)补丁,以提高LLMs在不同FSM设计场景中的成功率,并探索其在HDL设计自动化以外领域的应用潜力。

链接: https://arxiv.org/abs/2506.00001
作者: Qun-Kai Lin,Cheng Hsu,Tian-Sheuan Chang
机构: 未知
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注: published in 2024 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2024)

点击查看摘要

Abstract:Large Language Models (LLMs) have attracted considerable attention in recent years due to their remarkable compatibility with Hardware Description Language (HDL) design. In this paper, we examine the performance of three major LLMs, Claude 3 Opus, ChatGPT-4, and ChatGPT-4o, in designing finite state machines (FSMs). By utilizing the instructional content provided by HDLBits, we evaluate the stability, limitations, and potential approaches for improving the success rates of these models. Furthermore, we explore the impact of using the prompt-refining method, To-do-Oriented Prompting (TOP) Patch, on the success rate of these LLM models in various FSM design scenarios. The results show that the systematic format prompt method and the novel prompt refinement method have the potential to be applied to other domains beyond HDL design automation, considering its possible integration with other prompt engineering techniques in the future.
zh

[NLP-304] GLEN: Generative Retrieval via Lexical Index Learning EMNLP2023

【速读】: 该论文旨在解决生成式检索(Generative Retrieval)中的两个关键问题:一是预训练语言模型与文档标识符之间的知识差异,二是训练与推理之间的差距导致的排序学习困难。其解决方案的关键在于提出一种名为GLEN(Generative retrieval via LExical iNdex learning)的新方法,该方法通过两阶段的词法索引学习策略,在训练阶段有效利用动态词法标识符以学习查询与文档间的语义关联;在推理阶段采用无冲突的推理机制,利用标识符权重对文档进行排序,从而避免额外计算开销。

链接: https://arxiv.org/abs/2311.03057
作者: Sunkyung Lee,Minjin Choi,Jongwuk Lee
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) main conference. 12 pages, 2 figures, 8 tables

点击查看摘要

Abstract:Generative retrieval shed light on a new paradigm of document retrieval, aiming to directly generate the identifier of a relevant document for a query. While it takes advantage of bypassing the construction of auxiliary index structures, existing studies face two significant challenges: (i) the discrepancy between the knowledge of pre-trained language models and identifiers and (ii) the gap between training and inference that poses difficulty in learning to rank. To overcome these challenges, we propose a novel generative retrieval method, namely Generative retrieval via LExical iNdex learning (GLEN). For training, GLEN effectively exploits a dynamic lexical identifier using a two-phase index learning strategy, enabling it to learn meaningful lexical identifiers and relevance signals between queries and documents. For inference, GLEN utilizes collision-free inference, using identifier weights to rank documents without additional overhead. Experimental results prove that GLEN achieves state-of-the-art or competitive performance against existing generative retrieval methods on various benchmark datasets, e.g., NQ320k, MS MARCO, and BEIR. The code is available at this https URL.
zh

[NLP-305] LinearVC: Linear transformations of self-supervised features through the lens of voice conversion INTERSPEECH2025

【速读】: 该论文试图解决语音转换(Voice Conversion, VC)问题,旨在通过简单的方法揭示自监督表示的结构。其解决方案的关键在于利用自监督特征的线性变换实现高质量的语音转换,特别是通过旋转特征空间中的表示即可完成有效的语音转换,表明内容信息嵌入在一个低维子空间中,可通过线性变换生成目标语音。

链接: https://arxiv.org/abs/2506.01510
作者: Herman Kamper,Benjamin van Niekerk,Julian Zaïdi,Marc-André Carbonneau
机构: Ubisoft La Forge (育碧拉法耶特工作室)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: this https URL.
zh

[NLP-306] Confidence intervals for forced alignment boundaries using model ensembles

【速读】: 该论文试图解决强制对齐(forced alignment)中仅提供边界单次估计而缺乏置信度信息的问题。解决方案的关键在于利用神经网络集成技术(neural network ensemble technique),通过训练多个段落分类器神经网络,并重复对齐过程,最终以集成结果的中位数作为边界位置,同时基于顺序统计量构建97.85%的置信区间,从而量化对齐结果的不确定性。

链接: https://arxiv.org/abs/2506.01256
作者: Matthew C. Kelley
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: submitted for publication; 7 pages, 1 figure

点击查看摘要

Abstract:Forced alignment is a common tool to align audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single estimate of a boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. The alignment ensemble is then used to place the boundary at the median of the boundaries in the ensemble, and 97.85% confidence intervals are constructed using order statistics. On the Buckeye and TIMIT corpora, the ensemble boundaries show a slight improvement over using just a single model. The confidence intervals are incorporated into Praat TextGrids using a point tier, and they are also output as a table for researchers to analyze separately as diagnostics or to incorporate uncertainty into their analyses.
zh

[NLP-307] Pushing the Limits of Beam Search Decoding for Transducer-based ASR models INTERSPEECH2025

【速读】: 该论文旨在解决基于Transducer模型的端到端自动语音识别(ASR)系统中,因使用束搜索(beam search)而导致的推理速度显著下降的问题。其关键解决方案是提出一种通用加速方法,通过引入批量操作、基于树状的假设结构、新颖的空白得分机制以增强浅层融合,以及利用CUDA图执行实现高效的GPU推理,从而大幅缩小束搜索与贪婪解码之间的速度差距,并提升识别准确率和低资源场景下的浅层融合效果。

链接: https://arxiv.org/abs/2506.00185
作者: Lilit Grigoryan,Vladimir Bataev,Andrei Andrusenko,Hainan Xu,Vitaly Lavrukhin,Boris Ginsburg
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion for low-resource up to 11% compared to existing implementations. All the algorithms are open sourced.
zh

计算机视觉

[CV-0] DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes

【速读】:该论文试图解决机器人在动态变化环境中通过自然语言查询进行理解和导航的问题,特别是传统方法中因需要昂贵的3D物体合并而导致的效率低下问题。解决方案的关键在于提出了一种混合分割前端和基于物体级别的状态检查机制,从而消除了对3D物体合并的依赖,实现了高效的在线场景映射。此外,双图表示结构结合了全局抽象地图用于高层次候选选择与局部具体地图用于精确目标到达,有效管理并更新环境中的动态变化。

链接: https://arxiv.org/abs/2506.01950
作者: Jiajun Jiang,Yiming Zhu,Zirui Wu,Jie Song
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures. Code: this https URL Project page: this https URL

点击查看摘要

Abstract:We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation.
zh

[CV-1] IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

【速读】:该论文试图解决多对象场景下图像编辑中对物体类别、数量及空间布局的精确控制问题,这一问题在现有图像编辑方法中被显著忽视。其解决方案的关键在于提出IMAGHarmony框架,该框架通过引入和谐感知注意力(harmony-aware attention, HA)来整合多模态语义,并显式建模物体数量和布局,从而提升编辑的准确性与结构一致性。此外,还提出了偏好引导的噪声选择(preference-guided noise selection, PNS)策略,以基于视觉-语言匹配选择语义对齐的初始噪声样本,增强多对象编辑的生成稳定性和布局一致性。

链接: https://arxiv.org/abs/2506.01949
作者: Fei Shen,Xiaoyu Du,Yutong Gao,Jian Yu,Yushe Cao,Xing Lei,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); Tsinghua University (清华大学); University of California (加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at this https URL.
zh

[CV-2] RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report CVPR2025

【速读】:该论文试图解决从sRGB图像中逆向重建RAW图像的问题,即Reverse ISP(图像信号处理)任务。其关键在于利用传感器信息和已有的sRGB图像生成逼真的RAW数据,从而克服RAW图像数据集稀缺且收集成本高的问题。通过这一过程,论文旨在恢复智能手机拍摄的RAW传感器图像,而无需依赖元数据。

链接: https://arxiv.org/abs/2506.01947
作者: Marcos V. Conde,Radu Timofte,Radu Berdan,Beril Besbinar,Daisuke Iso,Pengzhou Ji,Xiong Dun,Zeying Fan,Chen Wu,Zhansheng Wang,Pengbo Zhang,Jiazi Huang,Qinglin Liu,Wei Yu,Shengping Zhang,Xiangyang Ji,Kyungsik Kim,Minkyung Kim,Hwalmin Lee,Hekun Ma,Huan Zheng,Yanyan Wei,Zhao Zhang,Jing Fang,Meilin Gao,Xiang Yu,Shangbin Xie,Mengyuan Sun,Huanjing Yue,Jingyu Yang Huize Cheng,Shaomeng Zhang,Zhaoyang Zhang,Haoxiang Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

点击查看摘要

Abstract:Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, ``reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data.
zh

[CV-3] MLLM s Need 3D-Aware Representation Supervision for Scene Understanding

【速读】:该论文试图解决多模态大语言模型(MLLM)在3D表示能力上的不足,尤其是在缺乏显式3D数据的情况下,导致其在3D推理任务中的表现受限。解决方案的关键在于提出3DRS框架,通过引入预训练3D基础模型的监督信号,增强MLLM的3D表示学习能力,从而提升场景理解性能。

链接: https://arxiv.org/abs/2506.01946
作者: Xiaohu Huang,Jingjing Wu,Qunyi Xie,Kai Han
机构: The University of Hong Kong (香港大学); Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs – including visual grounding, captioning, and question answering – demonstrate consistent performance gains. Project page: this https URL
zh

[CV-4] Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

【速读】:该论文试图解决现有基于轨迹的方法在复杂机器人操作任务中难以捕捉多物体交互的问题,这一问题源于重叠区域中的多特征纠缠导致的视觉保真度下降。其解决方案的关键在于提出RoboMaster框架,通过协同轨迹建模来表征物体间动力学,将交互过程分解为预交互、交互和后交互三个子阶段,并分别利用主导物体的特征进行建模,从而缓解了先前方法在交互过程中多物体特征融合的缺陷。

链接: https://arxiv.org/abs/2506.01943
作者: Xiao Fu,Xintao Wang,Xian Liu,Jianhong Bai,Runsen Xu,Pengfei Wan,Di Zhang,Dahua Lin
机构: The Chinese University of Hong Kong (香港中文大学); Kuaishou Technology (快手科技); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
zh

[CV-5] OD3: Optimization-free Dataset Distillation for Object Detection

【速读】:该论文旨在解决在大规模数据集上训练大型神经网络所需的高计算资源问题,特别是在密集预测任务如目标检测中的挑战。现有研究主要集中在图像分类任务上的数据蒸馏(Dataset Distillation, DD),而目标检测这一更复杂的设置尚未得到充分探索。论文提出的解决方案是OD3,一个无需优化的数据蒸馏框架,其关键在于分两阶段进行:首先通过迭代方式将对象实例放置在合成图像中合适的位置,其次利用预训练的观察者模型对候选对象进行筛选,去除低置信度的对象。该方法在MS COCO和PASCAL VOC数据集上实现了较高的压缩比,并显著提升了目标检测的性能。

链接: https://arxiv.org/abs/2506.01942
作者: Salwa K. Al Khatib(1),Ahmed ElHagry(1),Shitong Shao(2 and 1),Zhiqiang Shen(1) ((1) Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), (2) Hong Kong University of Science and Technology (Guangzhou))
机构: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI); Hong Kong University of Science and Technology (Guangzhou)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Equal Contribution of the first three authors

点击查看摘要

Abstract:Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD3, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We perform our data synthesis framework on MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the prior solely existing dataset distillation method on detection and conventional core set selection methods, OD3 delivers superior accuracy, establishes new state-of-the-art results, surpassing prior best method by more than 14% on COCO mAP50 at a compression ratio of 1.0%. Code and condensed datasets are available at: this https URL.
zh

[CV-6] Fast and Robust Rotation Averag ing with Anisotropic Coordinate Descent

【速读】:该论文试图解决各向同性旋转平均方法在最优性、鲁棒性和效率之间的平衡问题。其关键解决方案是分析一种最初用于优化标准弦距的块坐标下降方法,并推导出一个更简单的公式及其各向异性扩展,从而获得一个快速的通用求解器。该求解器被集成到扩展的各向异性大规模鲁棒旋转平均流程中,实现了在公开结构从运动数据集上的最先进性能。

链接: https://arxiv.org/abs/2506.01940
作者: Yaroslava Lochman,Carl Olsson,Christopher Zach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anisotropic rotation averaging has recently been explored as a natural extension of respective isotropic methods. In the anisotropic formulation, uncertainties of the estimated relative rotations – obtained via standard two-view optimization – are propagated to the optimization of absolute rotations. The resulting semidefinite relaxations are able to recover global minima but scale poorly with the problem size. Local methods are fast and also admit robust estimation but are sensitive to initialization. They usually employ minimum spanning trees and therefore suffer from drift accumulation and can get trapped in poor local minima. In this paper, we attempt to bridge the gap between optimality, robustness and efficiency of anisotropic rotation averaging. We analyze a family of block coordinate descent methods initially proposed to optimize the standard chordal distances, and derive a much simpler formulation and an anisotropic extension obtaining a fast general solver. We integrate this solver into the extended anisotropic large-scale robust rotation averaging pipeline. The resulting algorithm achieves state-of-the-art performance on public structure-from-motion datasets. Project page: this https URL
zh

[CV-7] Low-Rank Head Avatar Personalization with Registers

【速读】:该论文旨在解决通用模型在生成头部虚拟形象时难以捕捉个体特定细节的问题,因为这类模型通常学习的是通用领域先验,无法有效合成独特的身份特征。解决方案的关键在于提出一种名为Register Module的专用架构,该模块通过增强低秩适应(LoRA)的效果,在仅需少量参数的情况下,能够适应未见过的身份,并在预训练模型的中间特征上存储和重用信息,从而提升个性化效果。

链接: https://arxiv.org/abs/2506.01935
作者: Sai Tanmay Reddy Chakkera,Aggelina Chatziagapi,Md Moniruzzaman,Chen-Ping Yu,Yi-Hsuan Tsai,Dimitris Samaras
机构: Stony Brook University (纽约州立大学石溪分校); Atmanity Inc. (Atmanity公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 16 figures. Project page: this https URL

点击查看摘要

Abstract:We introduce a novel method for low-rank personalization of a generic model for head avatar generation. Prior work proposes generic models that achieve high-quality face animation by leveraging large-scale datasets of multiple identities. However, such generic models usually fail to synthesize unique identity-specific details, since they learn a general domain prior. To adapt to specific subjects, we find that it is still challenging to capture high-frequency facial details via popular solutions like low-rank adaptation (LoRA). This motivates us to propose a specific architecture, a Register Module, that enhances the performance of LoRA, while requiring only a small number of parameters to adapt to an unseen identity. Our module is applied to intermediate features of a pre-trained model, storing and re-purposing information in a learnable 3D feature space. To demonstrate the efficacy of our personalization method, we collect a dataset of talking videos of individuals with distinctive facial details, such as wrinkles and tattoos. Our approach faithfully captures unseen faces, outperforming existing methods quantitatively and qualitatively. We will release the code, models, and dataset to the public.
zh

[CV-8] E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models

【速读】:该论文旨在解决从非结构化或流式图像中实时、准确估计核心3D属性(如相机参数、点云、深度图和3D点轨迹)的问题,这是空间智能(spatial intelligence)在机器人、航拍成像和扩展现实等应用中的基础。其解决方案的关键是提出一种端到端的3D几何基础模型(GFMs),该模型能够在单次前向传播中直接预测密集的3D表示,从而避免依赖缓慢或不可用的预计算相机参数。

链接: https://arxiv.org/abs/2506.01933
作者: Wenyan Cong,Yiqing Liang,Yancheng Zhang,Ziyi Yang,Yan Wang,Boris Ivanovic,Marco Pavone,Chen Chen,Zhangyang Wang,Zhiwen Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. Since late 2023, the field has exploded with diverse variants, but systematic evaluation is lacking. In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks: sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, novel view synthesis, and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
zh

[CV-9] Image Generation from Contextually-Contradictory Prompts

【速读】:该论文试图解决文本到图像扩散模型在处理包含语境矛盾(contextual contradiction)的提示时,生成结果语义不准确的问题。其关键解决方案是提出一种阶段感知的提示分解框架,通过一系列代理提示(proxy prompt)引导去噪过程,每个代理提示旨在匹配去噪特定阶段预期出现的语义内容,同时确保语境一致性。该方法利用大语言模型(Large Language Model, LLM)分析目标提示,识别矛盾并生成保留原意的同时解决语境冲突的替代表达,从而实现对语义细节的精细控制和在存在语境矛盾情况下的准确图像生成。

链接: https://arxiv.org/abs/2506.01929
作者: Saar Huberman,Or Patashnik,Omer Dahary,Ron Mokady,Daniel Cohen-Or
机构: Tel Aviv University (特拉维夫大学); BRIA AI (BRIA AI)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.
zh

[CV-10] axaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation

【速读】:该论文试图解决在生成细粒度动物图像时,如何提高形态学和身份准确性的难题。传统方法将每个物种视为独立类别,未能充分利用物种间存在的视觉相似性及分类学上的层级关系。其解决方案的关键在于引入分类学知识,通过分层训练条件扩散模型,在不同分类级别(如纲、目、科、属、种)上逐步细化特征,从而实现从粗粒度到细粒度的特征学习与区分,提升生成图像的准确性,尤其在每物种训练样本有限的情况下仍能保持良好性能。

链接: https://arxiv.org/abs/2506.01923
作者: Amin Karimi Monsefi,Mridul Khurana,Rajiv Ramnath,Anuj Karpatne,Wei-Lun Chao,Cheng Zhang
机构: The Ohio State University (俄亥俄州立大学); Virginia Tech (弗吉尼亚理工学院); Texas A&M University (德克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels – starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Project page: this https URL
zh

[CV-11] MedEBench: Revisiting Text-instructed Image Editing

【速读】:该论文旨在解决文本引导的医学图像编辑在临床应用中的评估标准缺失与适应性不足问题,其核心挑战在于如何构建一个具有临床相关性的评价框架以支持可靠且有意义的医学图像编辑系统开发。解决方案的关键在于提出MedEBench基准,该基准包含1,182个来源于临床的图像-提示三元组,覆盖13个解剖区域的70项任务,并引入了包括编辑准确性、上下文保留性和视觉质量在内的综合评估框架,同时通过ROI(Region of Interest)掩码和注意力图的IoU分析实现对模型失败模式的系统性解析。

链接: https://arxiv.org/abs/2506.01921
作者: Minghao Liu,Zhitao He,Zhiyuan Fan,Qingyun Wang,Yi R. Fung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-guided image editing has seen rapid progress in natural image domains, but its adaptation to medical imaging remains limited and lacks standardized evaluation. Clinically, such editing holds promise for simulating surgical outcomes, creating personalized teaching materials, and enhancing patient communication. To bridge this gap, we introduce \textbfMedEBench, a comprehensive benchmark for evaluating text-guided medical image editing. It consists of 1,182 clinically sourced image-prompt triplets spanning 70 tasks across 13 anatomical regions. MedEBench offers three key contributions: (1) a clinically relevant evaluation framework covering Editing Accuracy, Contextual Preservation, and Visual Quality, supported by detailed descriptions of expected change and ROI (Region of Interest) masks; (2) a systematic comparison of seven state-of-the-art models, revealing common failure patterns; and (3) a failure analysis protocol based on attention grounding, using IoU between attention maps and ROIs to identify mislocalization. MedEBench provides a solid foundation for developing and evaluating reliable, clinically meaningful medical image editing systems.
zh

[CV-12] Elucidating the representation of images within an unconditional diffusion model denoiser

【速读】:该论文试图解决生成式扩散模型中得分网络(score network)内部机制不明确的问题,特别是如何理解其对图像的表示与计算过程。解决方案的关键在于分析用于去噪的UNet结构,发现其中间块能够将图像分解为稀疏的活跃通道子集,并利用这些通道的空间平均向量作为干净图像的非线性表示。基于此表示,作者提出了一种新的随机图像重构算法,验证了该表示能够恢复由目标图像表示定义的图像样本,并揭示了潜在空间中的欧几里得距离与条件密度及语义相似性之间的对应关系。

链接: https://arxiv.org/abs/2506.01912
作者: Zahra Kadkhodaie,Stéphane Mallat,Eero Simoncelli
机构: New York University (纽约大学); Flatiron Institute, Simons Foundation (扁平化研究所,西蒙斯基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine a UNet trained for denoising on the ImageNet dataset, to better understand its internal representation and computation of the score. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. We develop a novel algorithm for stochastic reconstruction of images from this representation and demonstrate that it recovers a sample from a set of images defined by a target image representation. We then study the properties of the representation and demonstrate that Euclidean distances in the latent space correspond to distances between conditional densities induced by representations as well as semantic similarities in the image space. Applying a clustering algorithm in the representation space yields groups of images that share both fine details (e.g., specialized features, textured regions, small objects), as well as global structure, but are only partially aligned with object identities. Thus, we show for the first time that a network trained solely on denoising contains a rich and accessible sparse representation of images.
zh

[CV-13] Reinforcement Learning Tuning for VideoLLM s: Reward Design and Data Efficiency

【速读】:该论文旨在解决在复杂语义和长时序依赖背景下,对现实世界视频进行理解的挑战。其解决方案的关键在于利用强化学习调优(Reinforcement Learning Tuning, RLT)作为后训练策略,以增强多模态大语言模型(Multimodal Large Language Models, MLLMs)的视频特定推理能力。具体而言,作者基于Group Relative Policy Optimization (GRPO)框架提出了一种双奖励机制,通过离散和连续奖励信号同时监督语义和时序推理,并引入一种基于重复推理的方差感知数据选择策略,以提升偏好优化的效果。

链接: https://arxiv.org/abs/2506.01908
作者: Hongyu Li,Songhao Han,Yue Liao,Junfeng Luo,Jialin Gao,Shuicheng Yan,Si Liu
机构: BUAA(北京航空航天大学); NUS(新加坡国立大学); Meituan(美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training strategy to enhance the video-specific reasoning capabilities of MLLMs. Built upon the Group Relative Policy Optimization (GRPO) framework, we propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. To facilitate effective preference-based optimization, we introduce a variance-aware data selection strategy based on repeated inference to identify samples that provide informative learning signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Our method consistently outperforms supervised fine-tuning and existing RLT baselines, achieving superior performance with significantly less training data. These results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. Notably, The initial code release (two months ago) has now been expanded with updates, including optimized reward mechanisms and additional datasets. The latest version is available at this https URL .
zh

[CV-14] ShapeLLM -Omni: A Native Multimodal LLM for 3D Generation and Understanding

【速读】:该论文试图解决当前主流多模态大语言模型(如ChatGPT-4o)在3D内容理解与生成方面的不足,其核心问题在于现有模型的多模态能力主要局限于图像和文本,而对3D内容的处理能力较为薄弱。解决方案的关键在于提出一种名为ShapeLLM-Omni的原生3D大语言模型,通过训练一个3D向量量化变分自编码器(3D vector-quantized variational autoencoder, VQVAE)将3D对象映射到离散潜在空间,实现高效的形状表示与重建,并构建了一个大规模连续训练数据集3D-Alpaca,涵盖生成、理解与编辑任务,从而为未来研究提供丰富的资源。

链接: https://arxiv.org/abs/2506.01853
作者: Junliang Ye,Zhengyi Wang,Ruowen Zhao,Shenghao Xie,Jun Zhu
机构: Tsinghua University (清华大学); Peking University (北京大学); ShengShu (盛数)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni-a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, by performing instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: this https URL
zh

[CV-15] MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂场景中对细粒度视觉概念进行精准定位的问题。现有方法在将视觉特征与语言指令对齐时存在局限,导致模型难以准确理解并生成与视觉内容高度相关的响应。论文提出的解决方案关键在于引入MoDA(Modulation Adapter)模块,该模块通过指令引导的调制机制,对预对齐的视觉特征进行细化。MoDA采用基于Transformer的交叉注意力机制生成调制掩码,从而根据语言指令强调语义相关的嵌入维度,提升模型的视觉定位能力和上下文相关性响应效果。

链接: https://arxiv.org/abs/2506.01850
作者: Wayner Barrios,Andrés Villa,Juan León Alcázar,SouYoung Jin,Bernard Ghanem
机构: Dartmouth(达特茅斯学院); KAUST(卡耐基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle to ground fine-grained visual concepts in complex scenes. In this paper, we propose MoDA (Modulation Adapter), a lightweight yet effective module designed to refine pre-aligned visual features through instruction-guided modulation. Our approach follows the standard LLaVA training protocol, consisting of a two-stage process: (1) aligning image features to the LLMs input space via a frozen vision encoder and adapter layers, and (2) refining those features using the MoDA adapter during the instructional tuning stage. MoDA employs a Transformer-based cross-attention mechanism to generate a modulation mask over the aligned visual tokens, thereby emphasizing semantically relevant embedding dimensions based on the language instruction. The modulated features are then passed to the LLM for autoregressive language generation. Our experimental evaluation shows that MoDA improves visual grounding and generates more contextually appropriate responses, demonstrating its effectiveness as a general-purpose enhancement for image-based MLLMs.
zh

[CV-16] GSCodec Studio: A Modular Framework for Gaussian Splat Compression

【速读】:该论文试图解决高存储需求限制3D Gaussian Splatting(GS)在共享、传输和存储中的实际应用问题。其解决方案的关键在于提出GSCodec Studio,一个统一且模块化的框架,集成多种3D/4D GS重建方法和压缩技术,支持灵活组合与全面比较,从而促进静态和动态GS的紧凑表示与压缩方案的发展,实现具有竞争力的率失真性能。

链接: https://arxiv.org/abs/2506.01822
作者: Sicheng Li,Chengzhen Wu,Hao Li,Xiang Gao,Yiyi Liao,Lu Yu
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Repository of the project: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression from different perspectives, these efforts remain scattered across separate repositories, complicating benchmarking and the integration of best practices. To address this gap, we present GSCodec Studio, a unified and modular framework for GS reconstruction, compression, and rendering. The framework incorporates a diverse set of 3D/4D GS reconstruction methods and GS compression techniques as modular components, facilitating flexible combinations and comprehensive comparisons. By integrating best practices from community research and our own explorations, GSCodec Studio supports the development of compact representation and compression solutions for static and dynamic Gaussian Splats, namely our Static and Dynamic GSCodec, achieving competitive rate-distortion performance in static and dynamic GS compression. The code for our framework is publicly available at this https URL , to advance the research on Gaussian Splats compression.
zh

[CV-17] Ridgeformer: Mutli-Stage Contrastive Training For Fine-grained Cross-Domain Fingerprint Recognition

【速读】:该论文旨在解决非接触式指纹识别技术中面临的挑战,包括图像失焦、纹路与谷地对比度降低、手指位置变化以及透视失真等问题,这些问题显著影响了非接触式指纹匹配的准确性和可靠性。解决方案的关键在于提出一种基于多阶段Transformer的非接触式指纹匹配方法,该方法首先捕获全局空间特征,随后在指纹样本间进行局部特征对齐的精细化调整,通过分层特征提取与匹配流程实现跨样本的细粒度对齐,同时保持全局特征表示的鲁棒性。

链接: https://arxiv.org/abs/2506.01806
作者: Shubham Pandey,Bhavin Jawade,Srirangaraj Setlur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE International Conference on Image Processing 2025

点击查看摘要

Abstract:The increasing demand for hygienic and portable biometric systems has underscored the critical need for advancements in contactless fingerprint recognition. Despite its potential, this technology faces notable challenges, including out-of-focus image acquisition, reduced contrast between fingerprint ridges and valleys, variations in finger positioning, and perspective distortion. These factors significantly hinder the accuracy and reliability of contactless fingerprint matching. To address these issues, we propose a novel multi-stage transformer-based contactless fingerprint matching approach that first captures global spatial features and subsequently refines localized feature alignment across fingerprint samples. By employing a hierarchical feature extraction and matching pipeline, our method ensures fine-grained, cross-sample alignment while maintaining the robustness of global feature representation. We perform extensive evaluations on publicly available datasets such as HKPolyU and RidgeBase under different evaluation protocols, such as contactless-to-contact matching and contactless-to-contactless matching and demonstrate that our proposed approach outperforms existing methods, including COTS solutions.
zh

[CV-18] UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment

【速读】:该论文旨在解决从多视角视频中学习具有生动动态和逼真外观的可动画化服装人体虚拟模型的问题,特别是现有方法在高分辨率下无法保留最高细节水平的缺陷。其关键解决方案是提出一种潜在形变模型,并利用基础2D视频点跟踪器的指导来监督可动画角色的3D形变,从而提升对阴影和表面变化的鲁棒性,并减少局部极小值问题。此外,通过引入级联训练策略,将点轨迹锚定在渲染的虚拟角色上,生成一致的3D点轨迹,最终在顶点和纹理像素级别监督虚拟角色,以缓解时间漂移和2D点跟踪器缺乏3D感知的问题。

链接: https://arxiv.org/abs/2506.01802
作者: Heming Zhu,Guoxing Sun,Christian Theobalt,Marc Habermann
机构: Max Planck Institute for Informatics(马克斯·普朗克信息研究所); Saarland Informatics Campus(萨尔兰计算机科学园区); Saarbrücken Research Center for Visual Computing, Interaction and AI(萨尔布吕肯视觉计算、交互与人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: For video results, see this https URL

点击查看摘要

Abstract:Learning an animatable and clothed human avatar model with vivid dynamics and photorealistic appearance from multi-view videos is an important foundational research problem in computer graphics and vision. Fueled by recent advances in implicit representations, the quality of the animatable avatars has achieved an unprecedented level by attaching the implicit representation to drivable human template meshes. However, they usually fail to preserve the highest level of detail, particularly apparent when the virtual camera is zoomed in and when rendering at 4K resolution and higher. We argue that this limitation stems from inaccurate surface tracking, specifically, depth misalignment and surface drift between character geometry and the ground truth surface, which forces the detailed appearance model to compensate for geometric errors. To address this, we propose a latent deformation model and supervising the 3D deformation of the animatable character using guidance from foundational 2D video point trackers, which offer improved robustness to shading and surface variations, and are less prone to local minima than differentiable rendering. To mitigate the drift over time and lack of 3D awareness of 2D point trackers, we introduce a cascaded training strategy that generates consistent 3D point tracks by anchoring point tracks to the rendered avatar, which ultimately supervises our avatar at the vertex and texel level. To validate the effectiveness of our approach, we introduce a novel dataset comprising five multi-view video sequences, each over 10 minutes in duration, captured using 40 calibrated 6K-resolution cameras, featuring subjects dressed in clothing with challenging texture patterns and wrinkle deformations. Our approach demonstrates significantly improved performance in rendering quality and geometric accuracy over the prior state of the art.
zh

[CV-19] OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation

【速读】:该论文旨在解决现有视频生成模型在多场景下缺乏多样化生成与编辑能力的问题,即大多数模型仅适用于单一任务,无法通过动态内容操作实现跨场景的视频生成与编辑。其解决方案的关键在于提出OmniV2V模型,该模型通过统一的动态内容操作注入模块,有效整合了多种视频生成与编辑任务的需求,并结合基于LLaVA的视觉-文本指令模块,提升了模型对视觉内容与指令之间对应关系的理解能力。此外,构建的多任务数据处理系统能够高效进行数据增强,从而支持多类型、多场景的数据集构建与评估。

链接: https://arxiv.org/abs/2506.01801
作者: Sen Liang,Zhentao Yu,Zhengguang Zhou,Teng Hu,Hongmei Wang,Yi Chen,Qin Lin,Yuan Zhou,Xin Li,Qinglin Lu,Zhibo Chen
机构: Tencent Hunyuan(腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions. Furthermore, we build a comprehensive multi-task data processing system. Since there is data overlap among various tasks, this system can efficiently provide data augmentation. Using this system, we construct a multi-type, multi-scenario OmniV2V dataset and its corresponding OmniV2V-Test benchmark. Extensive experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.
zh

[CV-20] WorldExplorer: Towards Generating Fully Navigable 3D Scenes

【速读】:该论文试图解决从文本生成高质量、可导航的3D世界的问题,现有方法在场景内部探索时受限,导致在偏离中心或全景视角时产生拉伸和噪声伪影。其解决方案的关键在于提出WorldExplorer,该方法基于自回归视频轨迹生成,通过多视角一致性图像初始化360度全景场景,并利用视频扩散模型在迭代场景生成流程中扩展场景。关键创新包括场景记忆机制,用于根据最相关的历史视图条件化视频生成,以及碰撞检测机制以避免进入物体内部,最终通过3D高斯点云优化将所有生成视图融合为统一的3D表示。

链接: https://arxiv.org/abs/2506.01799
作者: Manuel-Andreas Schneider,Lukas Höllein,Matthias Nießner
机构: Technical University of Munich(慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: see this https URL , video: see this https URL

点击查看摘要

Abstract:Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside of a scene, i.e., produce streched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories, that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
zh

[CV-21] R2SM: Referring and Reasoning for Selective Masks

【速读】:该论文试图解决的是在文本引导的图像分割任务中,如何根据用户意图区分生成可见(modal)或完整(amodal)分割掩码的问题。其解决方案的关键在于提出一种新的任务——Referring and Reasoning for Selective Masks (R2SM),并通过构建R2SM数据集,包含模态和非模态文本查询及其对应的真值掩码,使模型能够根据自然语言提示判断是否需要生成对象的可见部分或完整形状(包括被遮挡区域),从而实现更符合用户意图的分割结果。

链接: https://arxiv.org/abs/2506.01795
作者: Yu-Lin Shih,Wei-En Tai,Cheng Sun,Yu-Chiang Frank Wang,Hwann-Tzong Chen
机构: National Tsing Hua University (国立清华大学); NVIDIA (NVIDIA); National Taiwan University (国立台湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a new task, Referring and Reasoning for Selective Masks (R2SM), which extends text-guided segmentation by incorporating mask-type selection driven by user intent. This task challenges vision-language models to determine whether to generate a modal (visible) or amodal (complete) segmentation mask based solely on natural language prompts. To support the R2SM task, we present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA. The R2SM dataset consists of both modal and amodal text queries, each paired with the corresponding ground-truth mask, enabling model finetuning and evaluation for the ability to segment images as per user intent. Specifically, the task requires the model to interpret whether a given prompt refers to only the visible part of an object or to its complete shape, including occluded regions, and then produce the appropriate segmentation. For example, if a prompt explicitly requests the whole shape of a partially hidden object, the model is expected to output an amodal mask that completes the occluded parts. In contrast, prompts without explicit mention of hidden regions should generate standard modal masks. The R2SM benchmark provides a challenging and insightful testbed for advancing research in multimodal reasoning and intent-aware segmentation.
zh

[CV-22] FaceCoT: A Benchmark Dataset for Face Anti-Spoofing with Chain-of-Thought Reasoning

【速读】:该论文旨在解决人脸反欺骗(Face Anti-Spoofing, FAS)在面对多种呈现攻击时,因依赖单一视觉模态而导致的泛化能力有限的问题。其解决方案的关键在于引入面向FAS的大型多模态视觉问答(Visual Question Answering, VQA)数据集FaceCoT,并结合思维链增强的渐进式学习策略(CoT-Enhanced Progressive Learning, CEPL),通过融合视觉与语言的联合推理来提升模型的鲁棒性与可解释性。

链接: https://arxiv.org/abs/2506.01783
作者: Honglu Zhang,Zhiqin Fang,Ningning Zhao,Saihui Hou,Long Ma,Renwang Pei,Zhaofeng He
机构: Didi Chuxing(滴滴出行); Beijing University of Posts and Telecommunications(北京邮电大学); Beijing Normal University(北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image-text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision-language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.
zh

[CV-23] unMORE: Unsupervised Multi-Object Segmentation via Center-Boundary Reasoning ICML2025

【速读】:该论文试图解决在单张图像上进行无监督多目标分割的挑战性问题(unsupervised multi-object segmentation)。现有方法通常依赖于图像重建目标来学习对象性或利用预训练图像特征对相似像素进行分组,但在分割简单的合成物体或发现有限数量的真实物体时表现较好。本文提出了一种名为unMORE的新颖两阶段流程,其关键在于在第一阶段显式学习三个精心定义的对象中心表示(object-centric representations),随后在第二阶段利用这些学习到的对象先验进行多目标推理,该推理模块完全无需网络结构且不依赖人工标注。

链接: https://arxiv.org/abs/2506.01778
作者: Yafei Yang,Zihui Zhang,Bo Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICML 2025. Code and data are available at: this https URL

点击查看摘要

Abstract:We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.
zh

[CV-24] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

【速读】:该论文试图解决现有视觉生成与操作模型在任务泛化能力上的不足以及训练高质量文本到视频(T2V)基础模型所需大量标注数据成本过高的问题。其解决方案的关键在于提出一个统一框架——“many-for-many”,通过利用多种不同视觉生成与操作任务的训练数据,训练一个能够执行多个任务的单一模型。该框架设计了一个轻量级适配器以统一不同任务的条件,并采用联合图像-视频学习策略从零开始逐步训练模型,从而实现性能提升的统一视觉生成与操作模型。此外,引入深度图作为条件以增强模型对三维空间的感知能力。

链接: https://arxiv.org/abs/2506.01758
作者: Tao Yang,Ruibin Li,Yangming Shi,Yuqi Zhang,Qide Dong,Haoran Cheng,Weiguo Feng,Shilei Wen,Bingyue Peng,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially, text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source codes are available at this https URL.
zh

[CV-25] Efficient Egocentric Action Recognition with Multimodal Data

【速读】:该论文旨在解决在可穿戴扩展现实(XR)设备上实现高效、实时的自我中心动作识别(Egocentric Action Recognition, EAR)的问题,其核心挑战在于设备在便携性、电池寿命和计算资源之间的固有权衡。论文提出的解决方案关键在于系统分析不同输入模态(RGB视频与3D手部姿态)的采样频率对动作识别性能和CPU使用的影响,并通过探索多种配置,揭示了在降低RGB帧采样率的同时结合高频3D手部姿态输入,能够在保持高识别准确率的前提下显著降低CPU负载,从而实现更高效的实时EAR系统。

链接: https://arxiv.org/abs/2506.01757
作者: Marco Calzavara,Ard Kastrati,Matteo Macchini,Dushan Vasilevski,Roger Wattenhofer
机构: ETH Zurich(苏黎世联邦理工学院); Magic Leap(魔术手)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as an extended abstract at the Second Joint Egocentric Vision (EgoVis) Workshop, 2025

点击查看摘要

Abstract:The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situation awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities - RGB video and 3D hand pose - on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
zh

[CV-26] STORM: Benchmarking Visual Rating of MLLM s with a Comprehensive Ordinal Regression Dataset NIPS2025

【速读】:该论文旨在解决多模态大语言模型(MLLMs)在视觉评分能力上的不足,特别是在有序回归(OR)任务中的表现不佳问题,同时针对相关数据集和基准的缺乏进行补充。其解决方案的关键在于构建了一个名为STORM的数据集和基准,涵盖五个常见的视觉评分领域,包含655K张图像对及其对应的精心策划的视觉问答(VQA)数据,并提出了一种从粗到细的处理流程,动态考虑标签候选并提供可解释的思维过程,从而为MLLMs提供一种通用且可信的有序推理范式。

链接: https://arxiv.org/abs/2506.01738
作者: Jinhong Wang,Shuo Tong,Jian liu,Dongqi Tang,Jintai Chen,Haochao Ying,Hongxia Xu,Danny Chen,Jian Wu
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团); HKUST (Guangzhou) (香港科技大学(广州)); University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: underreview of NIPS2025 DB track

点击查看摘要

Abstract:Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) under-perform in such visual rating ability while also suffering the lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, providing MLLMs with a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, and pre-trained models are available on the following webpage to support further research in this area. Datasets and codes are released on the project page: this https URL.
zh

[CV-27] VideoCap-R1: Enhancing MLLM s for Video Captioning via Structured Thinking

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频字幕生成任务中描述动作能力不足的问题。其解决方案的关键在于提出一种基于GRPO(Generalized Reward-based Policy Optimization)的强化学习后训练方法,通过引入结构化思维过程和两个专门设计的奖励机制:一个不依赖大语言模型(LLM-free)的思考评分器用于评估结构化思维质量,另一个是依赖大语言模型(LLM-assisted)的字幕评分器用于评估输出质量,从而有效建立结构化推理与全面描述生成之间的联系。

链接: https://arxiv.org/abs/2506.01725
作者: Desen Meng,Rui Huang,Zhilin Dai,Xinhao Li,Yifan Xu,Jun Zhang,Zhenpeng Huang,Meng Zhang,Lingshu Zhang,Yi Liu,Limin Wang
机构: Nanjing University (南京大学); Shanghai AI Laboratory (上海人工智能实验室); Honor Device Co., Ltd (荣耀终端有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While recent advances in reinforcement learning have significantly enhanced reasoning capabilities in large language models (LLMs), these techniques remain underexplored in multi-modal LLMs for video captioning. This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs, with the goal of enhancing video MLLMs’ capability of describing actions in videos. Specifically, we develop the VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects with their attributes and actions before generating complete captions, supported by two specialized reward mechanisms: a LLM-free think scorer evaluating the structured thinking quality and a LLM-assisted caption scorer assessing the output quality. The RL training framework effectively establishes the connection between structured reasoning and comprehensive description generation, enabling the model to produce captions with more accurate actions. Our experiments demonstrate that VideoCap-R1 achieves substantial improvements over the Qwen2VL-7B baseline using limited samples (1.5k) across multiple video caption benchmarks (DREAM1K: +4.4 event F1, VDC: +4.2 Acc, CAREBENCH: +3.1 action F1, +6.9 object F1) while consistently outperforming the SFT-trained counterparts, confirming GRPO’s superiority in enhancing MLLMs’ captioning capabilities.
zh

[CV-28] Active Learning via Vision-Language Model Adaptation with Open Data

【速读】:该论文旨在解决在有限标注数据情况下,如何高效提升视觉语言模型(Visual Language Models, VLMs)在下游任务中的性能问题。其关键解决方案是引入基于开放资源的主动学习(Active Learning with Open Resources, ALOR),通过检索与任务相关的公开数据来增强任务特定数据,并结合对比微调(Contrastive Tuning, CT)方法,显著提升了模型性能。此外,论文还提出了一种简单有效的Tail First Sampling(TFS)策略,优先选择低频类样本进行标注,以缓解数据分布不均衡带来的偏差问题。

链接: https://arxiv.org/abs/2506.01724
作者: Tong Wang,Jiaqi Wang,Shu Kong
机构: University of Macau (澳门大学); Shanghai AI Lab (上海人工智能实验室); Institute of Collaborative Innovation (协同创新研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Here is the project webpage: this https URL

点击查看摘要

Abstract:Pretrained on web-scale open data, VLMs offer powerful capabilities for solving downstream tasks after being adapted to task-specific labeled data. Yet, data labeling can be expensive and may demand domain expertise. Active Learning (AL) aims to reduce this expense by strategically selecting the most informative data for labeling and model training. Recent AL methods have explored VLMs but have not leveraged publicly available open data, such as VLM’s pretraining data. In this work, we leverage such data by retrieving task-relevant examples to augment the task-specific examples. As expected, incorporating them significantly improves AL. Given that our method exploits open-source VLM and open data, we refer to it as Active Learning with Open Resources (ALOR). Additionally, most VLM-based AL methods use prompt tuning (PT) for model adaptation, likely due to its ability to directly utilize pretrained parameters and the assumption that doing so reduces the risk of overfitting to limited labeled data. We rigorously compare popular adaptation approaches, including linear probing (LP), finetuning (FT), and contrastive tuning (CT). We reveal two key findings: (1) All adaptation approaches benefit from incorporating retrieved data, and (2) CT resoundingly outperforms other approaches across AL methods. Further analysis of retrieved data reveals a naturally imbalanced distribution of task-relevant classes, exposing inherent biases within the VLM. This motivates our novel Tail First Sampling (TFS) strategy for AL, an embarrassingly simple yet effective method that prioritizes sampling data from underrepresented classes to label. Extensive experiments demonstrate that our final method, contrastively finetuning VLM on both retrieved and TFS-selected labeled data, significantly outperforms existing methods.
zh

[CV-29] Data Pruning by Information Maximization ICLR2025

【速读】:该论文旨在解决数据剪枝(data pruning)问题,即从大规模数据集中选择最具信息量的样本子集,以减少冗余并提升模型训练效率。解决方案的关键在于提出一种名为InfoMax的方法,该方法通过最大化所选样本的信息含量并最小化冗余来优化核心集(coreset)的代表性。其核心思想是利用样本的重要性得分衡量单个样本的信息价值,并通过成对样本相似性量化冗余,最终将问题形式化为一个离散二次规划(DQP)任务,以最大化总信息量。为实现可扩展性,InfoMax结合了基于梯度的求解器、相似性矩阵的稀疏化技术和数据集划分策略,从而有效处理大规模数据集。

链接: https://arxiv.org/abs/2506.01701
作者: Haoru Tan,Sitong Wu,Wei Huang,Shizhen Zhao,Xiaojuan Qi
机构: The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2025

点击查看摘要

Abstract:In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.
zh

[CV-30] SteerPose: Simultaneous Extrinsic Camera Calibration and Matching from Articulation

【速读】:该论文试图解决在多相机系统中,如何利用自由移动的人类或动物自身作为校准目标,同时估计其在不同视角下的对应关系的问题。解决方案的关键在于提出一种名为SteerPose的神经网络,该网络能够将2D姿态旋转到另一视角,并通过可微分匹配在统一框架内同时完成外参标定和对应关系搜索。此外,引入了一种新的几何一致性损失,以确保估计的旋转和对应关系能够产生有效的平移估计。

链接: https://arxiv.org/abs/2506.01691
作者: Sang-Eun Lee,Ko Nishino,Shohei Nobuhara
机构: Kyoto University(京都大学); Kyoto Institute of Technology(京都工艺纤维大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Can freely moving humans or animals themselves serve as calibration targets for multi-camera systems while simultaneously estimating their correspondences across views? We humans can solve this problem by mentally rotating the observed 2D poses and aligning them with those in the target views. Inspired by this cognitive ability, we propose SteerPose, a neural network that performs this rotation of 2D poses into another view. By integrating differentiable matching, SteerPose simultaneously performs extrinsic camera calibration and correspondence search within a single unified framework. We also introduce a novel geometric consistency loss that explicitly ensures that the estimated rotation and correspondences result in a valid translation estimation. Experimental results on diverse in-the-wild datasets of humans and animals validate the effectiveness and robustness of the proposed method. Furthermore, we demonstrate that our method can reconstruct the 3D poses of novel animals in multi-camera setups by leveraging off-the-shelf 2D pose estimators and our class-agnostic model.
zh

[CV-31] MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLM s

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视频运动理解方面的性能不足问题,特别是其缺乏帧间差异分析和对细微视觉线索的忽略。解决方案的关键在于提出MotionSight,这是一种无需训练的零样本方法,通过引入以物体为中心的视觉焦点和运动模糊作为视觉提示,有效提升模型对细粒度运动的理解能力。

链接: https://arxiv.org/abs/2506.01674
作者: Yipeng Du,Tiehan Fan,Kepan Nan,Rui Xie,Penghao Zhou,Xiang Li,Jian Yang,Zhenheng Yang,Ying Tai
机构: Nanjing University (南京大学); ByteDance (字节跳动); Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to video’s temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether inherent capability can be unlocked and boost MLLMs’ motion perception and enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, \Theta(40K) video clips and \Theta(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All the code and annotations will be publicly available.
zh

[CV-32] EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在全面理解地球观测(Earth Observation, EO)数据方面的不足,这一问题对于监测环境及人类活动的影响至关重要。论文提出的解决方案是EarthMind,其关键在于两个核心组件:一是空间注意力提示(Spatial Attention Prompting, SAP),通过重新分配大语言模型(LLM)中的注意力以增强像素级理解;二是跨模态融合,通过将异构模态对齐到共享空间,并根据信息密度自适应地重新加权标记,实现有效的融合。

链接: https://arxiv.org/abs/2506.01667
作者: Yan Shu,Bin Ren,Zhitong Xiong,Danda Pani Paudel,Luc Van Gool,Begum Demir,Nicu Sebe,Paolo Rota
机构: University of Trento (特伦托大学); Technische Universität Berlin (柏林工业大学); Technical University of Munich (慕尼黑工业大学); University of Pisa (比萨大学); INSAIT, Sofia University “St. Kliment Ohridski” (索非亚大学“圣克莱门特·奥赫里德斯基”INSAIT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.
zh

[CV-33] Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLM)在处理高分辨率图像时难以准确解析细粒度细节的问题,而这些细节对于复杂的视觉理解至关重要。解决方案的关键在于提出一种无需训练的新型方法——Zoom-Refine,其核心机制包括“局部放大”(Localized Zoom)和“自我精炼”(Self-Refinement)两个协同步骤。在“局部放大”阶段,Zoom-Refine利用MLLM生成初步响应并预测任务相关图像区域的边界框坐标;在“自我精炼”阶段,该方法将高分辨率裁剪图像中的细粒度信息与初始推理结果结合,以重新评估并优化初步响应。该方法充分利用了MLLM内在的空间定位、上下文推理和对比分析能力,无需额外训练或外部专家干预。

链接: https://arxiv.org/abs/2506.01663
作者: Xuan Yu,Dayan Guan,Michael Ying Yang,Yanfeng Gu
机构: Harbin Institute of Technology (哈尔滨工业大学); University of Bath (巴斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of \textitLocalized Zoom and \textitSelf-Refinement. In the \textitLocalized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the \textitSelf-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by \textitLocalized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM’s inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at \hrefthis https URL\colormagentathis http URL
zh

[CV-34] Visual Explanation via Similar Feature Activation for Metric Learning

【速读】:该论文试图解决在度量学习模型中缺乏全连接层作为分类器,导致传统类激活图(Class Activation Map, CAM)及其变体(如Grad-CAM和Relevance-CAM)无法直接应用的问题。解决方案的关键在于提出一种新的可视化解释方法——相似特征激活图(Similar Feature Activation Map, SFAM),该方法通过通道级贡献重要性分数(Channel-wise Contribution Importance Score, CIS)来衡量特征重要性,该分数基于两个图像嵌入之间的相似性计算得出,并通过线性组合将重要性权重与卷积神经网络(CNN)的特征图结合,从而生成可解释的可视化结果。

链接: https://arxiv.org/abs/2506.01636
作者: Yi Liao,Ugochukwu Ejike Akpudo,Jue Zhang,Yongsheng Gao,Jun Zhou,Wenyi Zeng,Weichuan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual explanation maps enhance the trustworthiness of decisions made by deep learning models and offer valuable guidance for developing new algorithms in image recognition tasks. Class activation maps (CAM) and their variants (e.g., Grad-CAM and Relevance-CAM) have been extensively employed to explore the interpretability of softmax-based convolutional neural networks, which require a fully connected layer as the classifier for decision-making. However, these methods cannot be directly applied to metric learning models, as such models lack a fully connected layer functioning as a classifier. To address this limitation, we propose a novel visual explanation method termed Similar Feature Activation Map (SFAM). This method introduces the channel-wise contribution importance score (CIS) to measure feature importance, derived from the similarity measurement between two image embeddings. The explanation map is constructed by linearly combining the proposed importance weights with the feature map from a CNN model. Quantitative and qualitative experiments show that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity as the similarity metric.
zh

[CV-35] EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models

【速读】:该论文旨在解决人类行为理解与建模的问题,特别是在复杂任务环境中的行为分析。其解决方案的关键在于构建一个高质量的多模态数据集——EPFL-Smart-Kitchen-30,该数据集通过非侵入式运动捕捉平台采集了厨房环境中16名受试者烹饪四种不同食谱的29.7小时数据,包含同步的外视角、第一视角、深度信息、惯性测量单元(IMUs)、眼动追踪、身体和手部运动学等多源数据,并对动作序列进行了密集标注。基于此数据集,研究者提出了四个基准测试以推动行为理解与建模的发展。

链接: https://arxiv.org/abs/2506.01608
作者: Andy Bonnetto,Haozhe Qi,Franklin Leong,Matea Tashkovska,Mahdi Rad,Solaiman Shokur,Friedhelm Hummel,Silvestro Micera,Marc Pollefeys,Alexander Mathis
机构: École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院); Microsoft(微软); Scuola Superiore Sant’Anna (Pisa)(圣安娜高等研究院(比萨)); Swiss Federal Institute of Technology Valais (EPFL Valais)(瓦莱州瑞士联邦理工学院(EPFL Valais)); University of Geneva Medical School (日内瓦大学医学院); Eidgenössische Technische Hochschule (ETH)(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)
备注: Code and data at: this https URL

点击查看摘要

Abstract:Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens~2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at this https URL
zh

[CV-36] WoMAP: World Models For Embodied Open-Vocabulary Object Localization

【速读】:该论文旨在解决机器人在部分可观测环境中进行语言指导的主动目标定位(language-instructed active object localization)问题,现有方法要么在演示数据集之外泛化能力不足(如模仿学习方法),要么无法生成物理上合理的动作(如视觉-语言模型)。其解决方案的关键在于提出WoMAP(World Models for Active Perception),该方法通过基于高斯点云的现实-仿真-现实数据生成流水线实现无需专家示范的可扩展数据生成,从开放词汇目标检测器中蒸馏密集奖励信号,并利用潜在世界模型进行动态和奖励预测,以在推理时对高层动作提案进行物理 grounding。

链接: https://arxiv.org/abs/2506.01600
作者: Tenny Yin,Zhiting Mei,Tao Sun,Lihan Zha,Emily Zhou,Jeremy Bao,Miyu Yamane,Ola Shorinwa,Anirudha Majumdar
机构: Princeton University (普林斯顿大学); McGill University (麦吉尔大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense rewards signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and rewards prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP’s superior performance in a broad range of zero-shot object localization tasks, with more than 9x and 2x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.
zh

[CV-37] Silence is Golden: Leverag ing Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation CVPR2025

【速读】:该论文旨在解决基于潜在扩散模型(Latent Diffusion Models, LDM)的说话头动画技术所带来的隐私泄露问题,特别是针对参考肖像在生成式AI (Generative AI) 驱动的图像到视频动画中的滥用风险。现有防护方法通过向肖像添加扰动来对抗LDM模型,但其效果有限,无法有效防止音频信号对图像的操控,并且扩散基净化技术可消除这些保护性扰动。该论文提出的解决方案关键在于Silencer,其包含两个阶段:第一阶段引入消融损失以忽略说话头生成中的音频控制;第二阶段在LDM中应用反净化损失,优化逆向潜在特征以生成鲁棒扰动,从而实现对肖像隐私的主动保护。

链接: https://arxiv.org/abs/2506.01591
作者: Yuan Gan,Jiaxu Miao,Yunze Wang,Yi Yang
机构: Zhejiang University (浙江大学); Sun Yat-sen University (中山大学); University of Wisconsin–Madison (威斯康星大学麦迪逊分校)
类目: Graphics (cs.GR); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply anti-purification loss in LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques. Code: this https URL.
zh

[CV-38] Multi-Modal Dataset Distillation in the Wild

【速读】:该论文旨在解决多模态模型训练中面临的两个关键数据挑战:大规模数据集带来的存储和计算成本过高,以及网络爬取数据中不可避免的噪声(即部分不匹配的模态对)导致模型性能下降。其解决方案的关键在于提出了一种名为MDW(Multi-modal dataset Distillation in the Wild)的框架,该框架通过在蒸馏过程中引入可学习的细粒度对应关系,并自适应优化蒸馏数据以强调具有区分性的对应区域,从而提升蒸馏数据的信息密度和有效性。此外,MDW还通过双轨协同学习机制从真实数据中捕获鲁棒的跨模态对应先验知识,避免数据噪声的影响,实现可验证的噪声容忍度。

链接: https://arxiv.org/abs/2506.01586
作者: Zhuohang Dang,Minnan Luo,Chengyou Jia,Hangwei Qian,Xiaojun Chang,Ivor W. Tsang
机构: Xi’an Jiaotong University (西安交通大学); ASTAR (ASTAR); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent multi-modal models have shown remarkable versatility in real-world applications. However, their rapid development encounters two critical data challenges. First, the training process requires large-scale datasets, leading to substantial storage and computational costs. Second, these data are typically web-crawled with inevitable noise, i.e., partially mismatched pairs, severely degrading model performance. To these ends, we propose Multi-modal dataset Distillation in the Wild, i.e., MDW, the first framework to distill noisy multi-modal datasets into compact clean ones for effective and efficient model training. Specifically, MDW introduces learnable fine-grained correspondences during distillation and adaptively optimizes distilled data to emphasize correspondence-discriminative regions, thereby enhancing distilled data’s information density and efficacy. Moreover, to capture robust cross-modal correspondence prior knowledge from real data, MDW proposes dual-track collaborative learning to avoid the risky data noise, alleviating information loss with certifiable noise tolerance. Extensive experiments validate MDW’s theoretical and empirical efficacy with remarkable scalability, surpassing prior methods by over 15% across various compression ratios, highlighting its appealing practicality for applications with diverse efficacy and resource needs.
zh

[CV-39] FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens

【速读】:该论文旨在解决机器人操作中学习有效视觉-运动策略(visuomotor policy)的挑战,即在生成精确动作的同时保持计算效率。现有方法由于基本动作表示和网络架构的固有局限性而表现不佳。论文的关键解决方案是提出一种新的范式,通过逐步建模分层频率成分来更有效地捕捉运动的结构特性,其中低频成分反映全局运动模式,高频成分编码局部细节,并引入连续潜在表示以提升动作空间的精度与平滑性。

链接: https://arxiv.org/abs/2506.01583
作者: Yiming Zhong,Yumeng Liu,Chuyang Xiao,Zemin Yang,Youzhuo Wang,Yufei Zhu,Ye Shi,Yujing Sun,Xinge Zhu,Yuexin Ma
机构: ShanghaiTech University (上海科技大学); The University of Hong Kong (香港大学); Nanyang Technological University (南洋理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning effective visuomotor policies for robotic manipulation is challenging, as it requires generating precise actions while maintaining computational efficiency. Existing methods remain unsatisfactory due to inherent limitations in the essential action representation and the basic network architectures. We observe that representing actions in the frequency domain captures the structured nature of motion more effectively: low-frequency components reflect global movement patterns, while high-frequency components encode fine local details. Additionally, robotic manipulation tasks of varying complexity demand different levels of modeling precision across these frequency bands. Motivated by this, we propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Extensive experiments across diverse 2D and 3D robotic manipulation benchmarks demonstrate that our approach outperforms existing methods in both accuracy and efficiency, showcasing the potential of a frequency-domain autoregressive framework with continuous tokens for generalized robotic manipulation.
zh

[CV-40] HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

【速读】:该论文旨在解决在计算机图形学与动画中生成高保真全身人体交互(full-body human interactions)的问题,特别是在动态物体和静态场景中的交互。现有方法在人体-物体交互中常忽略场景上下文,导致不合理的穿透现象,而在人体-场景交互中则难以协调细粒度操作与长距离导航。解决方案的关键在于提出HOSIG框架,通过分层场景感知实现全身交互的合成,其核心包括:1)基于场景感知的抓取姿态生成器,通过整合局部几何约束确保无碰撞的全身姿态;2)启发式导航算法,通过压缩2D地板图和双组件空间推理自主规划避障路径;3)场景引导的动力扩散模型,通过引入空间锚点和双空间无分类器指导生成具有手指级精度的轨迹控制全身运动。

链接: https://arxiv.org/abs/2506.01579
作者: Wei Yao,Yunlian Sun,Hongwen Zhang,Yebin Liu,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); Beijing Normal University (北京师范大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: this http URL
zh

[CV-41] SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes CVPR2025

【速读】:该论文旨在解决参考音频-视觉分割(Ref-AVS)任务中由于缺乏第三模态或现有三模态方法在时空一致性上的不足所导致的目标偏移问题。解决方案的关键在于提出一种名为SAM2-LOVE的新框架,该框架将文本、音频和视觉表示整合到一个可学习的token中,以提示和对齐SAM2,从而实现语言辅助音频-视觉场景(LAVS)中的Ref-AVS。关键技术包括多模态融合模块以提升SAM2的多模态理解能力,以及用于增强时空一致性的token传播与累积策略,同时避免遗忘历史信息。

链接: https://arxiv.org/abs/2506.01558
作者: Yuji Wang,Haoran Xu,Yong Liu,Jiaze Li,Yansong Tang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); ZJU (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Reference Audio-Visual Segmentation (Ref-AVS) aims to provide a pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality and the existing triple-modality method struggles with spatio-temporal consistency, leading to the target shift of different frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. Technically, our approach includes a multimodal fusion module aimed at improving multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without forgetting historical information. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5% in \mathcalJ\F on the Ref-AVS benchmark and showcase the simplicity and effectiveness of the components. Our code will be available here.
zh

[CV-42] LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model

【速读】:该论文旨在解决当前驾驶世界模型在长期未来预测中存在严重误差累积的问题,以及由于训练与推理之间的差距导致生成长视频时一致性不足的挑战。其关键解决方案是通过将世界模型学习分层解耦为大范围运动学习和双向连续运动学习,并引入一种简单的蒸馏方法,利用细粒度视频流作为粗粒度流的自监督信号,以提升无限视频生成的连贯性,从而实现长期且时间一致的视频生成。

链接: https://arxiv.org/abs/2506.01546
作者: Xiaodong Wang,Zhirong Wu,Peixi Peng
机构: Peking University (北京大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project homepage: this https URL

点击查看摘要

Abstract:Driving world models are used to simulate futures by video generation based on the condition of the current state and actions. However, current models often suffer serious error accumulations when predicting the long-term future, which limits the practical application. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method where fine-grained video flows are self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term and temporally coherent videos. In the public benchmark NuScenes, compared with the state-of-the-art front-view model, our model improves FVD by 27% and reduces inference time by 85% for the video task of generating 110+ frames. More videos (including 90s duration) are available at this https URL.
zh

[CV-43] G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models

【速读】:该论文试图解决的是不精确分割(Inexact Segmentation, IS)任务,该任务在传统方法中通常依赖于判别模型框架或基于内部注意力机制的密集视觉表示。本文的解决方案关键在于利用Stable Diffusion(SD)中的内在生成先验,通过分析原始图像与掩码条件生成图像之间的模式差异,建立语义对应对齐并更新前景概率,从而实现从粗到细的分割优化。

链接: https://arxiv.org/abs/2506.01539
作者: Tianjiao Zhang,Fei Zhang,Jiangchao Yao,Ya Zhang,Yanfeng Wang
机构: CMIC, Shanghai Jiao Tong University (CMIC,上海交通大学); School of Artificial Intelligence, Shanghai Jiao Tong University (人工智能学院,上海交通大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 12 figures, IEEE International Conference on Multimedia Expo 2025

点击查看摘要

Abstract:This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigms or dense visual representations derived from internal attention mechanisms, our method focuses on the intrinsic generative priors in Stable Diffusion~(SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement by establishing a semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.
zh

[CV-44] Beyond black and white: A more nuanced approach to facial recognition with continuous ethnicity labels

【速读】:该论文试图解决人脸识别模型中的数据偏差问题(data bias),特别是传统方法在处理种族标签时将其视为离散值所导致的不平衡问题。论文提出的解决方案的关键在于将种族标签(ethnicity)重新定义为连续变量,而非每个个体的离散类别,从而更准确地反映数据集的真实分布特性。通过实验和理论验证,作者证明了在连续空间中平衡数据集能够提升模型性能,相较于离散空间平衡的数据集更具优势。

链接: https://arxiv.org/abs/2506.01532
作者: Pedro C. Neto,Naser Damer,Jaime S. Cardoso,Ana F. Sequeira
机构: INESC TEC(INESTEC); FEUP(FEUP); Fraunhofer Institute for Computer Graphics Research IGD(弗劳恩霍夫计算机图形学研究所); Technische Universitat Darmstadt(达姆施塔特工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Bias has been a constant in face recognition models. Over the years, researchers have looked at it from both the model and the data point of view. However, their approach to mitigation of data bias was limited and lacked insight on the real nature of the problem. Here, in this document, we propose to revise our use of ethnicity labels as a continuous variable instead of a discrete value per identity. We validate our formulation both experimentally and theoretically, showcasing that not all identities from one ethnicity contribute equally to the balance of the dataset; thus, having the same number of identities per ethnicity does not represent a balanced dataset. We further show that models trained on datasets balanced in the continuous space consistently outperform models trained on data balanced in the discrete space. We trained more than 65 different models, and created more than 20 subsets of the original datasets.
zh

[CV-45] Speed-up of Vision Transformer Models by Attention-aware Token Filtering

【速读】:该论文旨在解决Vision Transformer (ViT) 模型在图像嵌入提取中计算负担过高的问题。其解决方案的关键在于提出了一种名为Attention-aware Token Filtering (ATF) 的加速方法,该方法通过引入一个令牌过滤模块和一种过滤策略,在不修改或微调Transformer编码器的情况下,动态保留特定对象类型的区域令牌以及静态接收高注意力的区域令牌,从而在保持任务准确性的同时显著提升模型速度。

链接: https://arxiv.org/abs/2506.01519
作者: Takahiro Naruko,Hiroaki Akutsu
机构: Hitachi, Ltd. (日立有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformer (ViT) models have made breakthroughs in image embedding extraction, which provide state-of-the-art performance in tasks such as zero-shot image classification. However, the models suffer from a high computational burden. In this paper, we propose a novel speed-up method for ViT models called Attention-aware Token Filtering (ATF). ATF consists of two main ideas: a novel token filtering module and a filtering strategy. The token filtering module is introduced between a tokenizer and a transformer encoder of the ViT model, without modifying or fine-tuning of the transformer encoder. The module filters out tokens inputted to the encoder so that it keeps tokens in regions of specific object types dynamically and keeps tokens in regions that statically receive high attention in the transformer encoder. This filtering strategy maintains task accuracy while filtering out tokens inputted to the transformer encoder. Evaluation results on retrieval tasks show that ATF provides 2.8\times speed-up to a ViT model, SigLIP, while maintaining the retrieval recall rate.
zh

[CV-46] Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment

【速读】:该论文试图解决扩散模型中对抗样本生成的对齐问题,即如何在保持视觉一致性的同时提升攻击效果,这涉及两个冲突的偏好:视觉一致性和攻击有效性。解决方案的关键在于提出APA(Adversary Preferences Alignment)框架,该框架通过两阶段优化策略解耦冲突偏好,并利用可微分奖励进行优化。第一阶段通过规则基础的相似性奖励微调LoRA以提升视觉一致性,第二阶段则根据替代分类器的反馈,通过轨迹级和步进奖励更新图像潜在表示或提示嵌入,从而实现更有效的对抗攻击。

链接: https://arxiv.org/abs/2506.01511
作者: Kaixun Jiang,Zhaoyu Chen,Haijing Guo,Jinglun Li,Jiyuan Fu,Pinxue Guo,Hao Tang,Bo Li,Wenqiang Zhang
机构: Fudan University (复旦大学); Shanghai Key Lab of Intelligent Information Processing (上海市智能信息处理重点实验室); Peking University (北京大学); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at this https URL.
zh

[CV-47] Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity IJCNN2025

【速读】:该论文旨在解决大规模文本到图像生成对抗网络(GAN)训练成本高且生成多样性不足的问题。现有方法虽然通过引入预训练模型降低了训练成本,但导致了生成结果在给定提示下的多样性显著下降。解决方案的关键在于提出一种名为SCAD的模型,该模型采用两个针对文本到图像任务优化的切片对抗网络(SANs)作为判别器,从而在大幅降低训练成本的同时,显著提升了生成样本的多样性和保真度。

链接: https://arxiv.org/abs/2506.01493
作者: Yuya Kobayashi,Yuhta Takida,Takashi Shibuya,Yuki Mitsufuji
机构: SonyAI(索尼人工智能); Sony Group Corp.(索尼集团株式会社)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at IJCNN 2025

点击查看摘要

Abstract:Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significantly reduced the training cost compared with the other large-scale GANs, but we found the model loses the diversity of generation for a given prompt by a large margin. To build an efficient and high-fidelity text-to-image GAN without compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt with better sample fidelity. We also propose to use a metric called Per-Prompt Diversity (PPD) to evaluate the diversity of text-to-image models quantitatively. SCAD achieved a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.
zh

[CV-48] FDSG: Forecasting Dynamic Scene Graphs

【速读】:该论文旨在解决动态场景图生成(Dynamic Scene Graph Generation)中实体和关系动态难以有效外推的问题,现有方法要么未显式建模时间动态,要么仅预测关系而假设实体标签和位置静态。其解决方案的关键在于提出Forecasting Dynamic Scene Graphs (FDSG)框架,该框架通过查询分解和神经随机微分方程建模实体与关系的动态,并利用时间聚合模块结合预测与观测信息进行预测优化,从而实现对未观测帧的实体标签、边界框及关系的预测。

链接: https://arxiv.org/abs/2506.01487
作者: Yi Yang,Yuren Cong,Hao Cheng,Bodo Rosenhahn,Michael Ying Yang
机构: TNT, Leibniz University Hannover (TNT,汉诺威莱布尼茨大学); ITC, University of Twente (ITC,特文特大学); Visual Computing Group, University of Bath (视觉计算组,巴斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 9 figures, 15 tables

点击查看摘要

Abstract:Dynamic scene graph generation extends scene graph generation from images to videos by modeling entity relationships and their temporal evolution. However, existing methods either generate scene graphs from observed frames without explicitly modeling temporal dynamics, or predict only relationships while assuming static entity labels and locations. These limitations hinder effective extrapolation of both entity and relationship dynamics, restricting video scene understanding. We propose Forecasting Dynamic Scene Graphs (FDSG), a novel framework that predicts future entity labels, bounding boxes, and relationships, for unobserved frames, while also generating scene graphs for observed frames. Our scene graph forecast module leverages query decomposition and neural stochastic differential equations to model entity and relationship dynamics. A temporal aggregation module further refines predictions by integrating forecasted and observed information via cross-attention. To benchmark FDSG, we introduce Scene Graph Forecasting, a new task for full future scene graph prediction. Experiments on Action Genome show that FDSG outperforms state-of-the-art methods on dynamic scene graph generation, scene graph anticipation, and scene graph forecasting. Codes will be released upon publication.
zh

[CV-49] Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉理解与生成能力相互独立的问题,即这两项功能在模型中被视为独立模块,未能有效协同以提升图像生成效果。其解决方案的关键在于提出一种两阶段训练方法:监督微调赋予模型生成真实思维链(Chain of Thought, CoT)的能力,而强化学习通过探索与利用的权衡激活模型的全部潜力,从而实现视觉理解与生成的协同进化,推动图像生成进入迭代内省过程。

链接: https://arxiv.org/abs/2506.01480
作者: Kaihang Pan,Yang Wu,Wendong Bu,Kai Shen,Juncheng Li,Yingting Wang,Yunfei Li,Siliang Tang,Jun Xiao,Fei Wu,Hang Zhao,Yueting Zhuang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 7 figures

点击查看摘要

Abstract:Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: this https URL.
zh

[CV-50] SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition MICCAI2025

【速读】:该论文旨在解决手术视频分析中手术阶段识别的准确性和标注数据稀缺的问题。其解决方案的关键在于提出一种基于视频Transformer的模型,结合了鲁棒的伪标签框架、时间一致性正则化和基于类别原型的对比学习,从而有效利用未标注数据来提升模型性能。通过在RAMIE和Cholec80数据集上的实验验证,该方法在减少标注数据依赖的同时实现了最先进的性能。

链接: https://arxiv.org/abs/2506.01471
作者: Yiping Li,Ronald de Jong,Sahar Nasirihaghighi,Tim Jaspers,Romy van Jaarsveld,Gino Kuiper,Richard van Hillegersberg,Fons van der Sommen,Jelle Ruurda,Marcel Breeuwer,Yasmina Al Khalil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for MICCAI 2025

点击查看摘要

Abstract:Accurate surgical phase recognition is crucial for computer-assisted interventions and surgical video analysis. Annotating long surgical videos is labor-intensive, driving research toward leveraging unlabeled data for strong performance with minimal annotations. Although self-supervised learning has gained popularity by enabling large-scale pretraining followed by fine-tuning on small labeled subsets, semi-supervised approaches remain largely underexplored in the surgical domain. In this work, we propose a video transformer-based model with a robust pseudo-labeling framework. Our method incorporates temporal consistency regularization for unlabeled data and contrastive learning with class prototypes, which leverages both labeled data and pseudo-labels to refine the feature space. Through extensive experiments on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public Cholec80 dataset, we demonstrate the effectiveness of our approach. By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase and obtain comparable results to full supervision while using only 1/4 of the labeled data on Cholec80. Our findings establish a strong benchmark for semi-supervised surgical phase recognition, paving the way for future research in this domain.
zh

[CV-51] Sheep Facial Pain Assessment Under Weighted Graph Neural Networks

【速读】:该论文旨在解决如何准确识别和评估绵羊的疼痛问题,以判断动物健康状况并减少有害情况的发生。其解决方案的关键在于提出一种新型加权图神经网络(WGNN)模型,该模型通过连接检测到的绵羊面部关键点来定义疼痛水平,并构建了一个符合绵羊面部表情量表(SPFES)参数的新绵羊面部关键点数据集。此方法为基于图神经网络(GNN)在绵羊面部关键点数据上检测和测量疼痛水平提供了新的研究基础。

链接: https://arxiv.org/abs/2506.01468
作者: Alam Noor,Luis Almeida,Mohamed Daoudi,Kai Li,Eduardo Tovar
机构: CISTER Research Center, Porto, Portugal; Faculty of Engineering University of Porto, Portugal; Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France; IMT Nord Europe, Institut Mines-Télécom, Univ. Lille, Centre for Digital Systems, F-59000 , Lille, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2025 19th International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:Accurately recognizing and assessing pain in sheep is key to discern animal health and mitigating harmful situations. However, such accuracy is limited by the ability to manage automatic monitoring of pain in those animals. Facial expression scoring is a widely used and useful method to evaluate pain in both humans and other living beings. Researchers also analyzed the facial expressions of sheep to assess their health state and concluded that facial landmark detection and pain level prediction are essential. For this purpose, we propose a novel weighted graph neural network (WGNN) model to link sheep’s detected facial landmarks and define pain levels. Furthermore, we propose a new sheep facial landmarks dataset that adheres to the parameters of the Sheep Facial Expression Scale (SPFES). Currently, there is no comprehensive performance benchmark that specifically evaluates the use of graph neural networks (GNNs) on sheep facial landmark data to detect and measure pain levels. The YOLOv8n detector architecture achieves a mean average precision (mAP) of 59.30% with the sheep facial landmarks dataset, among seven other detection models. The WGNN framework has an accuracy of 92.71% for tracking multiple facial parts expressions with the YOLOv8n lightweight on-board device deployment-capable model.
zh

[CV-52] owards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark

【速读】:该论文旨在解决视频异常检索任务中数据稀缺和隐私限制两大问题。现有数据集由于现实世界异常事件的长尾特性导致数据不足,同时隐私问题阻碍了大规模数据的收集。其解决方案的关键在于引入SVTA(Synthetic Video-Text Anomaly benchmark),这是首个针对跨模态异常检索的大规模数据集,通过生成式AI(Generative AI)模型生成高质量视频与文本配对数据,从而克服数据可用性挑战。

链接: https://arxiv.org/abs/2506.01466
作者: Shuyu Yang,Yilun Wang,Yaxiong Wang,Li Zhu,Zhedong Zheng
机构: Xi’an Jiaotong University (西安交通大学); Hefei University of Technology (合肥工业大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address the aforementioned issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via the off-the-shelf LLM (Large Language Model) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos. Finally, our SVTA involves 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely-used video-text retrieval baselines to comprehensively test our SVTA, revealing SVTA’s challenging nature and its effectiveness in evaluating a robust cross-modal retrieval method. SVTA eliminates privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: [this https URL].
zh

[CV-53] DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion

【速读】:该论文试图解决高帧率(FPS)视频生成中存在的时间不一致问题,如闪烁和长序列中的质量退化,尤其是在快速运动场景中。其解决方案的关键在于提出了一种无需训练的方法——DiffuseSlide,该方法利用预训练扩散模型,通过引入关键帧并结合噪声重新注入和滑动窗口潜在去噪等创新技术,实现平滑且一致的视频输出,从而在不进行额外微调的情况下提升视频质量和时间连贯性。

链接: https://arxiv.org/abs/2506.01454
作者: Geunmin Hwang,Hyun-kyu Ko,Younghyun Kim,Seungryong Lee,Eunbyung Park
机构: RECON Labs Inc. (RECON 实验室); Yonsei University (延世大学); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in diffusion models have revolutionized video generation, enabling the creation of high-quality, temporally consistent videos. However, generating high frame-rate (FPS) videos remains a significant challenge due to issues such as flickering and degradation in long sequences, particularly in fast-motion scenarios. Existing methods often suffer from computational inefficiencies and limitations in maintaining video quality over extended frames. In this paper, we present a novel, training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising, to achieve smooth, consistent video outputs without the need for additional fine-tuning. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity. The proposed method is not only computationally efficient but also adaptable to various video generation tasks, making it ideal for applications such as virtual reality, video games, and high-quality content creation.
zh

[CV-54] A Novel Context-Adaptive Fusion of Shadow and Highlight Regions for Efficient Sonar Image Classification

【速读】:该论文旨在解决水下声呐图像中阴影区域特征利用不足的问题,传统研究多集中于高光区域的分析,而对阴影区域的分类研究较少。其解决方案的关键在于提出一种上下文自适应的声呐图像分类框架,该框架通过先进图像处理技术提取并整合具有区分性的阴影和高光特征,并引入专门针对阴影的分类器与自适应阴影分割方法,从而实现基于主导区域的有效分类。此外,还提出了一个区域感知的去噪模型,通过特征重要性驱动的优化策略提升图像质量与分类可靠性。

链接: https://arxiv.org/abs/2506.01445
作者: Kamal Basha S,Anukul Kiran B,Athira Nambiar,Suresh Rajendran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sonar imaging is fundamental to underwater exploration, with critical applications in defense, navigation, and marine research. Shadow regions, in particular, provide essential cues for object detection and classification, yet existing studies primarily focus on highlight-based analysis, leaving shadow-based classification underexplored. To bridge this gap, we propose a Context-adaptive sonar image classification framework that leverages advanced image processing techniques to extract and integrate discriminative shadow and highlight features. Our framework introduces a novel shadow-specific classifier and adaptive shadow segmentation, enabling effective classification based on the dominant region. This approach ensures optimal feature representation, improving robustness against noise and occlusions. In addition, we introduce a Region-aware denoising model that enhances sonar image quality by preserving critical structural details while suppressing noise. This model incorporates an explainability-driven optimization strategy, ensuring that denoising is guided by feature importance, thereby improving interpretability and classification reliability. Furthermore, we present S3Simulator+, an extended dataset incorporating naval mine scenarios with physics-informed noise specifically tailored for the underwater sonar domain, fostering the development of robust AI models. By combining novel classification strategies with an enhanced dataset, our work addresses key challenges in sonar image analysis, contributing to the advancement of autonomous underwater perception. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2506.01445 [cs.CV] (or arXiv:2506.01445v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.01445 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-55] Variance-Based Defense Against Blended Backdoor Attacks ECML KDD2025

【速读】:该论文旨在解决后门攻击(backdoor attack)在训练阶段对AI模型的隐蔽性威胁,即攻击者通过在训练数据中嵌入特定触发器并修改标签,使模型在正常数据上表现良好,但在包含触发器的输入上产生恶意行为的问题。解决方案的关键在于提出一种新颖的防御方法,该方法通过对给定数据集进行训练,检测被污染的类别,提取攻击触发器的关键部分,并最终识别出被污染的样本,从而增强模型的可解释性并有效抵御后门攻击。

链接: https://arxiv.org/abs/2506.01444
作者: Sujeevan Aseervatham,Achraf Kerzazi,Younès Bennani
机构: Orange Research (橙研究所); LaMSN - La Maison des Sciences Numériques (LaMSN - 数字科学之家); Université Sorbonne Paris Nord (索邦巴黎北大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at ECML PKDD 2025

点击查看摘要

Abstract:Backdoor attacks represent a subtle yet effective class of cyberattacks targeting AI models, primarily due to their stealthy nature. The model behaves normally on clean data but exhibits malicious behavior only when the attacker embeds a specific trigger into the input. This attack is performed during the training phase, where the adversary corrupts a small subset of the training data by embedding a pattern and modifying the labels to a chosen target. The objective is to make the model associate the pattern with the target label while maintaining normal performance on unaltered data. Several defense mechanisms have been proposed to sanitize training data-sets. However, these methods often rely on the availability of a clean dataset to compute statistical anomalies, which may not always be feasible in real-world scenarios where datasets can be unavailable or compromised. To address this limitation, we propose a novel defense method that trains a model on the given dataset, detects poisoned classes, and extracts the critical part of the attack trigger before identifying the poisoned instances. This approach enhances explainability by explicitly revealing the harmful part of the trigger. The effectiveness of our method is demonstrated through experimental evaluations on well-known image datasets and comparative analysis against three state-of-the-art algorithms: SCAn, ABL, and AGPD.
zh

[CV-56] MS-RAFT-3D: A Multi-Scale Architecture for Recurrent Image-Based Scene Flow ICIP2025

【速读】:该论文旨在解决图像基础场景流(image-based scene flow)中多尺度概念尚未被有效应用的问题。其解决方案的关键在于基于单尺度递归场景流主干网络,开发出一种多尺度方法,将光学流中成功的分层思想推广至图像基础场景流任务中,通过合理设计特征和上下文编码器、整体粗到细框架以及训练损失函数,实现了在KITTI和Spring数据集上的性能提升。

链接: https://arxiv.org/abs/2506.01443
作者: Jakob Schmid,Azin Jahedi,Noah Berenguel Senn,Andrés Bruhn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICIP 2025

点击查看摘要

Abstract:Although multi-scale concepts have recently proven useful for recurrent network architectures in the field of optical flow and stereo, they have not been considered for image-based scene flow so far. Hence, based on a single-scale recurrent scene flow backbone, we develop a multi-scale approach that generalizes successful hierarchical ideas from optical flow to image-based scene flow. By considering suitable concepts for the feature and the context encoder, the overall coarse-to-fine framework and the training loss, we succeed to design a scene flow approach that outperforms the current state of the art on KITTI and Spring by 8.7%(3.89 vs. 4.26) and 65.8% (9.13 vs. 26.71), respectively. Our code is available at this https URL.
zh

[CV-57] Semantic Palette-Guided Color Propagation ICME2025

【速读】:该论文试图解决传统颜色传播方法在进行局部颜色编辑时难以实现内容感知的问题,这些问题通常依赖于低层次的视觉线索(如颜色、纹理或明度)来衡量像素相似性,导致无法准确地将颜色调整扩展到具有相似语义的区域。其解决方案的关键在于引入语义调色板(semantic palette),通过最小化设计良好的能量函数来求解编辑后的调色板,并利用该调色板将局部编辑准确传播到具有相似语义的区域,从而实现高效且精确的像素级颜色编辑。

链接: https://arxiv.org/abs/2506.01441
作者: Zi-Yu Zhang,Bing-Feng Seng,Ya-Feng Du,Kang Li,Zhe-Cheng Wang,Zheng-Jun Du
机构: Qinghai University (青海大学); Qinghai Provincial Laboratory for Intelligent Computing and Application (青海省智能计算与应用重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages,5 figures, IEEE ICME 2025

点击查看摘要

Abstract:Color propagation aims to extend local color edits to similar regions across the input image. Conventional approaches often rely on low-level visual cues such as color, texture, or lightness to measure pixel similarity, making it difficult to achieve content-aware color propagation. While some recent approaches attempt to introduce semantic information into color editing, but often lead to unnatural, global color change in color adjustments. To overcome these limitations, we present a semantic palette-guided approach for color propagation. We first extract a semantic palette from an input image. Then, we solve an edited palette by minimizing a well-designed energy function based on user edits. Finally, local edits are accurately propagated to regions that share similar semantics via the solved palette. Our approach enables efficient yet accurate pixel-level color editing and ensures that local color changes are propagated in a content-aware manner. Extensive experiments demonstrated the effectiveness of our method.
zh

[CV-58] DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing

【速读】:该论文旨在解决训练-free图像编辑方法中由于噪声累积导致的重建精度下降问题。传统基于扩散的方法和近期基于修正流(Rectified Flow, RF)的方法在逆向合成轨迹时,通过逐步添加噪声来模拟噪声潜变量,导致误差累积并影响编辑效果。该论文提出的解决方案关键在于Direct Noise Alignment (DNA),其通过直接在噪声域中精炼目标高斯噪声,而非依赖于当前时间步的噪声潜变量近似下一时间步的潜变量,从而显著减少了误差累积。此外,论文还引入了Mobile Velocity Guidance (MVG)以控制目标提示引导的生成过程,实现背景保留与目标对象可编辑性的平衡。

链接: https://arxiv.org/abs/2506.01430
作者: Chenxi Xie,Minghan Li,Shuai Li,Yuhui Wu,Qiaosi Yi,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); OPPO Research Institute (OPPO研究院); Harvard AI and Robotics Lab (哈佛人工智能与机器人实验室); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project URL: this https URL

点击查看摘要

Abstract:Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timesteps, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noises and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Codes and benchmark will be available at \href this https URLthis https URL.
zh

[CV-59] SEMNAV: A Semantic Segmentation-Driven Approach to Visual Semantic Navigation

【速读】:该论文旨在解决视觉语义导航(Visual Semantic Navigation, VSN)中因领域适应问题导致的模型在真实环境中的泛化能力不足问题。现有方法主要依赖虚拟场景中的原始RGB数据进行训练,难以有效迁移到现实环境中。其解决方案的关键在于提出SEMNAV方法,通过引入语义分割作为主要的视觉输入表示,以增强智能体的感知与决策能力,从而提升模型在未见环境中的泛化性能。

链接: https://arxiv.org/abs/2506.01418
作者: Rafael Flor-Rodríguez,Carlos Gutiérrez-Álvarez,Francisco Javier Acevedo-Rodríguez,Sergio Lafuente-Arroyo,Roberto J. López-Sastre
机构: University of Alcalá (阿尔卡拉大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent’s perception and decision-making capabilities. By explicitly incorporating high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce a newly curated dataset, i.e. the SEMNAV dataset, designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. We release SEMNAV dataset, code and trained models at this https URL
zh

[CV-60] ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition ICIP2025

【速读】:该论文旨在解决行人属性识别(Pedestrian Attribute Recognition, PAR)中因属性分布于不同身体区域而导致模型性能下降的问题。传统方法通常受限于固定水平区域的体部分割,难以适应属性出现在变化或非预期位置的情况。解决方案的关键在于提出一种基于视觉与文本属性对齐的多模态提示框架ViTA-PAR,通过引入视觉属性提示捕捉从全局到局部的语义信息,并设计可学习的文本提示模板以增强文本嵌入的上下文表示,最终实现视觉与文本特征的有效对齐与融合。

链接: https://arxiv.org/abs/2506.01411
作者: Minjeong Park,Hongbeen Park,Jinkyu Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ICIP 2025

点击查看摘要

Abstract:The Pedestrian Attribute Recognition (PAR) task aims to identify various detailed attributes of an individual, such as clothing, accessories, and gender. To enhance PAR performance, a model must capture features ranging from coarse-grained global attributes (e.g., for identifying gender) to fine-grained local details (e.g., for recognizing accessories) that may appear in diverse regions. Recent research suggests that body part representation can enhance the model’s robustness and accuracy, but these methods are often restricted to attribute classes within fixed horizontal regions, leading to degraded performance when attributes appear in varying or unexpected body locations. In this paper, we propose Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition, dubbed as ViTA-PAR, to enhance attribute recognition through specialized multimodal prompting and vision-language alignment. We introduce visual attribute prompts that capture global-to-local semantics, enabling diverse attribute representations. To enrich textual embeddings, we design a learnable prompt template, termed person and attribute context prompting, to learn person and attributes context. Finally, we align visual and textual attribute features for effective fusion. ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference. We release our code and model at this https URL.
zh

[CV-61] NTIRE 2025 the 2nd Restore Any Image Model (RAIM) in the Wild Challenge

【速读】:该论文旨在解决真实世界中图像恢复的问题,特别是针对复杂且未知的退化情况,同时兼顾感知质量和保真度。其关键解决方案在于设计了一个包含两个赛道的挑战:第一个赛道是低光联合去噪与去马赛克(JDD)任务,第二个赛道是图像细节增强/生成任务,每个赛道下均设有成对数据与非成对数据的子任务,以全面评估算法的定量性能与主观质量,从而推动图像恢复技术的发展。

链接: https://arxiv.org/abs/2506.01394
作者: Jie Liang,Radu Timofte,Qiaosi Yi,Zhengqiang Zhang,Shuaizheng Liu,Lingchen Sun,Rongyuan Wu,Xindong Zhang,Hui Zeng,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present a comprehensive overview of the NTIRE 2025 challenge on the 2nd Restore Any Image Model (RAIM) in the Wild. This challenge established a new benchmark for real-world image restoration, featuring diverse scenarios with and without reference ground truth. Participants were tasked with restoring real-captured images suffering from complex and unknown degradations, where both perceptual quality and fidelity were critically evaluated. The challenge comprised two tracks: (1) the low-light joint denoising and demosaicing (JDD) task, and (2) the image detail enhancement/generation task. Each track included two sub-tasks. The first sub-task involved paired data with available ground truth, enabling quantitative evaluation. The second sub-task dealt with real-world yet unpaired images, emphasizing restoration efficiency and subjective quality assessed through a comprehensive user study. In total, the challenge attracted nearly 300 registrations, with 51 teams submitting more than 600 results. The top-performing methods advanced the state of the art in image restoration and received unanimous recognition from all 20+ expert judges. The datasets used in Track 1 and Track 2 are available at this https URL and this https URL, respectively. The official challenge pages for Track 1 and Track 2 can be found at this https URL and this https URL.
zh

[CV-62] Sparse Imagination for Efficient Visual World Model Planning

【速读】:该论文旨在解决基于世界模型的规划在复杂环境中因预测准确性需求而导致的计算资源消耗过大问题,这一问题尤其限制了机器人领域中的实时应用。解决方案的关键在于提出一种稀疏想象(Sparse Imagination)方法,通过减少前向预测过程中处理的token数量来提升计算效率,其核心是采用基于Transformer架构并结合随机分组注意力策略的稀疏训练视觉世界模型,使模型能够根据可用计算资源自适应调整处理的token数量,从而在保持高控制精度的同时显著加速规划过程。

链接: https://arxiv.org/abs/2506.01392
作者: Junha Chun,Youngjoon Jeong,Taesup Kim
机构: Seoul National University (首尔国立大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:World model based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices. However, ensuring the prediction accuracy of world models often demands substantial computational resources, posing a major challenge for real-time applications. This computational burden is particularly restrictive in robotics, where resources are severely constrained. To address this limitation, we propose a Sparse Imagination for Efficient Visual World Model Planning, which enhances computational efficiency by reducing the number of tokens processed during forward prediction. Our method leverages a sparsely trained vision-based world model based on transformers with randomized grouped attention strategy, allowing the model to adaptively adjust the number of tokens processed based on the computational resource. By enabling sparse imagination (rollout), our approach significantly accelerates planning while maintaining high control fidelity. Experimental results demonstrate that sparse imagination preserves task performance while dramatically improving inference efficiency, paving the way for the deployment of world models in real-time decision-making scenarios.
zh

[CV-63] Neural shape reconstruction from multiple views with static pattern projection CVPR2025

【速读】:该论文试图解决主动立体三维形貌测量系统中相机与投影仪之间需要精确标定从而限制系统使用便捷性的问题。解决方案的关键在于提出一种基于神经符号距离场(NeuralSDF)的新型体素差分渲染技术,以实现相机和投影仪在运动状态下的相对位姿自动标定,并通过捕捉多张图像来恢复目标物体的形状。

链接: https://arxiv.org/abs/2506.01389
作者: Ryo Furukawa,Kota Nishihara,Hiroshi Kawasaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, CVPR 2025 Workshop on Neural Fields Beyond Conventional Cameras

点击查看摘要

Abstract:Active-stereo-based 3D shape measurement is crucial for various purposes, such as industrial inspection, reverse engineering, and medical systems, due to its strong ability to accurately acquire the shape of textureless objects. Active stereo systems typically consist of a camera and a pattern projector, tightly fixed to each other, and precise calibration between a camera and a projector is required, which in turn decreases the usability of the system. If a camera and a projector can be freely moved during shape scanning process, it will drastically increase the convenience of the usability of the system. To realize it, we propose a technique to recover the shape of the target object by capturing multiple images while both the camera and the projector are in motion, and their relative poses are auto-calibrated by our neural signed-distance-field (NeuralSDF) using novel volumetric differential rendering technique. In the experiment, the proposed method is evaluated by performing 3D reconstruction using both synthetic and real images.
zh

[CV-64] VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding IJCAI2025

【速读】:该论文旨在解决视觉丰富文档理解(Visually Rich Document Understanding, VRDU)中表单类文档的复杂布局、多利益相关方参与及高结构变异性所带来的挑战。解决方案的关键在于通过VRD-IU竞赛引入的两种赛道:Track A侧重于基于实体的关键信息检索,而Track B则专注于从原始文档图像中端到端地定位关键信息。竞赛中采用的方法包括层次分解、基于Transformer的检索、多模态特征融合以及先进的目标检测技术,这些方法为提升文档智能处理能力提供了新的思路和基准。

链接: https://arxiv.org/abs/2506.01388
作者: Yihao Ding,Soyeon Caren Han,Yan Li,Josiah Poon
机构: The University of Melbourne(墨尔本大学); The University of Sydney(悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI 2025 Demonstrations Track

点击查看摘要

Abstract:Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence, enabling automated extraction of key information from complex documents across domains such as medical, financial, and educational applications. However, form-like documents pose unique challenges due to their complex layouts, multi-stakeholder involvement, and high structural variability. Addressing these issues, the VRD-IU Competition was introduced, focusing on extracting and localizing key information from multi-format forms within the Form-NLU dataset, which includes digital, printed, and handwritten documents. This paper presents insights from the competition, which featured two tracks: Track A, emphasizing entity-based key information retrieval, and Track B, targeting end-to-end key information localization from raw document images. With over 20 participating teams, the competition showcased various state-of-the-art methodologies, including hierarchical decomposition, transformer-based retrieval, multimodal feature fusion, and advanced object detection techniques. The top-performing models set new benchmarks in VRDU, providing valuable insights into document intelligence.
zh

[CV-65] Playing with Transformer at 30 FPS via Next-Frame Diffusion

【速读】:该论文旨在解决自回归视频生成模型在实时视频生成中的计算效率与硬件资源利用不足的问题。其关键解决方案是引入两种创新方法:一是将一致性蒸馏(consistency distillation)扩展至视频领域,以减少采样步骤并提升推理效率;二是提出推测采样(speculative sampling),通过利用相邻帧共享相同动作输入的特性,实现并行计算优化,从而提升生成速度。

链接: https://arxiv.org/abs/2506.01380
作者: Xinle Cheng,Tianyu He,Jiayi Xu,Junliang Guo,Di He,Jiang Bian
机构: Peking University (北京大学); Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully leverage parallel computation, motivated by the observation that adjacent frames often share the identical action input, we propose speculative sampling. In this approach, the model generates next few frames using current action input, and discard speculatively generated frames if the input action differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. We, for the first time, achieves autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.
zh

[CV-66] RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes

【速读】:该论文旨在解决雷达数据在高保真三维场景重建中的合成问题,特别是在存在显著雷达噪声(如接收器饱和和多路径反射)的场景下,现有方法表现不佳,且仅能合成预处理后的无噪声雷达图像,无法应对真实雷达数据的合成需求。解决方案的关键在于提出RadarSplat,该方法将高斯点云投射(Gaussian Splatting)与新颖的雷达噪声建模相结合,从而实现更真实的雷达数据合成和增强的三维重建效果。

链接: https://arxiv.org/abs/2506.01379
作者: Pou-Chun Kung,Skanda Harisha,Ram Vasudevan,Aline Eid,Katherine A. Skinner
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-Fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.4 PSNR / 2.6x SSIM) and improved geometric reconstruction (-40% RMSE / 1.5x Accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction. A project page is available at this https URL.
zh

[CV-67] No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond

【速读】:该论文旨在解决多目标跟踪(Multi-object tracking, MOT)在体育分析中的挑战,包括快速运动、遮挡和相机移动等问题。传统基于检测的跟踪方法需要大量调参,而基于分割的方法在轨迹处理上存在困难。论文提出的解决方案是McByte,其关键在于将时间传播的分割掩码作为关联线索,从而提升跟踪鲁棒性,且无需针对每个视频进行调优。McByte不依赖训练,仅使用社区中常用的预训练模型和目标检测器,展现了在体育和通用行人跟踪任务中的强大性能。

链接: https://arxiv.org/abs/2506.01373
作者: Tomasz Stanczyk,Seongro Yoon,Francois Bremond
机构: Inria(法国国家信息与自动化研究所); Université Côte d’Azur(蔚蓝海岸大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022 and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at this https URL.
zh

[CV-68] SVQA-R1: Reinforcing Spatial Reasoning in MLLM s via View-Consistent Reward Optimization

【速读】:该论文旨在解决现有视觉-语言模型(Vision-Language Models, VLMs)在空间推理能力上的不足,特别是在需要理解相对位置、距离和物体配置的Spatial Visual Question Answering (Spatial VQA)任务中。其解决方案的关键在于提出SVQA-R1框架,该框架首次将R1风格的训练方法扩展至空间VQA任务,通过引入一种新颖的分组强化学习策略——Spatial-GRPO,该策略通过扰动物体之间的空间关系(如镜像翻转)构建视图一致的奖励机制,从而促使模型发展出一致且具地面实证的空间理解能力。

链接: https://arxiv.org/abs/2506.01371
作者: Peiyao Wang,Haibin Ling
机构: Stony Brook University (纽约州立大学石溪分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Spatial reasoning remains a critical yet underdeveloped capability in existing vision-language models (VLMs), especially for Spatial Visual Question Answering (Spatial VQA) tasks that require understanding relative positions, distances, and object configurations. Inspired by the R1 paradigm introduced in DeepSeek-R1, which enhances reasoning in language models through rule-based reinforcement learning (RL), we propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects, e.g., mirror flipping, thereby encouraging the model to develop a consistent and grounded understanding of space. Our model, SVQA-R1, not only achieves dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning (SFT) data. Extensive experiments and visualization demonstrate the effectiveness of SVQA-R1 across multiple spatial reasoning benchmarks.
zh

[CV-69] PointT2I: LLM -based text-to-image generation via keypoints

【速读】:该论文试图解决文本到图像(Text-to-image, T2I)生成模型在处理包含复杂概念(尤其是人体姿态)的输入提示时,难以准确生成对应图像的问题。解决方案的关键在于提出PointT2I框架,该框架通过大型语言模型(Large Language Model, LLM)直接生成与人体姿态相关的关键点(Keypoints),并结合文本提示进行图像生成,同时引入基于LLM的反馈系统以提升生成内容与提示之间的语义一致性,从而实现无需微调即可生成准确对齐姿态的图像。

链接: https://arxiv.org/abs/2506.01370
作者: Taekyung Lee,Donggyu Lee,Myungjoo Kang
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) generation model has made significant advancements, resulting in high-quality images aligned with an input prompt. However, despite T2I generation’s ability to generate fine-grained images, it still faces challenges in accurately generating images when the input prompt contains complex concepts, especially human pose. In this paper, we propose PointT2I, a framework that effectively generates images that accurately correspond to the human pose described in the prompt by using a large language model (LLM). PointT2I consists of three components: Keypoint generation, Image generation, and Feedback system. The keypoint generation uses an LLM to directly generate keypoints corresponding to a human pose, solely based on the input prompt, without external references. Subsequently, the image generation produces images based on both the text prompt and the generated keypoints to accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated contents and the given prompts. Our framework is the first approach to leveraging LLM for keypoints-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.
zh

[CV-70] Synthetic Data Augmentation using Pre-trained Diffusion Models for Long-tailed Food Image Classification

【速读】:该论文试图解决长尾分布(long-tailed distribution)下的食品图像分类问题,其中某些食物类别样本数量远多于其他类别,导致模型对尾部类别的识别性能下降。其解决方案的关键在于提出一种两阶段的合成数据增强框架,利用预训练的扩散模型生成具有类内多样性与类间分离性的合成图像,通过正负提示词的联合采样策略提升分类性能。

链接: https://arxiv.org/abs/2506.01368
作者: GaYeon Koh,Hyun-Jic Oh,Jeonghyun Noh,Won-Ki Jeong
机构: Korea University (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Deep learning-based food image classification enables precise identification of food categories, further facilitating accurate nutritional analysis. However, real-world food images often show a skewed distribution, with some food types being more prevalent than others. This class imbalance can be problematic, causing models to favor the majority (head) classes with overall performance degradation for the less common (tail) classes. Recently, synthetic data augmentation using diffusion-based generative models has emerged as a promising solution to address this issue. By generating high-quality synthetic images, these models can help uniformize the data distribution, potentially improving classification performance. However, existing approaches face challenges: fine-tuning-based methods need a uniformly distributed dataset, while pre-trained model-based approaches often overlook inter-class separation in synthetic data. In this paper, we propose a two-stage synthetic data augmentation framework, leveraging pre-trained diffusion models for long-tailed food classification. We generate a reference set conditioned by a positive prompt on the generation target and then select a class that shares similar features with the generation target as a negative prompt. Subsequently, we generate a synthetic augmentation set using positive and negative prompt conditions by a combined sampling strategy that promotes intra-class diversity and inter-class separation. We demonstrate the efficacy of the proposed method on two long-tailed food benchmark datasets, achieving superior performance compared to previous works in terms of top-1 accuracy.
zh

[CV-71] CLIP-driven rain perception: Adaptive deraining with pattern-aware network routing and mask-guided cross-attention

【速读】:该论文旨在解决现有去雨模型在处理不同雨型时的适应性不足问题,因为单一网络难以有效应对多种雨滴密度、条纹方向和降雨强度等变化。其解决方案的关键在于提出一种基于CLIP的雨型感知网络(CLIP-driven rain perception network, CLIP-RPN),通过CLIP的跨模态视觉-语言对齐能力实现语义感知的雨型识别,并结合自适应子网络路由机制,根据检测到的雨型动态激活专用处理分支,从而显著提升模型对多样化降雨条件的处理能力。

链接: https://arxiv.org/abs/2506.01366
作者: Cong Guan,Osamu Yoshie
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing deraining models process all rainy images within a single network. However, different rain patterns have significant variations, which makes it challenging for a single network to handle diverse types of raindrops and streaks. To address this limitation, we propose a novel CLIP-driven rain perception network (CLIP-RPN) that leverages CLIP to automatically perceive rain patterns by computing visual-language matching scores and adaptively routing to sub-networks to handle different rain patterns, such as varying raindrop densities, streak orientations, and rainfall intensity. CLIP-RPN establishes semantic-aware rain pattern recognition through CLIP’s cross-modal visual-language alignment capabilities, enabling automatic identification of precipitation characteristics across different rain scenarios. This rain pattern awareness drives an adaptive subnetwork routing mechanism where specialized processing branches are dynamically activated based on the detected rain type, significantly enhancing the model’s capacity to handle diverse rainfall conditions. Furthermore, within sub-networks of CLIP-RPN, we introduce a mask-guided cross-attention mechanism (MGCA) that predicts precise rain masks at multi-scale to facilitate contextual interactions between rainy regions and clean background areas by cross-attention. We also introduces a dynamic loss scheduling mechanism (DLS) to adaptively adjust the gradients for the optimization process of CLIP-RPN. Compared with the commonly used l_1 or l_2 loss, DLS is more compatible with the inherent dynamics of the network training process, thus achieving enhanced outcomes. Our method achieves state-of-the-art performance across multiple datasets, particularly excelling in complex mixed datasets.
zh

[CV-72] EgoBrain: Synergizing Minds and Eyes For Human Action Understanding

【速读】:该论文旨在解决如何通过多模态融合提升对人类行为和认知的理解,特别是在脑机接口(BCI)领域中实现更精确的行动识别。其关键解决方案是构建了EgoBrain——首个大规模、时间对齐的多模态数据集,同步记录了第一视角视觉与32通道脑电图(EEG)数据,并在此基础上开发了一个多模态学习框架,实现了EEG与视觉信息的融合,从而在跨被试和跨环境挑战中达到了66.70%的动作识别准确率。

链接: https://arxiv.org/abs/2506.01353
作者: Nie Lin,Yansen Wang,Dongqi Han,Weibang Jiang,Jingyuan Li,Ryosuke Furuta,Yoichi Sato,Dongsheng Li
机构: The University of Tokyo (东京大学); Microsoft Research Asia (微软亚洲研究院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 12 figures

点击查看摘要

Abstract:The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models have brought new possibilities that have never been imagined before. Here, we present EgoBrain --the world’s first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a muiltimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interface with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.
zh

[CV-73] arget Driven Adaptive Loss For Infrared Small Target Detection

【速读】:该论文旨在解决红外小目标检测(IRSTD)中局部区域目标检测性能不足以及对小尺度和低局部对比度目标鲁棒性差的问题。解决方案的关键在于提出一种目标驱动自适应(TDA)损失函数,该损失函数引入了基于块的机制和自适应调整策略,以优化尺度和局部对比度,从而引导模型更加关注目标周围的局部区域,特别是小尺度和低对比度的目标。

链接: https://arxiv.org/abs/2506.01349
作者: Yuho Shoji,Takahiro Toizumi,Atsushi Ito
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a target driven adaptive (TDA) loss to enhance the performance of infrared small target detection (IRSTD). Prior works have used loss functions, such as binary cross-entropy loss and IoU loss, to train segmentation models for IRSTD. Minimizing these loss functions guides models to extract pixel-level features or global image context. However, they have two issues: improving detection performance for local regions around the targets and enhancing robustness to small scale and low local contrast. To address these issues, the proposed TDA loss introduces a patch-based mechanism, and an adaptive adjustment strategy to scale and local contrast. The proposed TDA loss leads the model to focus on local regions around the targets and pay particular attention to targets with smaller scales and lower local contrast. We evaluate the proposed method on three datasets for IRSTD. The results demonstrate that the proposed TDA loss achieves better detection performance than existing losses on these datasets.
zh

[CV-74] Rethinking Image Histogram Matching for Image Classification

【速读】:该论文旨在解决在恶劣天气条件下低对比度图像对分类器性能的影响问题。传统方法中,直方图均衡化(Histogram Equalization, HE)作为直方图匹配(Histogram Matching, HM)的一种特殊形式,通过将像素值分布调整为均匀分布来提升图像对比度,但其效果受限于固定的分布形式。本文的关键解决方案是提出一种可微且参数化的直方图匹配方法,该方法通过下游分类器的损失函数优化目标像素值分布,从而在保持图像信息的同时提升分类器在各种恶劣天气条件下的性能。

链接: https://arxiv.org/abs/2506.01346
作者: Rikuto Otsuka,Yuho Shoji,Yuka Ogino,Takahiro Toizumi,Atsushi Ito
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper rethinks image histogram matching (HM) and proposes a differentiable and parametric HM preprocessing for a downstream classifier. Convolutional neural networks have demonstrated remarkable achievements in classification tasks. However, they often exhibit degraded performance on low-contrast images captured under adverse weather conditions. To maintain classifier performance under low-contrast images, histogram equalization (HE) is commonly used. HE is a special case of HM using a uniform distribution as a target pixel value distribution. In this paper, we focus on the shape of the target pixel value distribution. Compared to a uniform distribution, a single, well-designed distribution could have potential to improve the performance of the downstream classifier across various adverse weather conditions. Based on this hypothesis, we propose a differentiable and parametric HM that optimizes the target distribution using the loss function of the downstream classifier. This method addresses pixel value imbalances by transforming input images with arbitrary distributions into a target distribution optimized for the classifier. Our HM is trained on only normal weather images using the classifier. Experimental results show that a classifier trained with our proposed HM outperforms conventional preprocessing methods under adverse weather conditions.
zh

[CV-75] A 2-Stage Model for Vehicle Class and Orientation Detection with Photo-Realistic Image Generation

【速读】:该论文旨在解决通过合成数据训练模型以检测车辆类别和方向的问题,但面临训练数据类别分布不均衡以及模型在真实世界图像中预测效果不佳的挑战。解决方案的关键在于提出一种两阶段检测模型,结合逼真图像生成技术,通过构建包含图像、类别和位置信息的元表,将合成图像转换为真实世界风格并进行融合,最终利用元表中的图像进行车辆类别和方向分类,并结合预提取的位置信息完成检测。

链接: https://arxiv.org/abs/2506.01338
作者: Youngmin Kim,Donghwa Kang,Hyeongboo Baek
机构: Incheon National University(INU) (仁川国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE BigData Conference 2022

点击查看摘要

Abstract:We aim to detect the class and orientation of a vehicle by training a model with synthetic data. However, the distribution of the classes in the training data is imbalanced, and the model trained on the synthetic image is difficult to predict in real-world images. We propose a two-stage detection model with photo-realistic image generation to tackle this issue. Our model mainly takes four steps to detect the class and orientation of the vehicle. (1) It builds a table containing the image, class, and location information of objects in the image, (2) transforms the synthetic images into real-world images style, and merges them into the meta table. (3) Classify vehicle class and orientation using images from the meta-table. (4) Finally, the vehicle class and orientation are detected by combining the pre-extracted location information and the predicted classes. We achieved 4th place in IEEE BigData Challenge 2022 Vehicle class and Orientation Detection (VOD) with our approach.
zh

[CV-76] Ultra-High-Resolution Image Synthesis: Data Method and Evaluation

【速读】:该论文旨在解决超高分辨率图像合成(Ultra-high-resolution image synthesis)中存在的挑战,主要包括缺乏标准化基准和计算资源限制。其解决方案的关键在于构建了Aesthetic-4K数据集,并提出了Diffusion-4K框架。Aesthetic-4K数据集包含高质量的4K图像及其由GPT-4o生成的描述性标题,为研究提供了可靠的数据支持;而Diffusion-4K框架通过引入Scale Consistent Variational Auto-Encoder (SC-VAE)和Wavelet-based Latent Fine-tuning (WLF),实现了高效的视觉标记压缩与细节捕捉,从而支持直接使用真实感4K数据进行训练。此外,该方法适用于多种潜在扩散模型,并结合新型评估指标如GLCM Score和Compression Ratio,提升了对超高分辨率图像合成质量的全面评估能力。

链接: https://arxiv.org/abs/2506.01331
作者: Jinjin Zhang,Qiuyu Huang,Junjie Liu,Xiefan Guo,Di Huang
机构: Beihang University (北京航空航天大学); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (eg, Flux-12B). The source code is publicly available at this https URL.
zh

[CV-77] Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

【速读】:该论文旨在解决基于得分的生成模型在推理阶段的奖励对齐问题,即如何在生成过程中更好地适应特定奖励函数以提升生成质量。现有方法通常从高斯先验初始化粒子,这导致无法有效捕捉与奖励相关的区域,从而降低采样效率。解决方案的关键在于采用基于pCNL(预条件Crank-Nicolson Langevin)算法的初始粒子采样策略,通过从奖励感知后验分布初始化粒子,结合维度鲁棒的提议机制与梯度引导的动力学,实现高效且可扩展的后验采样,从而显著提升多种奖励对齐任务的性能。

链接: https://arxiv.org/abs/2506.01320
作者: Taehoon Yoon,Yunhong Min,Kyeongmin Yeo,Minhyuk Sung
机构: KAIST(韩国科学技术院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce \Psi -Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments.
zh

[CV-78] Learning Sparsity for Effective and Efficient Music Performance Question Answering ACL2025

【速读】:该论文旨在解决音乐表演场景下多模态信息理解与推理中的挑战,特别是针对Music AVQA(Audio-Visual Question Answering)任务中音频-视觉表示融合效率不足的问题。现有方法通常依赖于密集且未优化的表示,导致关键信息提取不高效、冗余信息未有效减少以及关键样本优先级不足。其解决方案的关键在于提出一种名为Sparsify的稀疏学习框架,该框架整合了三种稀疏化策略,并构建了一个端到端的处理流程,从而在保持准确性的前提下显著提升了训练效率,同时通过关键子集选择算法进一步提高了数据效率。

链接: https://arxiv.org/abs/2506.01319
作者: Xingjian Diao,Tianzhen Yang,Chunhui Zhang,Weiyi Wu,Ming Cheng,Jiang Gui
机构: Dartmouth College (达特茅斯学院); Yale University (耶鲁大学)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted to the main conference of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

点击查看摘要

Abstract:Music performances, characterized by dense and continuous audio as well as seamless audio-visual integration, present unique challenges for multimodal scene understanding and reasoning. Recent Music Performance Audio-Visual Question Answering (Music AVQA) datasets have been proposed to reflect these challenges, highlighting the continued need for more effective integration of audio-visual representations in complex question answering. However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. It integrates three sparsification strategies into an end-to-end pipeline and achieves state-of-the-art performance on the Music AVQA datasets. In addition, it reduces training time by 28.32% compared to its fully trained dense counterpart while maintaining accuracy, demonstrating clear efficiency gains. To further improve data efficiency, we propose a key-subset selection algorithm that selects and uses approximately 25% of MUSIC-AVQA v2.0 training data and retains 70-80% of full-data performance across models.
zh

[CV-79] SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost CVPR2025

【速读】:该论文试图解决将可提示图像分割能力扩展到视频领域时面临的挑战,尤其是动态场景中精确且时间一致的掩码传播问题。其解决方案的关键在于提出SAM-I2V方法,通过策略性地升级预训练的Segment Anything Model (SAM),使其支持可提示视频分割(PVS),从而显著降低训练复杂性和资源需求。核心创新包括:基于SAM静态图像编码器的图像到视频特征提取升级模块、用于有效利用历史信息的记忆过滤策略,以及利用对象记忆确保动态场景中时间一致掩码传播的记忆作为提示机制。

链接: https://arxiv.org/abs/2506.01304
作者: Haiyang Mei,Pengyu Zhang,Mike Zheng Shou
机构: Show Lab, National University of Singapore (Show 实验室,新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image-to-video feature extraction upgrader built upon SAM’s static image encoder to enable spatiotemporal video perception, (ii) a memory filtering strategy that selects the most relevant past frames for more effective utilization of historical information, and (iii) a memory-as-prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2’s performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field. Code and model are available at: this https URL.
zh

[CV-80] ReAgent -V: A Reward-Driven Multi-Agent Framework for Video Understanding

【速读】:该论文旨在解决传统视频理解方法在复杂场景下缺乏动态反馈机制、无法自我修正与适应的问题。其解决方案的关键在于提出ReAgent-V框架,该框架通过在推理过程中集成高效的帧选择与实时奖励生成,实现多视角的预测调整和高质量数据的自动筛选,从而提升模型的泛化能力和推理性能。

链接: https://arxiv.org/abs/2506.01300
作者: Yiyang Zhou,Yangfan He,Yaofeng Su,Siwei Han,Joel Jang,Gedas Bertasius,Mohit Bansal,Huaxiu Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 18 figures

点击查看摘要

Abstract:Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model’s capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism-adjusting predictions from conservative, neutral, and aggressive viewpoints-but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications-video understanding, video reasoning enhancement, and vision-language-action model alignment-demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
zh

[CV-81] ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

【速读】:该论文试图解决视频内容理解中因帧选择策略不优化而导致的视觉-语言推理能力受限问题(vision-language reasoning)。现有方法通常依赖静态启发式规则或外部检索模块来提供帧信息,难以有效获取与查询相关的上下文。解决方案的关键在于提出一种基于强化学习的帧级策略优化框架ReFoCUS(Reinforcement-guided Frame Optimization for Contextual UnderStanding),其核心是将优化目标从文本响应转向视觉输入选择,通过参考大型语言模型(LLM)生成的奖励信号,学习能够支持时间定位响应的帧选择策略。该方法采用自回归条件选择架构,在保证时间连贯性的同时降低复杂度,无需帧级显式监督,显著提升了多视频问答基准的推理性能。

链接: https://arxiv.org/abs/2506.01274
作者: Hosu Lee,Junho Kim,Hyunjun Kim,Yong Man Ro
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to provide the query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model’s intrinsic preferences for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach does not require explicit supervision at the frame-level and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.
zh

[CV-82] Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors

【速读】:该论文旨在解决在不进行微调或无需访问大规模标注数据集的情况下,在推理阶段对视觉基础模型进行引导的问题,这在动态或资源受限的场景中尤为具有挑战性。其解决方案的关键在于提出Visual Sparse Steering (VS2),该方法通过使用从Top-k稀疏自编码器(Sparse Autoencoders, SAE)中学习到的稀疏特征生成的引导向量来指导视觉模型,而无需对比数据。此外,还进一步提出了VS2++,通过引入伪标签邻居来增强相关稀疏特征,以及Prototype-Aligned Sparse Steering (PASS),通过在SAE训练过程中引入原型对齐损失,使稀疏特征更贴合下游任务需求,从而提升模型性能。

链接: https://arxiv.org/abs/2506.01247
作者: Gerasimos Chatzoudis,Zhuowei Li,Gemma E. Moran,Hao Wang,Dimitris N. Metaxas
机构: Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Steering vision foundation models at inference time without retraining or access to large labeled datasets is a desirable yet challenging objective, particularly in dynamic or resource-constrained settings. In this paper, we introduce Visual Sparse Steering (VS2), a lightweight, test-time method that guides vision models using steering vectors derived from sparse features learned by top- k Sparse Autoencoders without requiring contrastive data. Specifically, VS2 surpasses zero-shot CLIP by 4.12% on CIFAR-100, 1.08% on CUB-200, and 1.84% on Tiny-ImageNet. We further propose VS2++, a retrieval-augmented variant that selectively amplifies relevant sparse features using pseudo-labeled neighbors at inference time. With oracle positive/negative sets, VS2++ achieves absolute top-1 gains over CLIP zero-shot of up to 21.44% on CIFAR-100, 7.08% on CUB-200, and 20.47% on Tiny-ImageNet. Interestingly, VS2 and VS2++ raise per-class accuracy by up to 25% and 38%, respectively, showing that sparse steering benefits specific classes by disambiguating visually or taxonomically proximate categories rather than providing a uniform boost. Finally, to better align the sparse features learned through the SAE reconstruction task with those relevant for downstream performance, we propose Prototype-Aligned Sparse Steering (PASS). By incorporating a prototype-alignment loss during SAE training, using labels only during training while remaining fully test-time unsupervised, PASS consistently, though modestly, outperforms VS2, achieving a 6.12% gain over VS2 only on CIFAR-100 with ViT-B/32.
zh

[CV-83] Fourier-Modulated Implicit Neural Representation for Multispectral Satellite Image Compression

【速读】:该论文旨在解决多光谱卫星图像在数据压缩与分析中面临的高维度、大数据量以及多通道间空间分辨率差异带来的挑战。其解决方案的关键在于提出ImpliSat框架,该框架利用隐式神经表示(Implicit Neural Representations, INR)将卫星图像建模为坐标空间上的连续函数,从而在不同空间分辨率下捕捉精细的空间细节,并通过傅里叶调制算法动态适应各波段的光谱和空间特性,实现高效压缩同时保持关键图像细节。

链接: https://arxiv.org/abs/2506.01234
作者: Woojin Cho,Steve Andreas Immanuel,Junhyuk Heo,Darongsae Kwon
机构: TelePIX(电信像素)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IGARSS 2025 (Oral)

点击查看摘要

Abstract:Multispectral satellite images play a vital role in agriculture, fisheries, and environmental monitoring. However, their high dimensionality, large data volumes, and diverse spatial resolutions across multiple channels pose significant challenges for data compression and analysis. This paper presents ImpliSat, a unified framework specifically designed to address these challenges through efficient compression and reconstruction of multispectral satellite data. ImpliSat leverages Implicit Neural Representations (INR) to model satellite images as continuous functions over coordinate space, capturing fine spatial details across varying spatial resolutions. Furthermore, we introduce a Fourier modulation algorithm that dynamically adjusts to the spectral and spatial characteristics of each band, ensuring optimal compression while preserving critical image details.
zh

[CV-84] Dirty and Clean-Label attack detection using GAN discriminators

【速读】:该论文试图解决在训练深度计算机视觉模型时,由于从不可靠来源收集图像而导致模型行为可能受到脏标签或干净标签攻击的问题。解决方案的关键在于利用生成式对抗网络(Generative Adversarial Network, GAN)的判别器,对单类图像进行保护,通过训练后的判别器置信度分数设定阈值,以识别错误标注的图像,并在扰动幅度ε达到0.20时能够100%检测出测试中的污染样本。

链接: https://arxiv.org/abs/2506.01224
作者: John Smutny
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages total. Appendix starts on page 10

点击查看摘要

Abstract:Gathering enough images to train a deep computer vision model is a constant challenge. Unfortunately, collecting images from unknown sources can leave your model s behavior at risk of being manipulated by a dirty-label or clean-label attack unless the images are properly inspected. Manually inspecting each image-label pair is impractical and common poison-detection methods that involve re-training your model can be time consuming. This research uses GAN discriminators to protect a single class against mislabeled and different levels of modified images. The effect of said perturbation on a basic convolutional neural network classifier is also included for reference. The results suggest that after training on a single class, GAN discriminator s confidence scores can provide a threshold to identify mislabeled images and identify 100% of the tested poison starting at a perturbation epsilon magnitude of 0.20, after decision threshold calibration using in-class samples. Developers can use this report as a basis to train their own discriminators to protect high valued classes in their CV models.
zh

[CV-85] A Review on Coarse to Fine-Grained Animal Action Recognition

【速读】:该论文试图解决动物行为识别中细粒度(Fine-Grained, FG)动作识别的挑战,特别是在户外环境中由于非刚性身体结构、频繁遮挡以及缺乏大规模标注数据集所带来的困难。其解决方案的关键在于评估和改进适用于动物行为分析的时空深度学习框架(如SlowFast),并强调了与人类动作识别相比,动物行为识别在种内变异性和环境复杂性方面的独特性,从而为提升跨物种行为分析的准确性与泛化能力指明了未来研究方向。

链接: https://arxiv.org/abs/2506.01214
作者: Ali Zia,Renuka Sharma,Abdelwahed Khamis,Xuesong Li,Muhammad Husnain,Numan Shafi,Saeed Anwar,Sabine Schmoelzl,Eric Stone,Lars Petersson,Vivien Rolland
机构: CSIRO(澳大利亚联邦科学与工业研究组织); University of Engineering and Technology, Pakistan(巴基斯坦工程与技术大学); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This review provides an in-depth exploration of the field of animal action recognition, focusing on coarse-grained (CG) and fine-grained (FG) techniques. The primary aim is to examine the current state of research in animal behaviour recognition and to elucidate the unique challenges associated with recognising subtle animal actions in outdoor environments. These challenges differ significantly from those encountered in human action recognition due to factors such as non-rigid body structures, frequent occlusions, and the lack of large-scale, annotated datasets. The review begins by discussing the evolution of human action recognition, a more established field, highlighting how it progressed from broad, coarse actions in controlled settings to the demand for fine-grained recognition in dynamic environments. This shift is particularly relevant for animal action recognition, where behavioural variability and environmental complexity present unique challenges that human-centric models cannot fully address. The review then underscores the critical differences between human and animal action recognition, with an emphasis on high intra-species variability, unstructured datasets, and the natural complexity of animal habitats. Techniques like spatio-temporal deep learning frameworks (e.g., SlowFast) are evaluated for their effectiveness in animal behaviour analysis, along with the limitations of existing datasets. By assessing the strengths and weaknesses of current methodologies and introducing a recently-published dataset, the review outlines future directions for advancing fine-grained action recognition, aiming to improve accuracy and generalisability in behaviour analysis across species.
zh

[CV-86] Self-Supervised Multi-View Representation Learning using Vision-Language Model for 3D/4D Facial Expression Recognition

【速读】:该论文旨在解决面部表情识别(Facial Expression Recognition, FER)任务中的多视角视觉表征学习与自然语言监督融合问题,特别是在3D/4D场景下的微表情识别(Micro-Expression Recognition, MER)。其解决方案的关键在于提出SMILE-VLM模型,该模型通过三个核心组件实现:基于Barlow Twins风格损失的多视角去相关性学习、视觉-语言对比对齐以及跨模态冗余最小化,从而生成鲁棒、语义对齐且视角不变的嵌入表示。

链接: https://arxiv.org/abs/2506.01203
作者: Muzammil Behzad
机构: King Fahd University of Petroleum and Minerals (法赫德国王石油矿产大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial expression recognition (FER) is a fundamental task in affective computing with applications in human-computer interaction, mental health analysis, and behavioral understanding. In this paper, we propose SMILE-VLM, a self-supervised vision-language model for 3D/4D FER that unifies multiview visual representation learning with natural language supervision. SMILE-VLM learns robust, semantically aligned, and view-invariant embeddings by proposing three core components: multiview decorrelation via a Barlow Twins-style loss, vision-language contrastive alignment, and cross-modal redundancy minimization. Our framework achieves the state-of-the-art performance on multiple benchmarks. We further extend SMILE-VLM to the task of 4D micro-expression recognition (MER) to recognize the subtle affective cues. The extensive results demonstrate that SMILE-VLM not only surpasses existing unsupervised methods but also matches or exceeds supervised baselines, offering a scalable and annotation-efficient solution for expressive facial behavior understanding.
zh

[CV-87] Perceptual Inductive Bias Is What You Need Before Contrastive Learning CVPR2025 CVPR

【速读】:该论文试图解决当前对比表示学习框架在视觉任务中因缺乏人类视觉系统的归纳偏置而导致的收敛速度慢、学习捷径及纹理偏差问题。其解决方案的关键在于引入David Marr的多阶段感知理论,即首先利用早期视觉处理阶段的感知构造构建边界和表面层次表示,再进行对象语义的训练,从而提升模型在语义分割、深度估计和目标识别任务中的表现,并增强模型的鲁棒性和分布外泛化能力。

链接: https://arxiv.org/abs/2506.01201
作者: Tianqin Li,Junru Zhao,Dunhan Jiang,Shenghao Wu,Alan Ramirez,Tai Sing Lee
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Tianqin Li and Junru Zhao contributed equally to this work. Due to a formatting error during the CVPR submission, the equal contribution note was omitted in the official proceedings. This arXiv version corrects that oversight. The author order follows alphabetical order by last name

点击查看摘要

Abstract:David Marr’s seminal theory of human perception stipulates that visual processing is a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence speed and learning shortcut resulting in texture bias. In this work, we demonstrate that leveraging Marr’s multi-stage theory-by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics-leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. Together, we propose a pretraining stage before the general contrastive representation pretraining to further enhance the final representation quality and reduce the overall convergence time via inductive bias from human vision systems.
zh

[CV-88] OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation

【速读】:该论文试图解决将自然语言指令和多视角RGBD观测映射为准静态机器人动作的问题,特别是在保持3D感知机器人策略在已知环境中的鲁棒性的同时,提升其对未见过的指令、场景和物体的泛化能力。解决方案的关键在于提出OG-VLA架构,通过结合视觉语言动作模型(Vision Language Action models, VLAs)的泛化能力与3D感知策略的鲁棒性,利用语言和视觉基础模型中的先验知识来增强3D感知关键帧策略的泛化性,同时通过点云投影和规范正交视图渲染实现输入视角不变性和输入输出空间的一致性。

链接: https://arxiv.org/abs/2506.01196
作者: Ishika Singh,Ankit Goyal,Stan Birchfield,Dieter Fox,Animesh Garg,Valts Blukis
机构: University of Southern California (南加州大学); NVIDIA (英伟达); Georgia Institute of Technology (佐治亚理工学院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and multi-view RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA projects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaption in 3 to 5 demonstrations along with strong generalization. Videos and resources at this https URL
zh

[CV-89] SVarM: Linear Support Varifold Machines for Classification and Regression on Geometric Data

【速读】:该论文试图解决在几何深度学习领域中对几何数据(如曲线、图或曲面)进行统计分析的挑战,这一问题主要源于形状空间的非欧几里得性质,即形状空间被定义为在不变性群下的等价类。解决方案的关键在于提出SVarM方法,该方法利用形状的变测度(varifold)表示及其与测试函数 $ h:\mathbb{R}^n \times S^{n-1} \to \mathbb{R} $ 的对偶性,构建了一个类似于线性支持向量机的通用框架,但其操作空间是无限维的变测度空间。通过引入基于神经网络的可训练测试函数 $ h $,该方法在形状数据集上实现了分类和回归模型,表现出强大的性能和鲁棒性,同时显著减少了可训练参数的数量。

链接: https://arxiv.org/abs/2506.01189
作者: Emmanuel Hartman,Nicolas Charon
机构: University of Houston (休斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Differential Geometry (math.DG); Functional Analysis (math.FA)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:Despite progress in the rapidly developing field of geometric deep learning, performing statistical analysis on geometric data–where each observation is a shape such as a curve, graph, or surface–remains challenging due to the non-Euclidean nature of shape spaces, which are defined as equivalence classes under invariance groups. Building machine learning frameworks that incorporate such invariances, notably to shape parametrization, is often crucial to ensure generalizability of the trained models to new observations. This work proposes SVarM to exploit varifold representations of shapes as measures and their duality with test functions h:\mathbbR^n \times S^n-1 \to \mathbbR . This method provides a general framework akin to linear support vector machines but operating instead over the infinite-dimensional space of varifolds. We develop classification and regression models on shape datasets by introducing a neural network-based representation of the trainable test function h . This approach demonstrates strong performance and robustness across various shape graph and surface datasets, achieving results comparable to state-of-the-art methods while significantly reducing the number of trainable parameters.
zh

[CV-90] FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation

【速读】:该论文试图解决文本到视频扩散模型在建模时间特性(如运动、物理和动态交互)方面的局限性。现有方法通过重新训练模型或引入外部条件信号来增强时间一致性,而本文提出了一种无需额外训练或辅助输入的解决方案——\textbf{FlowMo}。其关键在于从预训练模型的预测中直接提取有意义的时间表示,通过测量连续帧潜在表示之间的距离获得去外观偏差的时间表征,并利用时空维度上的块级方差估计运动一致性,在采样过程中动态引导模型降低该方差,从而提升视频生成的运动连贯性。

链接: https://arxiv.org/abs/2506.01144
作者: Ariel Shaulov,Itay Hazan,Lior Wolf,Hila Chefer
机构: Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-video diffusion models are notoriously limited in their ability to model temporal aspects such as motion, physics, and dynamic interactions. Existing approaches address this limitation by retraining the model or introducing external conditioning signals to enforce temporal consistency. In this work, we explore whether a meaningful temporal representation can be extracted directly from the predictions of a pre-trained model without any additional training or auxiliary inputs. We introduce \textbfFlowMo, a novel training-free guidance method that enhances motion coherence using only the model’s own predictions in each diffusion step. FlowMo first derives an appearance-debiased temporal representation by measuring the distance between latents corresponding to consecutive frames. This highlights the implicit temporal structure predicted by the model. It then estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling. Extensive experiments across multiple text-to-video models demonstrate that FlowMo significantly improves motion coherence without sacrificing visual quality or prompt alignment, offering an effective plug-and-play solution for enhancing the temporal fidelity of pre-trained video diffusion models.
zh

[CV-91] ProstaTD: A Large-scale Multi-source Dataset for Structured Surgical Triplet Detection

【速读】:该论文试图解决现有手术三元组检测数据集(如CholecT50)在空间边界框标注不精确、时间标签不一致且缺乏临床依据以及数据来源单一等方面的局限性。解决方案的关键在于引入ProstaTD,这是一个大规模、多机构的手术三元组检测数据集,其来源于技术要求较高的机器人辅助前列腺切除术领域,提供了临床定义的时间边界和高精度的边界框标注,涵盖了60,529帧视频和165,567个标注的三元组实例,具有广泛的手术实践和术中条件多样性。

链接: https://arxiv.org/abs/2506.01130
作者: Yiliang Chen,Zhixi Li,Cheng Xu,Alex Qinyang Liu,Xuemiao Xu,Jeremy Yuen-Chun Teoh,Shengfeng He,Jing Qin
机构: School of Nursing, The Hong Kong Polytechnic University (护理学院,香港理工大学); Nanfang Hospital, Southern Medical University (南方医院,南方医科大学); Department of Surgery, The Chinese University of Hong Kong (外科系,香港中文大学); South China University of Technology (华南理工大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surgical triplet detection has emerged as a pivotal task in surgical video analysis, with significant implications for performance assessment and the training of novice surgeons. However, existing datasets such as CholecT50 exhibit critical limitations: they lack precise spatial bounding box annotations, provide inconsistent and clinically ungrounded temporal labels, and rely on a single data source, which limits model this http URL address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet action. The dataset comprises 60,529 video frames and 165,567 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 50 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. ProstaTD is the largest and most diverse surgical triplet dataset to date, providing a robust foundation for fair benchmarking, the development of reliable surgical AI systems, and scalable tools for procedural training.
zh

[CV-92] MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows

【速读】:该论文旨在解决视频分析中与运动相关的任务所面临的高效且可解释的时序建模问题,例如检测自闭症患者中的异常运动行为或实时MRI中的人类言语发音运动分析。传统方法在捕捉时序动态时通常需要大量计算资源和细粒度标注数据,而这些数据并不容易获取。论文提出的解决方案是MOOSE(Motion Flow Over Spatial Space),其关键在于将光流与空间嵌入相结合,构建一个以时间为中心的视频编码器,从而实现高效的时序信息建模。该方法利用了广泛可用的预训练视觉和光流编码器,而非从头开始训练视频模型,显著降低了计算复杂度并提升了时序可解释性。

链接: https://arxiv.org/abs/2506.01119
作者: Hong Nguyen,Dung Tran,Hieu Hoang,Phong Nguyen,Shrikanth Narayanan
机构: Cranberry-Lemon University (克兰伯里-柠檬大学); University of Southern California (南加州大学); Hanoi University of Science and Technology (河内科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many motion-centric video analysis tasks, such as atomic actions, detecting atypical motor behavior in individuals with autism, or analyzing articulatory motion in real-time MRI of human speech, require efficient and interpretable temporal modeling. Capturing temporal dynamics is a central challenge in video analysis, often requiring significant computational resources and fine-grained annotations that are not widely available. This paper presents MOOSE (Motion Flow Over Spatial Space), a novel temporally-centric video encoder explicitly integrating optical flow with spatial embeddings to model temporal information efficiently, inspired by human perception of motion. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical flow encoders instead of training video models from scratch. This significantly reduces computational complexity while enhancing temporal interpretability. Our primary contributions includes (1) proposing a computationally efficient temporally-centric architecture for video understanding (2) demonstrating enhanced interpretability in modeling temporal dynamics; and (3) achieving state-of-the-art performance on diverse benchmarks, including clinical, medical, and standard action recognition datasets, confirming the broad applicability and effectiveness of our approach.
zh

[CV-93] Revolutionizing Radiology Workflow with Factual and Efficient CXR Report Generation

【速读】:该论文旨在解决医学影像解读中效率与准确性的不足问题,特别是针对胸部X光(CXR)报告生成的自动化需求。其解决方案的关键在于提出一种名为Clinician-Guided Adversarial Fine-Tuning (CGAFT) 的独特训练范式,该方法通过将专家临床反馈整合到对抗学习框架中,以减少事实性不一致并提高诊断精度;同时结合Knowledge Graph Augmentation Module (KGAM),在推理阶段动态验证生成的医学陈述,确保术语标准化并降低幻觉现象。

链接: https://arxiv.org/abs/2506.01118
作者: Pimchanok Sukjai,Apiradee Boonmee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The escalating demand for medical image interpretation underscores the critical need for advanced artificial intelligence solutions to enhance the efficiency and accuracy of radiological diagnoses. This paper introduces CXR-PathFinder, a novel Large Language Model (LLM)-centric foundation model specifically engineered for automated chest X-ray (CXR) report generation. We propose a unique training paradigm, Clinician-Guided Adversarial Fine-Tuning (CGAFT), which meticulously integrates expert clinical feedback into an adversarial learning framework to mitigate factual inconsistencies and improve diagnostic precision. Complementing this, our Knowledge Graph Augmentation Module (KGAM) acts as an inference-time safeguard, dynamically verifying generated medical statements against authoritative knowledge bases to minimize hallucinations and ensure standardized terminology. Leveraging a comprehensive dataset of millions of paired CXR images and expert reports, our experiments demonstrate that CXR-PathFinder significantly outperforms existing state-of-the-art medical vision-language models across various quantitative metrics, including clinical accuracy (Macro F1 (14): 46.5, Micro F1 (14): 59.5). Furthermore, blinded human evaluation by board-certified radiologists confirms CXR-PathFinder’s superior clinical utility, completeness, and accuracy, establishing its potential as a reliable and efficient aid for radiological practice. The developed method effectively balances high diagnostic fidelity with computational efficiency, providing a robust solution for automated medical report generation.
zh

[CV-94] CountingFruit: Real-Time 3D Fruit Counting with Language-Guided Semantic Gaussian Splatting

【速读】:该论文旨在解决现实农业环境中准确水果计数的问题,该问题由于视觉遮挡、语义模糊以及三维重建的高计算需求而长期存在。其解决方案的关键在于提出FruitLangGS框架,该框架通过空间重建、语义嵌入和语言引导的实例估计来克服现有基于神经辐射场的方法在推理速度、泛化能力及开放集语义控制支持方面的不足。具体而言,FruitLangGS利用自适应高斯点云投射管道进行果园尺度场景重建,并通过压缩CLIP对齐的语言嵌入实现语义控制,最终在三维空间中通过提示驱动的语义过滤和分布感知采样实现高效的水果计数。

链接: https://arxiv.org/abs/2506.01109
作者: Fengze Li,Yangle Liu,Jieming Ma,Hai-Ning Liang,Yaochun Shen,Huangxiang Li,Zhijing Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Accurate fruit counting in real-world agricultural environments is a longstanding challenge due to visual occlusions, semantic ambiguity, and the high computational demands of 3D reconstruction. Existing methods based on neural radiance fields suffer from low inference speed, limited generalization, and lack support for open-set semantic control. This paper presents FruitLangGS, a real-time 3D fruit counting framework that addresses these limitations through spatial reconstruction, semantic embedding, and language-guided instance estimation. FruitLangGS first reconstructs orchard-scale scenes using an adaptive Gaussian splatting pipeline with radius-aware pruning and tile-based rasterization for efficient rendering. To enable semantic control, each Gaussian encodes a compressed CLIP-aligned language embedding, forming a compact and queryable 3D representation. At inference time, prompt-based semantic filtering is applied directly in 3D space, without relying on image-space segmentation or view-level fusion. The selected Gaussians are then converted into dense point clouds via distribution-aware sampling and clustered to estimate fruit counts. Experimental results on real orchard data demonstrate that FruitLangGS achieves higher rendering speed, semantic flexibility, and counting accuracy compared to prior approaches, offering a new perspective for language-driven, real-time neural rendering across open-world scenarios.
zh

[CV-95] DeepVerse: 4D Autoregressive Video Generation as a World Model

【速读】:该论文试图解决现有交互式模型仅预测视觉观测而忽视几何结构和空间一致性等隐藏状态的问题,这一缺陷导致误差快速累积和时间不一致性。解决方案的关键在于提出DeepVerse,这是一种新型的4D交互世界模型,其核心是将前一时间步的几何预测显式地纳入当前基于动作的预测中,从而更好地捕捉时空关系和物理动态,提升模型的长期空间一致性与预测准确性。

链接: https://arxiv.org/abs/2506.01103
作者: Junyi Chen,Haoyi Zhu,Xianglong He,Yifan Wang,Jianjun Zhou,Wenzheng Chang,Yang Zhou,Zizun Li,Zhoujie Fu,Jiangmiao Pang,Tong He
机构: Shanghai AI Lab (上海人工智能实验室); SJTU (上海交通大学); USTC (中国科学技术大学); THU (清华大学); ZJU (浙江大学); FDU (复旦大学); NTU (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval, effectively preserving long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.
zh

[CV-96] Keystep Recognition using Graph Neural Networks

【速读】:该论文试图解决细粒度的自我中心视频(egocentric video)中的动作步骤(keystep)识别问题,其核心挑战在于有效建模视频中的长期依赖关系。解决方案的关键在于提出一种灵活的图学习框架(GLEVR),将每个视频片段视为图中的节点,并通过构建稀疏且计算高效的图结构来捕捉这些长期依赖关系。此外,该方法在训练过程中利用自我中心与他者中心视频之间的对齐关系以及自动字幕作为额外模态,进一步提升了自我中心视频的推理性能。

链接: https://arxiv.org/abs/2506.01102
作者: Julia Lee Romero,Kyle Min,Subarna Tripathi,Morteza Karimzadeh
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, as well as adding automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies to define connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.
zh

[CV-97] Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理大量视觉标记时导致的计算成本高和效率低的问题。其解决方案的关键在于通过可解释性方法评估每个视觉标记的重要性,并在输入阶段进行有效的标记压缩,而不会造成显著的性能损失。研究进一步提出了一种从第一层注意力图到解释结果的映射学习机制,从而避免了完整的推理过程,提升了实际部署的可行性。

链接: https://arxiv.org/abs/2506.01097
作者: Lei Lei,Jie Gu,Xiaokang Ma,Chu Tang,Jingmin Chen,Tong Xu
机构: University of Science and Technology of China (中国科学技术大学); Rightly Robotics (Rightly Robotics)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Previous works generally assume that all visual tokens are necessary in the shallow layers of LLMs, and therefore token compression typically occurs in intermediate layers. In contrast, our study reveals an interesting insight: with proper selection, token compression is feasible at the input stage of LLM with negligible performance loss. Specifically, we reveal that explainability methods can effectively evaluate the importance of each visual token with respect to the given instruction, which can well guide the token compression. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass and facilitating practical deployment. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 10 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the effectiveness of our approach, e.g., pruning 50% visual tokens while retaining more than 96% of the original performance across all benchmarks for all these three MLLMs. It also exhibits strong generalization, even when the number of tokens in inference far exceeds that used in training.
zh

[CV-98] PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation

【速读】:该论文旨在解决传统3D动画制作中所需的专业技能、时间消耗以及计算资源密集的问题。其核心挑战在于如何简化3D视觉效果(VFX)的生成过程,使非专业用户也能高效创建复杂的4D动态效果。解决方案的关键在于将3D动画重新定义为一个场预测任务,并引入一种基于文本驱动的框架,通过大型语言模型(LLMs)和视觉-语言模型(VLMs)生成函数,实时推断作用于3D高斯分布的时间变化4D流场,从而实现对颜色、透明度和位置的即时调整,避免了网格提取、手动或物理模拟等冗余步骤。

链接: https://arxiv.org/abs/2506.01091
作者: Mert Kiray,Paul Uhlenbruck,Nassir Navab,Benjamin Busam
机构: Technical University of Munich (慕尼黑工业大学); 3Dwe.ai
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time consuming. Generative solutions typically rely on computationally intense methods such as diffusion models which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., “make the vase glow orange, then explode”) and instantly updates color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction, manual or physics-based simulations and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to democratize VFX further.
zh

[CV-99] Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

【速读】:该论文旨在解决指令微调(instruction tuning)在视觉-语言模型(VLMs)中的高成本问题,包括大规模数据集、高质量标注和大量计算资源的需求。其解决方案的关键在于提出PROGRESS(PRioritized cOncept learninG via Relative Error-driven Sample Selection),该框架通过动态选择最具信息量的样本进行学习,从而实现数据和计算效率的提升。PROGRESS根据模型在训练过程中的学习进展,优先选择尚未掌握且当前阶段可学习的样本,有效控制技能获取的顺序与节奏,同时无需预先标注答案、依赖辅助VLM的监督或进行计算密集型的梯度计算。

链接: https://arxiv.org/abs/2506.01085
作者: Shivam Chandhok,Qian Yang,Oscar Manas,Kanishk Jain,Leonid Sigal,Aishwarya Agrawal
机构: Mila - Québec AI Institute (Mila - 魁北克人工智能研究所); University of British Columbia (不列颠哥伦比亚大学); Université de Montréal (蒙特利尔大学); Vector Institute for AI (向量人工智能研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive-requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples-those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on a need basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.
zh

[CV-100] GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking

【速读】:该论文试图解决当前主流多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉导向的多模态推理任务中表现不佳的问题,其根源在于这些模型过度依赖逻辑和知识驱动的慢思考策略,难以有效整合视觉信息,导致对视觉线索的合理锚定不足。解决方案的关键在于提出GThinker模型,该模型引入了Cue-Rethinking机制,这是一种基于视觉线索进行推理并迭代重新解释以解决不一致性的灵活推理模式,同时采用两阶段训练流程(包括模式引导的冷启动和激励强化学习),以增强跨领域的多模态推理能力。

链接: https://arxiv.org/abs/2506.01078
作者: Yufei Zhan,Ziheng Wu,Yousong Zhu,Rongkun Xue,Ruipu Luo,Zhenghao Chen,Can Zhang,Yifan Li,Zhentao He,Zheming Yang,Ming Tang,Minghui Qiu,Jinqiao Wang
机构: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所基础模型研究中心); ByteDance (字节跳动); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Peng Cheng Laboratory (鹏城实验室); Wuhan AI Research (武汉人工智能研究院); Xi’an Jiaotong University (西安交通大学); Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Tech report

点击查看摘要

Abstract:Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow thinking strategies, while effective for domains like math and science, fail to integrate visual information effectively during reasoning. Consequently, these models often fail to adequately ground visual cues, resulting in suboptimal performance in tasks that require multiple plausible visual interpretations and inferences. To address this, we present GThinker (General Thinker), a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies. Building on this pattern, we further propose a two-stage training pipeline, including pattern-guided cold start and incentive reinforcement learning, designed to enable multimodal reasoning capabilities across domains. Furthermore, to support the training, we construct GThinker-11K, comprising 7K high-quality, iteratively-annotated reasoning paths and 4K curated reinforcement learning samples, filling the data gap toward general multimodal reasoning. Extensive experiments demonstrate that GThinker achieves 81.5% on the challenging comprehensive multimodal reasoning benchmark M ^3 CoT, surpassing the latest O4-mini model. It also shows an average improvement of 2.1% on general scenario multimodal reasoning benchmarks, while maintaining on-par performance in mathematical reasoning compared to counterpart advanced reasoning models. The code, model, and data will be released soon at this https URL.
zh

[CV-101] A Large Convolutional Neural Network for Clinical Target and Multi-organ Segmentation in Gynecologic Brachytherapy with Multi-stage Learning

【速读】:该论文旨在解决妇科近距离放射治疗(GYN-BT)中临床靶区(CTV)和危及器官(OARs)分割的准确性问题,这一问题在治疗计划优化中至关重要。由于解剖变异、CT影像中软组织对比度低以及标注数据集有限,传统方法面临显著挑战。论文提出的解决方案是GynBTNet,其关键在于采用多阶段学习框架,包括基于稀疏子流形卷积的大规模CT数据集自监督预训练、综合多器官分割数据集的监督微调,以及针对GYN-BT任务的专用数据集微调,从而提升分割性能。

链接: https://arxiv.org/abs/2506.01073
作者: Mingzhe Hu,Yuan Gao,Yuheng Li,Ricahrd LJ Qiu,Chih-Wei Chang,Keyur D. Shah,Priyanka Kapoor,Beth Bradshaw,Yuan Shao,Justin Roper,Jill Remick,Zhen Tian,Xiaofeng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Accurate segmentation of clinical target volumes (CTV) and organs-at-risk is crucial for optimizing gynecologic brachytherapy (GYN-BT) treatment planning. However, anatomical variability, low soft-tissue contrast in CT imaging, and limited annotated datasets pose significant challenges. This study presents GynBTNet, a novel multi-stage learning framework designed to enhance segmentation performance through self-supervised pretraining and hierarchical fine-tuning strategies. Methods: GynBTNet employs a three-stage training strategy: (1) self-supervised pretraining on large-scale CT datasets using sparse submanifold convolution to capture robust anatomical representations, (2) supervised fine-tuning on a comprehensive multi-organ segmentation dataset to refine feature extraction, and (3) task-specific fine-tuning on a dedicated GYN-BT dataset to optimize segmentation performance for clinical applications. The model was evaluated against state-of-the-art methods using the Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (HD95), and Average Surface Distance (ASD). Results: Our GynBTNet achieved superior segmentation performance, significantly outperforming nnU-Net and Swin-UNETR. Notably, it yielded a DSC of 0.837 +/- 0.068 for CTV, 0.940 +/- 0.052 for the bladder, 0.842 +/- 0.070 for the rectum, and 0.871 +/- 0.047 for the uterus, with reduced HD95 and ASD compared to baseline models. Self-supervised pretraining led to consistent performance improvements, particularly for structures with complex boundaries. However, segmentation of the sigmoid colon remained challenging, likely due to anatomical ambiguities and inter-patient variability. Statistical significance analysis confirmed that GynBTNet’s improvements were significant compared to baseline models.
zh

[CV-102] Aligned Contrastive Loss for Long-Tailed Recognition CVPR2025

【速读】:该论文试图解决长尾分布识别(long-tailed recognition)问题,即在数据分布不均衡的情况下模型性能下降的问题。解决方案的关键在于提出了一种对齐对比学习(Aligned Contrastive Learning, ACL)算法,通过消除监督对比学习(supervised contrastive learning, SCL)中的梯度冲突以及正负样本对之间的不平衡吸引力和排斥力梯度,从而提升模型的泛化能力。

链接: https://arxiv.org/abs/2506.01071
作者: Jiali Ma,Jiequan Cui,Maeno Kazuki,Lakshmi Subramanian,Karlekar Jayashree,Sugiri Pranata,Hanwang Zhang
机构: Panasonic R&D Center Singapore(松下研发中心新加坡); Nanyang Technological University(南洋理工大学); Panasonic Connect Co., Ltd. R&D Division(松下连接有限公司研发部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 DG-EBF Workshop

点击查看摘要

Abstract:In this paper, we propose an Aligned Contrastive Learning (ACL) algorithm to address the long-tailed recognition problem. Our findings indicate that while multi-view training boosts the performance, contrastive learning does not consistently enhance model generalization as the number of views increases. Through theoretical gradient analysis of supervised contrastive learning (SCL), we identify gradient conflicts, and imbalanced attraction and repulsion gradients between positive and negative pairs as the underlying issues. Our ACL algorithm is designed to eliminate these problems and demonstrates strong performance across multiple benchmarks. We validate the effectiveness of ACL through experiments on long-tailed CIFAR, ImageNet, Places, and iNaturalist datasets. Results show that ACL achieves new state-of-the-art performance.
zh

[CV-103] Revolutionizing Blood Banks: AI-Driven Fingerprint-Blood Group Correlation for Enhanced Safety

【速读】:该论文试图解决如何通过结合指纹模式与ABO血型数据来提升个人识别的准确性问题。研究的关键在于评估指纹类型(环形、螺旋形和弓形)与血型之间的关联性,并验证血型数据是否能有效增强传统指纹识别系统的性能。研究结果表明,尽管存在一定的关联性,但不同血型的指纹模式之间并无统计学显著差异,因此血型数据在结合指纹识别时对个人识别的提升作用有限。

链接: https://arxiv.org/abs/2506.01069
作者: Malik A. Altayar,Muhyeeddin Alqaraleh,Mowafaq Salem Alzboon,Wesam T. Almagharbeh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Identification of a person is central in forensic science, security, and healthcare. Methods such as iris scanning and genomic profiling are more accurate but expensive, time-consuming, and more difficult to implement. This study focuses on the relationship between the fingerprint patterns and the ABO blood group as a biometric identification tool. A total of 200 subjects were included in the study, and fingerprint types (loops, whorls, and arches) and blood groups were compared. Associations were evaluated with statistical tests, including chi-square and Pearson correlation. The study found that the loops were the most common fingerprint pattern and the O+ blood group was the most prevalent. Even though there was some associative pattern, there was no statistically significant difference in the fingerprint patterns of different blood groups. Overall, the results indicate that blood group data do not significantly improve personal identification when used in conjunction with fingerprinting. Although the study shows weak correlation, it may emphasize the efforts of multi-modal based biometric systems in enhancing the current biometric systems. Future studies may focus on larger and more diverse samples, and possibly machine learning and additional biometrics to improve identification methods. This study addresses an element of the ever-changing nature of the fields of forensic science and biometric identification, highlighting the importance of resilient analytical methods for personal identification.
zh

[CV-104] Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs

【速读】:该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在面对视觉对抗样本时性能显著下降的问题,即如何有效净化对抗样本以提升模型的鲁棒性。解决方案的关键在于提出一种名为F3的对抗净化框架,其核心思想是采用“以火攻火”的策略,通过向对抗样本中引入简单的扰动来削弱其破坏性,具体而言,F3利用随机扰动后的对抗样本生成的跨模态注意力作为参考目标,通过注入噪声优化其注意力机制,从而获得更清洁和可靠的模型输出。

链接: https://arxiv.org/abs/2506.01064
作者: Yudong Zhang,Ruobing Xie,Yiqing Huang,Jiansheng Chen,Xingwu Sun,Zhanhui Kang,Di Wang,Yu Wang
机构: Tsinghua University (清华大学); Tencent (腾讯); University of Science and Technology Beijing (北京科技大学); University of Macau (澳门大学); State Key Laboratory of Space Network and Communications, Tsinghua University (清华大学空间网络与通信国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. Despite their potential impact, the development of effective methods for purifying such adversarial examples has received relatively limited attention. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive “fighting fire with fire” strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversary examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code will be made publicly available.
zh

[CV-105] AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation

【速读】:该论文旨在系统梳理和总结视频帧插值(Video Frame Interpolation, VFI)领域的研究进展,解决当前VFI方法在技术原理、学习范式、挑战问题及应用方向上的分散与不统一问题。其解决方案的关键在于提出一种全面的分类框架,将VFI方法分为中心时间帧插值(Center-Time Frame Interpolation, CTFI)和任意时间帧插值(Arbitrary-Time Frame Interpolation, ATFI),并深入分析各类方法的核心原理、设计假设和技术特征,同时探讨VFI面临的主要挑战如大运动、遮挡、光照变化和非线性运动等,为后续研究提供理论支持与实践指导。

链接: https://arxiv.org/abs/2506.01061
作者: Dahyeon Kye,Changhyun Roh,Sukhun Ko,Chanho Eom,Jihyong Oh
机构: Chung-Ang University (忠南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page at this https URL

点击查看摘要

Abstract:Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones while maintaining spatial and temporal coherence. VFI techniques have evolved from classical motion compensation-based approach to deep learning-based approach, including kernel-, flow-, hybrid-, phase-, GAN-, Transformer-, Mamba-, and more recently diffusion model-based approach. We introduce AceVFI, the most comprehensive survey on VFI to date, covering over 250+ papers across these approaches. We systematically organize and describe VFI methodologies, detailing the core principles, design assumptions, and technical characteristics of each approach. We categorize the learning paradigm of VFI methods namely, Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI). We analyze key challenges of VFI such as large motion, occlusion, lighting variation, and non-linear motion. In addition, we review standard datasets, loss functions, evaluation metrics. We examine applications of VFI including event-based, cartoon, medical image VFI and joint VFI with other LLV tasks. We conclude by outlining promising future research directions to support continued progress in the field. This survey aims to serve as a unified reference for both newcomers and experts seeking a deep understanding of modern VFI landscapes.
zh

[CV-106] ECP-Mamba: An Efficient Multi-scale Self-supervised Contrastive Learning Method with State Space Model for PolSAR Image Classification

【速读】:该论文旨在解决极化合成孔径雷达(PolSAR)图像分类中对大量标注数据的依赖以及Transformer等架构计算效率低的问题。其解决方案的关键在于提出ECP-Mamba框架,该框架结合了多尺度自监督对比学习与状态空间模型(SSM)主干网络,通过多尺度预测预训练任务缓解标注数据稀缺问题,并利用螺旋扫描策略优化Mamba架构以提升计算效率,同时引入轻量级Cross Mamba模块实现多尺度特征的高效交互。

链接: https://arxiv.org/abs/2506.01040
作者: Zuzheng Kuang,Haixia Bi,Chen Xu,Jian Sun
机构: Xi’an Jiaotong University (西安交通大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, polarimetric synthetic aperture radar (PolSAR) image classification has been greatly promoted by deep neural networks. However,current deep learning-based PolSAR classification methods encounter difficulties due to its dependence on extensive labeled data and the computational inefficiency of architectures like Transformers. This paper presents ECP-Mamba, an efficient framework integrating multi-scale self-supervised contrastive learning with a state space model (SSM) backbone. Specifically, ECP-Mamba addresses annotation scarcity through a multi-scale predictive pretext task based on local-to-global feature correspondences, which uses a simplified self-distillation paradigm without negative sample pairs. To enhance computational efficiency,the Mamba architecture (a selective SSM) is first tailored for pixel-wise PolSAR classification task by designing a spiral scan strategy. This strategy prioritizes causally relevant features near the central pixel, leveraging the localized nature of pixel-wise classification tasks. Additionally, the lightweight Cross Mamba module is proposed to facilitates complementary multi-scale feature interaction with minimal overhead. Extensive experiments across four benchmark datasets demonstrate ECP-Mamba’s effectiveness in balancing high accuracy with resource efficiency. On the Flevoland 1989 dataset, ECP-Mamba achieves state-of-the-art performance with an overall accuracy of 99.70%, average accuracy of 99.64% and Kappa coefficient of 99.62e-2. Our code will be available at this https URL.
zh

[CV-107] Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution CVPR2025

【速读】:该论文旨在解决现有基于扩散模型的视频超分辨率(Video Super-Resolution, VSR)方法在生成高分辨率视频时容易引入复杂退化和明显伪影的问题。其关键解决方案是结合自监督学习与Mamba结构,构建一个噪声鲁棒的现实世界VSR框架。具体而言,通过引入带有3D选择性扫描模块的视频状态空间块增强扩散模型的全局时空注意力机制,以在可接受的计算成本下提升帧间内容一致性;同时,设计一种自监督ControlNet,利用高分辨率特征作为引导,并通过对比学习提取低分辨率视频中对退化不敏感的特征,从而减少生成细节中的伪影。此外,采用基于高分辨率-低分辨率视频混合的三阶段训练策略以稳定VSR训练过程。

链接: https://arxiv.org/abs/2506.01037
作者: Shijun Shi,Jing Xu,Lijing Lu,Zhihang Li,Kai Hu
机构: Jiangnan University (江南大学); University of Science and Technology of China (中国科学技术大学); Peking University (北京大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 10 figures, accepted by CVPR 2025

点击查看摘要

Abstract:Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves superior perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
zh

[CV-108] NavBench: Probing Multimodal Large Language Models for Embodied Navigation

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身环境中的理解和执行能力不足的问题,特别是其在零样本设置下进行导航的能力尚未得到充分探索。解决方案的关键在于提出NavBench基准,用于评估MLLMs的具身导航能力,该基准包含两个核心部分:一是通过三个基于认知任务评估导航理解能力,二是通过432个场景中的逐步执行评估模型的实际操作性能。此外,研究还引入了一条将MLLM输出转换为机器人动作的流水线,以支持实际部署。

链接: https://arxiv.org/abs/2506.01031
作者: Yanyuan Qiao,Haodong Hong,Wenqi Lyu,Dong An,Siqi Zhang,Yutong Xie,Xinyu Wang,Qi Wu
机构: The University of Adelaide(阿德莱德大学); The University of Queensland(昆士兰大学); CSIRO Data61(澳大利亚联邦科学与工业研究组织数据61); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark to evaluate the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs’ outputs into robotic actions. We evaluate both proprietary and open-source models, finding that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases. Results also show that models with higher comprehension scores tend to achieve better execution performance. Providing map-based context improves decision accuracy, especially in medium-difficulty scenarios. However, most models struggle with temporal understanding, particularly in estimating progress during navigation, which may pose a key challenge.
zh

[CV-109] Modality Translation and Registration of MR and Ultrasound Images Using Diffusion Models

【速读】:该论文旨在解决多模态磁共振成像(MRI)与超声(US)图像配准在前列腺癌诊断中的挑战性问题,特别是由于模态差异导致的关键边界对齐失败和对无关细节过度敏感的问题。其解决方案的关键在于提出了一种基于分层特征解耦设计的解剖一致模态转换(Anatomically Coherent Modality Translation, ACMT)网络,通过引入一个定制的中间伪模态,使MRI和US图像均向该中间域进行转换,从而有效克服传统转换方法在下游配准任务中的瓶颈。

链接: https://arxiv.org/abs/2506.01025
作者: Xudong Ma,Nantheera Anantrasirichai,Stefanos Bolomytis,Alin Achim
机构: University of Bristol(布里斯托大学); North Bristol NHS Trust(北布里斯托国家医疗服务体系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal MR-US registration is critical for prostate cancer diagnosis. However, this task remains challenging due to significant modality discrepancies. Existing methods often fail to align critical boundaries while being overly sensitive to irrelevant details. To address this, we propose an anatomically coherent modality translation (ACMT) network based on a hierarchical feature disentanglement design. We leverage shallow-layer features for texture consistency and deep-layer features for boundary preservation. Unlike conventional modality translation methods that convert one modality into another, our ACMT introduces the customized design of an intermediate pseudo modality. Both MR and US images are translated toward this intermediate domain, effectively addressing the bottlenecks faced by traditional translation methods in the downstream registration task. Experiments demonstrate that our method mitigates modality-specific discrepancies while preserving crucial anatomical boundaries for accurate registration. Quantitative evaluations show superior modality similarity compared to state-of-the-art modality translation methods. Furthermore, downstream registration experiments confirm that our translated images achieve the best alignment performance, highlighting the robustness of our framework for multi-modal prostate image registration.
zh

[CV-110] AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

【速读】:该论文试图解决将音频模态与可提示视频分割模型Segment Anything Model 2 (SAM2)进行有效融合的问题,现有方法在效率、定位精度和跨模态语义交互方面存在不足。其解决方案的关键在于提出AuralSAM2,其中包含新颖的AuralFuser模块,该模块外部接入SAM2以整合多模态特征并生成特征级提示,从而引导SAM2解码器分割发声目标,同时通过特征金字塔提升语义理解与多模态场景下的目标感知能力。此外,引入了音频引导的对比学习以显式对齐音视频表示并缓解视觉主导模式带来的偏差。

链接: https://arxiv.org/abs/2506.01015
作者: Yuyuan Liu,Yuanhong Chen,Chong Wang,Junlin Han,Junde Wu,Can Peng,Jingkun Chen,Yu Tian,Gustavo Carneiro
机构: University of Oxford(牛津大学); University of Adelaide(阿德莱德大学); Stanford University(斯坦福大学); University of Central Florida(中佛罗里达大学); University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 18 Figures and 7 tables

点击查看摘要

Abstract:Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2’s decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, the audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to also mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at this https URL.
zh

[CV-111] Motion-Aware Concept Alignment for Consistent Video Editing

【速读】:该论文试图解决视频生成中图像域语义混合与视频之间的差距问题,即如何在保持原始运动和视觉上下文的前提下,将用户提供的参考图像的语义特征注入视频中的特定对象。解决方案的关键在于引入一种无需训练的框架MoCA-Video,其核心是利用对角线去噪调度和类无关分割技术,在潜在空间中检测和跟踪对象,并精确控制融合对象的空间位置,同时通过基于动量的语义校正和伽马残差噪声稳定化确保时间一致性。

链接: https://arxiv.org/abs/2506.01004
作者: Tong Zhang,Juan C Leon Alcazar,Bernard Ghanem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video, while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA’s performance using the standard SSIM, image-level LPIPS, temporal LPIPS, and introduce a novel metric CASS (Conceptual Alignment Shift Score) to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. Using self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite having no training or fine-tuning. MoCA-Video demonstrates that structured manipulation in the diffusion noise trajectory allows for controllable, high-quality video synthesis.
zh

[CV-112] Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts

【速读】:该论文试图解决视觉重编程(Visual Reprogramming, VR)中现有方法在CLIP模型上仅使用单一可训练噪声模式(即视觉提示)来适应下游任务时,由于学习能力有限导致的两个问题:一是无法捕捉描述中的多样化特征(如形状、颜色和纹理),二是可能偏向于对类别区分无帮助的非信息性属性。解决方案的关键在于提出一种解耦与重加权框架,通过将描述按显式原因(DVP-cse)或无监督聚类(DVP-cls)分组优化解耦的视觉提示,并利用概率重加权矩阵(Probabilistic Reweighting Matrix, PRM)整合这些提示的输出,以衡量其对每个下游类别的贡献。

链接: https://arxiv.org/abs/2506.01000
作者: Chengyi Cai,Zesheng Ye,Lei Feng,Jianzhong Qi,Feng Liu
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at this https URL.
zh

[CV-113] Pseudo-Labeling Driven Refinement of Benchmark Object Detection Datasets via Analysis of Learning Patterns

【速读】:该论文试图解决MS-COCO数据集中存在的标注问题,包括缺失标签、错误类别分配、不准确的边界框、重复标签和群体标注不一致等,这些问题影响了目标检测模型的训练效果和泛化能力。解决方案的关键在于提出了一种基于损失和梯度的误差检测方法,结合四阶段伪标签精炼流程,包括使用可逆变换生成边界框、基于IoU的重复去除与置信度合并、通过专家物体识别器进行类别一致性验证以及基于物体区域激活图分析的空间调整,从而实现无需人工重新标注的可扩展且精确的标注错误修正。

链接: https://arxiv.org/abs/2506.00997
作者: Min Je Kim,Muhammad Munsif,Altaf Hussain,Hikmat Yar,Sung Wook Baik
机构: Sejong University (世宗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Benchmark object detection (OD) datasets play a pivotal role in advancing computer vision applications such as autonomous driving, and surveillance, as well as in training and evaluating deep learning-based state-of-the-art detection models. Among them, MS-COCO has become a standard benchmark due to its diverse object categories and complex scenes. However, despite its wide adoption, MS-COCO suffers from various annotation issues, including missing labels, incorrect class assignments, inaccurate bounding boxes, duplicate labels, and group labeling inconsistencies. These errors not only hinder model training but also degrade the reliability and generalization of OD models. To address these challenges, we propose a comprehensive refinement framework and present MJ-COCO, a newly re-annotated version of MS-COCO. Our approach begins with loss and gradient-based error detection to identify potentially mislabeled or hard-to-learn samples. Next, we apply a four-stage pseudo-labeling refinement process: (1) bounding box generation using invertible transformations, (2) IoU-based duplicate removal and confidence merging, (3) class consistency verification via expert objects recognizer, and (4) spatial adjustment based on object region activation map analysis. This integrated pipeline enables scalable and accurate correction of annotation errors without manual re-labeling. Extensive experiments were conducted across four validation datasets: MS-COCO, Sama COCO, Objects365, and PASCAL VOC. Models trained on MJ-COCO consistently outperformed those trained on MS-COCO, achieving improvements in Average Precision (AP) and APS metrics. MJ-COCO also demonstrated significant gains in annotation coverage: for example, the number of small object annotations increased by more than 200,000 compared to MS-COCO.
zh

[CV-114] mporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

【速读】:该论文旨在解决可控视频生成在有限数据和计算资源下的挑战,特别是在空间对齐条件下的灵活性和可扩展性受限的问题。其解决方案的关键在于提出一种名为Temporal In-Context Fine-Tuning (TIC-FT) 的高效且通用的微调方法,通过将条件帧和目标帧沿时间轴拼接,并插入具有逐步增加噪声水平的中间缓冲帧,以实现与预训练模型时间动态一致的平滑过渡,从而无需架构修改即可在少量样本下取得优异性能。

链接: https://arxiv.org/abs/2506.00996
作者: Kinam Kim,Junha Hyung,Jaegul Choo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model’s temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit this https URL
zh

[CV-115] FlexSelect: Flexible Token Selection for Efficient Long Video Understanding

【速读】:该论文旨在解决长视频理解中因计算和内存需求过高而对视频大语言模型(VideoLLM)造成的挑战。其解决方案的关键在于提出FlexSelect,这是一种灵活高效的标记选择策略,通过利用参考Transformer层的跨模态注意力模式来识别并保留最具语义相关性的内容,包含两个核心组件:无需训练的标记排序流程和基于排名监督的轻量级选择器,从而有效减少冗余标记并扩展模型的时间上下文长度。

链接: https://arxiv.org/abs/2506.00993
作者: Yunzhu Zhang,Yu Lu,Tianyi Wang,Fengyun Rao,Yi Yang,Linchao Zhu
机构: Zhejiang University (浙江大学); Tencent Inc (腾讯公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-form video understanding poses a significant challenge for video large language models (VideoLLMs) due to prohibitively high computational and memory demands. In this paper, we propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token’s importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens. This generic approach can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL and Qwen-VL, serving as a plug-and-play module to extend their temporal context length. Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks including VideoMME, MLVU, LongVB, and LVBench. Moreover, it achieves significant speed-ups (for example, up to 9 times on a LLaVA-Video-7B model), highlighting FlexSelect’s promise for efficient long-form video understanding. Project page available at: this https URL
zh

[CV-116] Quotient Network - A Network Similar to ResNet but Learning Quotients NEURIPS2024

【速读】:该论文试图解决ResNet(残差网络)在训练极深网络时存在的两个问题:一是目标特征与现有特征之间的差异缺乏独立且明确的含义,二是学习量基于绝对差异而非相对差异,从而对现有特征的大小敏感。解决方案的关键在于提出一种新的网络结构——商网络(quotient network),该网络通过学习目标特征与现有特征的商来替代传统的差异学习,从而有效克服上述问题,并在不增加新参数的情况下实现比ResNet更高的性能。

链接: https://arxiv.org/abs/2506.00992
作者: Peng Hui,Jiamuyang Zhao,Changxin Li,Qingzhen Zhu
机构: Jiangsu University (江苏大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This manuscript is the original version submitted to NeurIPS 2024, which was later revised and published as “Quotient Network: A Network Similar to ResNet but Learning Quotients” in Algorithms 2024, 17(11), 521 ( this https URL ). Please cite the journal version when referring to this work

点击查看摘要

Abstract:The emergence of ResNet provides a powerful tool for training extremely deep networks. The core idea behind it is to change the learning goals of the network. It no longer learns new features from scratch but learns the difference between the target and existing features. However, the difference between the two kinds of features does not have an independent and clear meaning, and the amount of learning is based on the absolute rather than the relative difference, which is sensitive to the size of existing features. We propose a new network that perfectly solves these two problems while still having the advantages of ResNet. Specifically, it chooses to learn the quotient of the target features with the existing features, so we call it the quotient network. In order to enable this network to learn successfully and achieve higher performance, we propose some design rules for this network so that it can be trained efficiently and achieve better performance than ResNet. Experiments on the CIFAR10, CIFAR100, and SVHN datasets prove that this network can stably achieve considerable improvements over ResNet by simply making tiny corresponding changes to the original ResNet network without adding new parameters.
zh

[CV-117] GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLM s

【速读】:该论文试图解决多模态大语言模型(Multi-modality Large Language Models, MLLMs)在细粒度物理原理,特别是几何光学方面的理解与生成能力缺乏系统评估的问题。解决方案的关键在于引入GOBench基准,该基准通过两个任务——生成光学真实图像和理解潜在光学现象——对MLLMs的能力进行系统性评估,从而揭示当前模型在光学生成和理解方面存在的显著不足。

链接: https://arxiv.org/abs/2506.00991
作者: Xiaorong Zhu,Ziheng Jia,Jiarui Wang,Xiangyu Zhao,Haodong Duan,Xiongkuo Min,Jia Wang,Zicheng Zhang,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs’ ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curates high-quality prompts of geometric optical scenarios and use MLLMs to construct GOBench-Gen-1k this http URL then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs’ generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35% accuracy in optical understanding.
zh

[CV-118] LensCraft: Your Professional Virtual Cinematographer

【速读】:该论文试图解决数字创作者在将创意愿景转化为精确摄像机运动时所面临的瓶颈问题,现有自动拍摄系统在机械执行与创作意图之间存在根本性权衡。解决方案的关键在于LensCraft通过数据驱动的方法模拟专业摄影师的技艺,结合电影摄影原则与实时适应动态场景的灵活性,同时考虑被摄对象的方位和真实体积,从而提升拍摄过程中的空间感知能力。

链接: https://arxiv.org/abs/2506.00988
作者: Zahra Dehghanian,Morteza Abolghasemi,Hossein Azizinaghsh,Amir Vahedi,Hamid Beigy,Hamid R. Rabiee
机构: Sharif University of Technology (谢里夫理工大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Digital creators, from indie filmmakers to animation studios, face a persistent bottleneck: translating their creative vision into precise camera movements. Despite significant progress in computer vision and artificial intelligence, current automated filming systems struggle with a fundamental trade-off between mechanical execution and creative intent. Crucially, almost all previous works simplify the subject to a single point-ignoring its orientation and true volume-severely limiting spatial awareness during filming. LensCraft solves this problem by mimicking the expertise of a professional cinematographer, using a data-driven approach that combines cinematographic principles with the flexibility to adapt to dynamic scenes in real time. Our solution combines a specialized simulation framework for generating high-fidelity training data with an advanced neural model that is faithful to the script while being aware of the volume and dynamic behavior of the subject. Additionally, our approach allows for flexible control via various input modalities, including text prompts, subject trajectory and volume, key points, or a full camera trajectory, offering creators a versatile tool to guide camera movements in line with their vision. Leveraging a lightweight real time architecture, LensCraft achieves markedly lower computational complexity and faster inference while maintaining high output quality. Extensive evaluation across static and dynamic scenarios reveals unprecedented accuracy and coherence, setting a new benchmark for intelligent camera systems compared to state-of-the-art models. Extended results, the complete dataset, simulation environment, trained model weights, and source code are publicly accessible on LensCraft Webpage.
zh

[CV-119] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

【速读】:该论文旨在解决当前人工智能生成内容(AIGC)检测方法在可解释性、多模态统一检测能力及模型透明度方面的不足。现有方法通常作为黑箱二分类器,缺乏可解释性且无法在统一框架下检测图像和视频,导致信任度降低和实际部署受限。其解决方案的关键在于构建一个名为IVY-FAKE的新型大规模、可解释的多模态AIGC检测数据集,并提出Ivy Explainable Detector (IVY-XDETECTOR),该架构实现了图像和视频内容的联合可解释检测,通过融合视觉与语言信息,在多个检测基准上达到了最先进性能。

链接: https://arxiv.org/abs/2506.00979
作者: Wayne Zhang,Changjiang Jiang,Zhonghao Zhang,Chenyang Si,Fengchang Yu,Wei Peng
机构: Wuhan University(武汉大学); Stanford University(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20pages,13figures,7 tables

点击查看摘要

Abstract:The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE , a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose Ivy Explainable Detector (IVY-XDETECTOR), a unified AIGC detection and explainable architecture that jointly performs explainable detection for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at this https URL.
zh

[CV-120] CAPAA: Classifier-Agnostic Projector-Based Adversarial Attack

【速读】:该论文旨在解决基于投影的对抗攻击在多分类器系统和不同相机姿态场景下的有效性不足问题。现有方法主要针对单一分类器和固定相机姿态,忽视了复杂系统中新增分类器或变化姿态带来的挑战。其解决方案的关键在于提出一种无监督分类器的对抗损失和优化框架(Classifier-Agnostic Projector-Based Adversarial Attack, CAPAA),通过聚合多个分类器的对抗与隐蔽性损失梯度,并引入基于注意力的梯度加权机制,将扰动集中在高分类激活区域,从而提升对抗投影在不同相机姿态场景下的鲁棒性。

链接: https://arxiv.org/abs/2506.00978
作者: Zhan Li,Mingyu Zhao,Xin Dong,Haibin Ling,Bingyao Huang
机构: Southwest University, China; Rutgers University, USA; Stony Brook University, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Projector-based adversarial attack aims to project carefully designed light patterns (i.e., adversarial projections) onto scenes to deceive deep image classifiers. It has potential applications in privacy protection and the development of more robust classifiers. However, existing approaches primarily focus on individual classifiers and fixed camera poses, often neglecting the complexities of multi-classifier systems and scenarios with varying camera poses. This limitation reduces their effectiveness when introducing new classifiers or camera poses. In this paper, we introduce Classifier-Agnostic Projector-Based Adversarial Attack (CAPAA) to address these issues. First, we develop a novel classifier-agnostic adversarial loss and optimization framework that aggregates adversarial and stealthiness loss gradients from multiple classifiers. Then, we propose an attention-based gradient weighting mechanism that concentrates perturbations on regions of high classification activation, thereby improving the robustness of adversarial projections when applied to scenes with varying camera poses. Our extensive experimental evaluations demonstrate that CAPAA achieves both a higher attack success rate and greater stealthiness compared to existing baselines. Codes are available at: this https URL.
zh

[CV-121] Camera Trajectory Generation: A Comprehensive Survey of Methods Metrics and Future Directions

【速读】:该论文试图解决相机轨迹生成领域缺乏系统性和统一性综述的问题,旨在整合该领域的核心知识与最新进展。其解决方案的关键在于首次全面回顾了该领域,涵盖了从基础定义到先进方法的各个方面,包括不同的相机表示方法、规则驱动方法、基于优化的技术、机器学习进展以及融合多种策略的混合方法,并对常用评估指标和数据集进行了分析,为未来研究提供了方向和启示。

链接: https://arxiv.org/abs/2506.00974
作者: Zahra Dehghanian,Pouya Ardekhani,Amir Vahedi,Hamid Beigy,Hamid R. Rabiee
机构: Sharif University of Technology(沙里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Camera trajectory generation is a cornerstone in computer graphics, robotics, virtual reality, and cinematography, enabling seamless and adaptive camera movements that enhance visual storytelling and immersive experiences. Despite its growing prominence, the field lacks a systematic and unified survey that consolidates essential knowledge and advancements in this domain. This paper addresses this gap by providing the first comprehensive review of the field, covering from foundational definitions to advanced methodologies. We introduce the different approaches to camera representation and present an in-depth review of available camera trajectory generation models, starting with rule-based approaches and progressing through optimization-based techniques, machine learning advancements, and hybrid methods that integrate multiple strategies. Additionally, we gather and analyze the metrics and datasets commonly used for evaluating camera trajectory systems, offering insights into how these tools measure performance, aesthetic quality, and practical applicability. Finally, we highlight existing limitations, critical gaps in current research, and promising opportunities for investment and innovation in the field. This paper not only serves as a foundational resource for researchers entering the field but also paves the way for advancing adaptive, efficient, and creative camera trajectory systems across diverse applications.
zh

[CV-122] Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection

【速读】:该论文旨在解决持续学习(continual learning)在异常检测中的挑战,特别是在真实世界部署场景下模型的适应性和泛化能力问题。其关键解决方案是提出一个名为Continual-MEGA的新基准,该基准通过整合现有数据集与新提出的ContinualAD数据集,显著扩展了现有的评估设置,并引入了一种新的零样本泛化(zero-shot generalization)场景,以衡量模型在未见过类别上的表现。此外,研究还提出了一种统一的基线算法,以提升小样本检测的鲁棒性并保持强泛化能力。

链接: https://arxiv.org/abs/2506.00956
作者: Geonu Lee,Yujeong Oh,Geonhui Jang,Soyoung Lee,Jeonghyo Song,Sungmin Cha,YoungJoon Yoo
机构: SNUAILAB(首尔国立大学人工智能实验室); Chung-Ang University(中央大学); New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning with expanded quantity, we propose a novel scenario that measures zero-shot generalization to unseen classes, those not observed during continual adaptation. This setting poses a new problem setting that continual adaptation also enhances zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code in this https URL.
zh

[CV-123] IGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction

【速读】:该论文旨在解决传统预定义3D物体模板在手-物体交互的3D重建任务中存在的人工成本高、适应性差的问题,特别是在处理遮挡等非约束场景时表现受限。其解决方案的关键在于提出一种基于文本指令的生成与优化框架(Text-Instructed Generation and Refinement, TIGeR),通过文本驱动的先验生成和视觉引导的细化过程,实现无需繁琐3D建模即可生成合理的物体形状先验,并利用2D-3D协同注意力机制对合成原型进行几何校准,从而提升重建精度与鲁棒性。

链接: https://arxiv.org/abs/2506.00953
作者: Yiyao Huang,Zhedong Zheng,Yu Ziwei,Yaxiong Wang,Tze Ho Elden Tse,Angela Yao
机构: National University of Singapore (新加坡国立大学); University of Macau (澳门大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual efforts to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer the object shape refinement and pose estimation. We use a two-stage framework: a text-instructed prior generation and vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object interacted with the hand, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., 1.979 and 5.468 object Chamfer distance on the widely-used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework shows robustness to occlusion, while maintaining compatibility with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios.
zh

[CV-124] Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs

【速读】:该论文旨在解决血管形状的可变形配准以及合成解剖结构生成的问题。其核心解决方案是提出一种基于深度学习的模型AD-SVFD,该模型通过将每个几何体表示为加权点云,并将环境空间变形建模为常微分方程(ODE)在单位时间内的解,其中ODE的右端项由人工神经网络表达。关键在于通过最小化变形点云与参考点云之间的Chamfer距离来优化模型参数,同时利用ODE的反向积分定义逆变换,此外,其自解码器结构实现了跨形状队列的泛化和高效的权重共享。

链接: https://arxiv.org/abs/2506.00947
作者: Riccardo Tenderini,Luca Pegolotti,Fanwei Kong,Stefano Pagani,Francesco Regazzoni,Alison L. Marsden,Simone Deparis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 29 pages, 7 figures, 6 tables, 2 algorithms. Submitted to “npj Biological Physics and Mechanics”. Dataset publicly available at this https URL

点击查看摘要

Abstract:This work introduces AD-SVFD, a deep learning model for the deformable registration of vascular shapes to a pre-defined reference and for the generation of synthetic anatomies. AD-SVFD operates by representing each geometry as a weighted point cloud and models ambient space deformations as solutions at unit time of ODEs, whose time-independent right-hand sides are expressed through artificial neural networks. The model parameters are optimized by minimizing the Chamfer Distance between the deformed and reference point clouds, while backward integration of the ODE defines the inverse transformation. A distinctive feature of AD-SVFD is its auto-decoder structure, that enables generalization across shape cohorts and favors efficient weight sharing. In particular, each anatomy is associated with a low-dimensional code that acts as a self-conditioning field and that is jointly optimized with the network parameters during training. At inference, only the latent codes are fine-tuned, substantially reducing computational overheads. Furthermore, the use of implicit shape representations enables generative applications: new anatomies can be synthesized by suitably sampling from the latent space and applying the corresponding inverse transformations to the reference geometry. Numerical experiments, conducted on healthy aortic anatomies, showcase the high-quality results of AD-SVFD, which yields extremely accurate approximations at competitive computational costs.
zh

[CV-125] 3D Skeleton-Based Action Recognition: A Review

【速读】:该论文试图解决现有3D骨架动作识别研究中普遍存在的问题,即以往综述多从模型设计角度出发,忽视了骨架动作识别任务中的基础步骤,导致对任务本质的理解不够深入。其解决方案的关键在于提出一个任务导向的框架,将整个任务分解为一系列子任务,并特别强调预处理步骤如模态生成和数据增强,同时深入探讨特征提取与时空建模等关键子任务,还涵盖了混合架构、Mamba模型、大语言模型(Large Language Models, LLMs)和生成式模型等最新进展,从而为理解与推进3D骨架动作识别领域提供系统性指导。

链接: https://arxiv.org/abs/2506.00915
作者: Mengyuan Liu,Hong Liu,Qianshuo Hu,Bin Ren,Junsong Yuan,Jiaying Lin,Jiajun Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in the field of computer vision. However, previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition. This oversight tends to ignore key components of skeleton-based action recognition beyond model design and has hindered deeper, more intrinsic understanding of the task. To bridge this gap, our review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition. We begin by decomposing the task into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion delves into critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models have also been highlighted. Finally, a comprehensive overview of public 3D skeleton datasets is presented, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussions, comprehensive examinations of sub-tasks, and an emphasis on the latest advancements, our review provides a fundamental and accessible structured roadmap for understanding and advancing the field of 3D skeleton-based action recognition.
zh

[CV-126] DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation

【速读】:该论文旨在解决虚拟试衣任务中两个核心挑战:准确对齐服装图像与目标人体,以及保留服装的细粒度纹理和图案。其解决方案的关键在于提出DS-VTON框架,该框架通过双尺度结构实现目标解耦建模,第一阶段生成低分辨率试衣结果以捕捉服装与人体之间的语义对应关系,第二阶段引入残差引导的扩散过程,通过细化两尺度间的残差来重建高分辨率输出,从而专注于纹理保真度。此外,该方法采用全无掩码生成范式,避免依赖人体解析图或分割掩码,借助预训练扩散模型中的语义先验,更有效地保持人物外观和几何一致性。

链接: https://arxiv.org/abs/2506.00908
作者: Xianbing Sun,Yan Hong,Jiahui Zhan,Jun Lan,Huijia Zhu,Weiqiang Wang,Liqing Zhang,Jianfu Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person’s appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
zh

[CV-127] owards Edge-Based Idle State Detection in Construction Machinery Using Surveillance Cameras

【速读】:该论文旨在解决建筑行业中设备利用率优化的问题,特别是通过准确及时地监测设备活动来识别闲置时段,从而降低运营成本和项目延误。解决方案的关键在于提出了一种名为Edge-IMI的框架,该框架包含目标检测、跟踪和闲置状态识别三个组件,专门设计用于在资源受限的基于CPU的边缘计算设备上运行,从而实现高效的现场推理并减少对高带宽云服务和昂贵硬件加速器的依赖。

链接: https://arxiv.org/abs/2506.00904
作者: Xander Küpers,Jeroen Klein Brinke,Rob Bemthuis,Ozlem Durmaz Incel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 6 figures, 3 tables; to appear in Intelligent Systems and Applications, Lecture Notes in Networks and Systems (LNNS), Springer, 2025. Part of the 11th Intelligent Systems Conference (IntelliSys 2025), 28-29 August 2025, Amsterdam, The Netherlands

点击查看摘要

Abstract:The construction industry faces significant challenges in optimizing equipment utilization, as underused machinery leads to increased operational costs and project delays. Accurate and timely monitoring of equipment activity is therefore key to identifying idle periods and improving overall efficiency. This paper presents the Edge-IMI framework for detecting idle construction machinery, specifically designed for integration with surveillance camera systems. The proposed solution consists of three components: object detection, tracking, and idle state identification, which are tailored for execution on resource-constrained, CPU-based edge computing devices. The performance of Edge-IMI is evaluated using a combined dataset derived from the ACID and MOCS benchmarks. Experimental results confirm that the object detector achieves an F1 score of 71.75%, indicating robust real-world detection capabilities. The logistic regression-based idle identification module reliably distinguishes between active and idle machinery with minimal false positives. Integrating all three modules, Edge-IMI enables efficient on-site inference, reducing reliance on high-bandwidth cloud services and costly hardware accelerators. We also evaluate the performance of object detection models on Raspberry Pi 5 and an Intel NUC platforms, as example edge computing platforms. We assess the feasibility of real-time processing and the impact of model optimization techniques.
zh

[CV-128] Leverag ing CLIP Encoder for Multimodal Emotion Recognition WACV2025

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中由于缺乏大规模标注数据而导致性能提升受限的问题。其解决方案的关键在于利用基于对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)的架构,通过引入标签编码器(label encoder)将标签作为文本嵌入来捕捉语义信息,并设计跨模态解码器以实现不同模态特征在共享嵌入空间中的对齐,从而增强多模态表示的判别能力。

链接: https://arxiv.org/abs/2506.00903
作者: Yehun Song,Sunyoung Cho
机构: Agency for Defense Development (国防发展局); Sookmyung Women’s University (淑明女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE/CVF WACV 2025, pp.6115-6124, 2025

点击查看摘要

Abstract:Multimodal emotion recognition (MER) aims to identify human emotions by combining data from various modalities such as language, audio, and vision. Despite the recent advances of MER approaches, the limitations in obtaining extensive datasets impede the improvement of performance. To mitigate this issue, we leverage a Contrastive Language-Image Pre-training (CLIP)-based architecture and its semantic knowledge from massive datasets that aims to enhance the discriminative multimodal representation. We propose a label encoder-guided MER framework based on CLIP (MER-CLIP) to learn emotion-related representations across modalities. Our approach introduces a label encoder that treats labels as text embeddings to incorporate their semantic information, leading to the learning of more representative emotional features. To further exploit label semantics, we devise a cross-modal decoder that aligns each modality to a shared embedding space by sequentially fusing modality features based on emotion-related input from the label encoder. Finally, the label encoder-guided prediction enables generalization across diverse labels by embedding their semantic information as well as word labels. Experimental results show that our method outperforms the state-of-the-art MER methods on the benchmark datasets, CMU-MOSI and CMU-MOSEI.
zh

[CV-129] Uneven Event Modeling for Partially Relevant Video Retrieval ICME2025

【速读】:该论文旨在解决部分相关视频检索(Partially Relevant Video Retrieval, PRVR)中的事件建模问题,即如何准确地将未剪辑视频划分为与文本查询部分对应的短期事件。传统方法通常将视频分割为固定数量的等长片段,并依赖均值池化计算事件表示,导致事件边界模糊且存在对齐误差。该论文提出的解决方案关键在于引入不均匀事件建模(Uneven Event Modeling, UEM)框架,其中包含两个核心模块:基于时序依赖和语义相似性的渐进式分组视频分割(Progressive-Grouped Video Segmentation, PGVS)模块,用于生成清晰的事件边界;以及上下文感知事件优化(Context-Aware Event Refinement, CAER)模块,通过文本的交叉注意力机制优化事件表示,从而提升文本与视频之间的对齐精度。

链接: https://arxiv.org/abs/2506.00891
作者: Sa Zhu,Huashan Chen,Wanqian Zhang,Jinchao Zhang,Zexian Yang,Xiaoshuai Hao,Bo Li
机构: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; State Key Laboratory of Cyberspace Security Defense; Beijing Academy of Artificial Intelligence
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICME 2025

点击查看摘要

Abstract:Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments, wherein event modeling is crucial for partitioning the video into smaller temporal events that partially correspond to the text. Previous methods typically segment videos into a fixed number of equal-length clips, resulting in ambiguous event boundaries. Additionally, they rely on mean pooling to compute event representations, inevitably introducing undesired misalignment. To address these, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first introduce the Progressive-Grouped Video Segmentation (PGVS) module, to iteratively formulate events in light of both temporal dependencies and semantic similarity between consecutive frames, enabling clear event boundaries. Furthermore, we also propose the Context-Aware Event Refinement (CAER) module to refine the event representation conditioned the text’s cross-attention. This enables event representations to focus on the most relevant frames for a given text, facilitating more precise text-video alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two PRVR benchmarks.
zh

[CV-130] Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection

【速读】:该论文试图解决当前生成式AI内容(AIGC)检测器在面对未见过的生成器输出时泛化能力不足的问题。其关键解决方案是提出On-Manifold Adversarial Training(OMAT),通过在固定条件下的扩散模型初始潜在噪声上进行优化,生成位于生成器输出流形上的对抗样本,从而避免传统像素空间攻击引入的流形外扰动,使检测器能够学习到更鲁棒的生成特征。

链接: https://arxiv.org/abs/2506.00874
作者: Yue Zhou,Xinan He,KaiQing Lin,Bin Fan,Feng Ding,Bin Li
机构: Shenzhen University (深圳大学); Nanchang University (南昌大学); University of North Texas (北德克萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current AIGC detectors often achieve near-perfect accuracy on images produced by the same generator used for training but struggle to generalize to outputs from unseen generators. We trace this failure in part to latent prior bias: detectors learn shortcuts tied to patterns stemming from the initial noise vector rather than learning robust generative artifacts. To address this, we propose On-Manifold Adversarial Training (OMAT): by optimizing the initial latent noise of diffusion models under fixed conditioning, we generate on-manifold adversarial examples that remain on the generator’s output manifold-unlike pixel-space attacks, which introduce off-manifold perturbations that the generator itself cannot reproduce and that can obscure the true discriminative artifacts. To test against state-of-the-art generative models, we introduce GenImage++, a test-only benchmark of outputs from advanced generators (Flux.1, SD3) with extended prompts and diverse styles. We apply our adversarial-training paradigm to ResNet50 and CLIP baselines and evaluate across existing AIGC forensic benchmarks and recent challenge datasets. Extensive experiments show that adversarially trained detectors significantly improve cross-generator performance without any network redesign. Our findings on latent-prior bias offer valuable insights for future dataset construction and detector evaluation, guiding the development of more robust and generalizable AIGC forensic methodologies.
zh

[CV-131] Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations

【速读】:该论文试图解决当前深度伪造检测领域缺乏针对以人物为中心的对象、上下文和场景操作的大规模且具备推理能力的基准数据集的问题(benchmark dataset)。其解决方案的关键在于引入MultiFakeVerse,一个包含845,286张图像的大规模人物中心深度伪造数据集,这些图像通过视觉-语言模型(VLM)生成的操纵建议和图像编辑得到。该方法注重对影响人类感知重要性、意图或叙事的个体或场景上下文元素进行语义化、上下文感知的修改,而非传统数据集中常见的合成或低级身份替换和区域特定编辑。

链接: https://arxiv.org/abs/2506.00868
作者: Parul Gupta,Shreya Ghosh,Tom Gedeon,Thanh-Toan Do,Abhinav Dhall
机构: Monash University (莫纳什大学); Curtin University (科廷大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale and reasoning capability driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large scale person-centric deepfake dataset, comprising 845,286 images generated through manipulation suggestions and image manipulations both derived from vision-language models (VLM). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on \hrefthis https URLGitHub.
zh

[CV-132] Neural Path Guiding with Distribution Factorization

【速读】:该论文试图解决在渲染中使用蒙特卡洛(Monte Carlo, MC)积分时,现有神经方法在分布表示上难以同时兼顾速度与表达能力的问题。其解决方案的关键在于提出一种简单但有效的表示方法,将方向域上的二维分布分解为两个一维概率分布函数(PDF),并通过神经网络对每个一维PDF进行建模,利用插值技术在任意位置评估和采样PDF。此外,为提升训练稳定性并估计归一化因子,引入了一个额外的网络来缓存入射辐射。

链接: https://arxiv.org/abs/2506.00839
作者: Pedro Figueiredo,Qihao He,Nima Khademi Kalantari
机构: Texas A&M University (德克萨斯A&M大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 11 figures. Accepted to EGSR 2025

点击查看摘要

Abstract:In this paper, we present a neural path guiding method to aid with Monte Carlo (MC) integration in rendering. Existing neural methods utilize distribution representations that are either fast or expressive, but not both. We propose a simple, but effective, representation that is sufficiently expressive and reasonably fast. Specifically, we break down the 2D distribution over the directional domain into two 1D probability distribution functions (PDF). We propose to model each 1D PDF using a neural network that estimates the distribution at a set of discrete coordinates. The PDF at an arbitrary location can then be evaluated and sampled through interpolation. To train the network, we maximize the similarity of the learned and target distributions. To reduce the variance of the gradient during optimizations and estimate the normalization factor, we propose to cache the incoming radiance using an additional network. Through extensive experiments, we demonstrate that our approach is better than the existing methods, particularly in challenging scenes with complex light transport.
zh

[CV-133] Advancing from Automated to Autonomous Beamline by Leverag ing Computer Vision

【速读】:该论文旨在解决同步辐射光束线(synchrotron beamline)在操作过程中对人工安全监督的高度依赖问题,以实现更高效、可靠和安全的自动化实验。其解决方案的关键在于提出一种基于计算机视觉的系统,该系统结合深度学习与多视角摄像机,实现实时碰撞检测,通过设备分割、跟踪及几何分析,并利用迁移学习提升系统的鲁棒性,同时引入交互式标注模块以增强对新物体类别的适应能力。

链接: https://arxiv.org/abs/2506.00836
作者: Baolu Li,Hongkai Yu,Huiming Sun,Jin Ma,Yuewei Lin,Lu Ma,Yonghua Du
机构: Cleveland State University (克利夫兰州立大学); Brookhaven National Laboratory (布鲁克黑文国家实验室); National Synchrotron Light Source II (国家同步辐射光源II)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The synchrotron light source, a cutting-edge large-scale user facility, requires autonomous synchrotron beamline operations, a crucial technique that should enable experiments to be conducted automatically, reliably, and safely with minimum human intervention. However, current state-of-the-art synchrotron beamlines still heavily rely on human safety oversight. To bridge the gap between automated and autonomous operation, a computer vision-based system is proposed, integrating deep learning and multiview cameras for real-time collision detection. The system utilizes equipment segmentation, tracking, and geometric analysis to assess potential collisions with transfer learning that enhances robustness. In addition, an interactive annotation module has been developed to improve the adaptability to new object classes. Experiments on a real beamline dataset demonstrate high accuracy, real-time performance, and strong potential for autonomous synchrotron beamline operations.
zh

[CV-134] SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

【速读】:该论文旨在解决细粒度视频描述生成中现有方法难以捕捉细微视频动态和丰富细节信息的问题。其解决方案的关键在于引入偏好学习以增强视觉-语言模型在该任务中的表现,同时缓解直接偏好优化(DPO)的固有局限性。具体而言,提出了用于构建偏好对的流水线以及一种新型优化方法——协同偏好优化(SynPO),该方法通过防止负面偏好主导优化、显式保留模型的语言能力并提升训练效率,实现了对DPO及其变体的显著改进。

链接: https://arxiv.org/abs/2506.00835
作者: Jisheng Dang,Yizhou Zhang,Hao Ye,Teng Wang,Siming Chen,Huicheng Zheng,Yulan Guo,Jianhuang Lai,Bin Hu
机构: Sun Yat-Sen University (中山大学); Lanzhou University (兰州大学); The University of Hong Kong (香港大学); Beijing Institute of Technology (北京理工大学); National University of Singapore (新加坡国立大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model’s language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving 20% improvement in training efficiency. Code is available at this https URL
zh

[CV-135] SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

【速读】:该论文旨在解决多模态输入(包括文本、图像和视频)引导下的音频条件化说话肖像的生成与编辑问题,这一领域仍处于探索阶段。其解决方案的关键在于提出了一种统一框架SkyReels-Audio,该框架基于预训练的视频扩散变换器,支持无限长度的生成与编辑,并通过多模态输入实现多样且可控的条件化。此外,采用混合课程学习策略逐步对齐音频与面部动作,引入面部掩码损失和音频引导的无分类器指导机制以增强局部面部一致性,并利用滑动窗口去噪方法融合时序片段的潜在表示,从而确保长时间段和多样化身份下的视觉保真度与时间一致性。

链接: https://arxiv.org/abs/2506.00830
作者: Zhengcong Fei,Hao Jiang,Di Qiu,Baoxuan Gu,Youqiang Zhang,Jiahua Wang,Jialin Bai,Debang Li,Mingyuan Fan,Guibin Chen,Yahui Zhou
机构: Kunlun Inc. (昆仑公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remains under explored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
zh

[CV-136] Improving Keystep Recognition in Ego-Video via Dexterous Focus

【速读】:该论文试图解决从第一视角(egocentric)视频中理解人类活动的挑战,传统活动识别技术在处理这类视频时面临独特难题,因为许多活动中头部的动态性较高。解决方案的关键在于提出一种框架,通过将第一视角视频输入限制为稳定且以手部为中心的视频,从而独立于网络架构有效地应对这些挑战。实验表明,这种简单的视频转换方法在Ego-Exo4D细粒度关键步骤识别基准上优于现有的第一视角视频基线,而无需修改底层模型结构。

链接: https://arxiv.org/abs/2506.00827
作者: Zachary Chavis,Stephen J. Guy,Hyun Soo Park
机构: University of Minnesota (明尼苏达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we address the challenge of understanding human activities from an egocentric perspective. Traditional activity recognition techniques face unique challenges in egocentric videos due to the highly dynamic nature of the head during many activities. We propose a framework that seeks to address these challenges in a way that is independent of network architecture by restricting the ego-video input to a stabilized, hand-focused video. We demonstrate that this straightforward video transformation alone outperforms existing egocentric video baselines on the Ego-Exo4D Fine-Grained Keystep Recognition benchmark without requiring any alteration of the underlying model infrastructure.
zh

[CV-137] QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration

【速读】:该论文旨在解决扩散模型在人脸修复任务中因计算量大而难以部署到移动设备的问题。其关键解决方案是提出一种新颖的低比特量化方法QuantFace,将全精度(32位)权重和激活值量化至4~6位,通过分析激活数据分布并采用旋转-缩放通道平衡策略以保留原始数据信息,同时引入量化-蒸馏低秩适应(QD-LoRA)联合优化量化与蒸馏性能,并设计自适应比特宽度分配策略,将其建模为整数规划问题以平衡量化误差与感知指标,从而实现高效的模型压缩与性能保持。

链接: https://arxiv.org/abs/2506.00820
作者: Jiatong Li,Libo Zhu,Haotong Qin,Jingkai Wang,Linghe Kong,Guihai Chen,Yulun Zhang,Xiaokang Yang
机构: Shanghai Jiao Tong University (上海交通大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. In this work, we propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models, where the full-precision (\ie, 32-bit) weights and activations are quantized to 4 \sim 6-bit. We first analyze the data distribution within activations and find that they are highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA) that jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem, which combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at this https URL.
zh

[CV-138] L3A: Label-Augmented Analytic Adaptation for Multi-Label Class Incremental Learning ICML2025

【速读】:该论文旨在解决多标签类增量学习(Multi-label Class-Incremental Learning, MLCIL)中的两个主要挑战:标签缺失(label absence)和类别不平衡(class imbalance)。标签缺失导致历史信息不完整,而类别不平衡会使模型偏向多数类。论文提出的解决方案是Label-Augmented Analytic Adaptation (L3A),其关键在于集成两个核心模块:伪标签(pseudo-label, PL)模块通过为当前阶段的样本生成伪标签来解决标签缺失问题;加权分析分类器(weighted analytic classifier, WAC)通过引入样本特定权重,自适应地平衡类别贡献并缓解类别不平衡问题。

链接: https://arxiv.org/abs/2506.00816
作者: Xiang Zhang,Run He,Jiao Chen,Di Fang,Ming Li,Ziqian Zeng,Cen Chen,Huiping Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML2025

点击查看摘要

Abstract:Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which results in the model bias toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach without storing past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks. Our code is available at this https URL.
zh

[CV-139] IME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

【速读】:该论文旨在解决表格-图像多模态学习中的两个关键问题:(1)缺乏像视觉和语言领域那样标准化的预训练表格数据表示;(2)处理表格模态中常见的缺失值问题,这在真实世界的医疗数据集中尤为普遍。解决方案的关键在于提出一种新的多模态框架——TabPFN-Integrated Multimodal Engine (TIME),该框架利用最近提出的表格基础模型TabPFN作为冻结的表格编码器,生成对缺失数据具有天然鲁棒性的强大嵌入,并将其与预训练视觉主干网络提取的图像特征进行融合。

链接: https://arxiv.org/abs/2506.00813
作者: Jiaqi Luo,Yuan Yuan,Shixin Xu
机构: Duke Kunshan University (杜克昆山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Tabular-image multimodal learning, which integrates structured tabular data with imaging data, holds great promise for a variety of tasks, especially in medical applications. Yet, two key challenges remain: (1) the lack of a standardized, pretrained representation for tabular data, as is commonly available in vision and language domains; and (2) the difficulty of handling missing values in the tabular modality, which are common in real-world medical datasets. To address these issues, we propose the TabPFN-Integrated Multimodal Engine (TIME), a novel multimodal framework that builds on the recently introduced tabular foundation model, TabPFN. TIME leverages TabPFN as a frozen tabular encoder to generate robust, strong embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones. We explore a range of fusion strategies and tabular encoders, and evaluate our approach on both natural and medical datasets. Extensive experiments demonstrate that TIME consistently outperforms competitive baselines across both complete and incomplete tabular inputs, underscoring its practical value in real-world multimodal learning scenarios.
zh

[CV-140] Aiding Medical Diagnosis through Image Synthesis and Classification

【速读】:该论文试图解决医疗影像资源在多样性与可及性方面的不足,以支持更广泛和有效的临床学习。其解决方案的关键在于构建一个能够从文本描述生成真实医学图像并利用分类模型验证准确性的系统。该系统基于预训练的Stable Diffusion模型,通过低秩适应(LoRA)在PathMNIST数据集上进行微调,并采用领域特定提示引导生成有意义特征;同时,使用ResNet-18分类模型确保生成图像的质量,通过迭代过程筛选出正确分类的图像,从而实现高精度的医学图像合成。

链接: https://arxiv.org/abs/2506.00786
作者: Kanishk Choudhary
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures. Under review

点击查看摘要

Abstract:Medical professionals, especially those in training, often depend on visual reference materials to support an accurate diagnosis and develop pattern recognition skills. However, existing resources may lack the diversity and accessibility needed for broad and effective clinical learning. This paper presents a system designed to generate realistic medical images from textual descriptions and validate their accuracy through a classification model. A pretrained stable diffusion model was fine-tuned using Low-Rank Adaptation (LoRA) on the PathMNIST dataset, consisting of nine colorectal histopathology tissue types. The generative model was trained multiple times using different training parameter configurations, guided by domain-specific prompts to capture meaningful features. To ensure quality control, a ResNet-18 classification model was trained on the same dataset, achieving 99.76% accuracy in detecting the correct label of a colorectal histopathological medical image. Generated images were then filtered using the trained classifier and an iterative process, where inaccurate outputs were discarded and regenerated until they were correctly classified. The highest performing version of the generative model from experimentation achieved an F1 score of 0.6727, with precision and recall scores of 0.6817 and 0.7111, respectively. Some types of tissue, such as adipose tissue and lymphocytes, reached perfect classification scores, while others proved more challenging due to structural complexity. The self-validating approach created demonstrates a reliable method for synthesizing domain-specific medical images because of high accuracy in both the generation and classification portions of the system, with potential applications in both diagnostic support and clinical education. Future work includes improving prompt-specific accuracy and extending the system to other areas of medical imaging.
zh

[CV-141] GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在逐步地理推理任务中的评估问题,特别是针对视觉、空间、文化及精确地理位置推理的复杂性。解决方案的关键在于构建GeoChain,这是一个包含1.46百万张Mapillary街景图像的大规模基准数据集,每张图像配有一个21步的思维链(Chain-of-Thought, CoT)问题序列,涵盖超过3000万对问答对,并附带语义分割信息和视觉可定位性评分,以全面评估模型的地理推理能力。

链接: https://arxiv.org/abs/2506.00785
作者: Sahiti Yerramilli,Nilay Pande,Rynaa Grover,Jayant Sravan Tamarapalli
机构: Google(谷歌); Waymo(威马)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million QA pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.
zh

[CV-142] Depth-Aware Scoring and Hierarchical Alignment for Multiple Object Tracking ICIP2025

【速读】:该论文旨在解决当前基于运动的多目标跟踪(MOT)方法在遮挡或视觉相似物体场景下性能不佳的问题,这些问题主要源于现有方法过度依赖交并比(IoU)进行目标关联。论文提出的解决方案关键在于引入一种深度感知框架,通过零样本方法估计深度,并将其作为独立特征融入关联过程;同时,提出了一种分层对齐分数,通过结合粗粒度边界框重叠与细粒度(像素级)对齐来优化关联精度,而无需额外可学习参数。这是首个在关联步骤中将3D特征(单目深度)作为独立决策矩阵的MOT框架。

链接: https://arxiv.org/abs/2506.00774
作者: Milad Khanchi,Maria Amer,Charalambos Poullis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICIP 2025

点击查看摘要

Abstract:Current motion-based multiple object tracking (MOT) approaches rely heavily on Intersection-over-Union (IoU) for object association. Without using 3D features, they are ineffective in scenarios with occlusions or visually similar objects. To address this, our paper presents a novel depth-aware framework for MOT. We estimate depth using a zero-shot approach and incorporate it as an independent feature in the association process. Additionally, we introduce a Hierarchical Alignment Score that refines IoU by integrating both coarse bounding box overlap and fine-grained (pixel-level) alignment to improve association accuracy without requiring additional learnable parameters. To our knowledge, this is the first MOT framework to incorporate 3D features (monocular depth) as an independent decision matrix in the association step. Our framework achieves state-of-the-art results on challenging benchmarks without any training nor fine-tuning. The code is available at this https URL
zh

[CV-143] EcoLens: Leverag ing Multi-Objective Bayesian Optimization for Energy-Efficient Video Processing on Edge Devices

【速读】:该论文试图解决在资源受限环境中实时视频分析时,如何平衡能量消耗与视频语义的问题。其解决方案的关键在于提出一种动态优化处理配置的系统,通过多目标贝叶斯优化方法,在边缘设备上最小化能量使用的同时保持深度学习推理所需的视频特征。该系统首先通过离线分析不同配置(包括设备CPU频率、帧过滤特征、差异阈值和视频比特率)对能耗和推理准确性的影响力,建立先验知识,随后在在线阶段根据实时需求智能调整配置,以实现目标推理精度下的最低能耗。

链接: https://arxiv.org/abs/2506.00754
作者: Benjamin Civjan,Bo Chen,Ruixiao Zhang,Klara Nahrstedt
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video processing for real-time analytics in resource-constrained environments presents a significant challenge in balancing energy consumption and video semantics. This paper addresses the problem of energy-efficient video processing by proposing a system that dynamically optimizes processing configurations to minimize energy usage on the edge, while preserving essential video features for deep learning inference. We first gather an extensive offline profile of various configurations consisting of device CPU frequencies, frame filtering features, difference thresholds, and video bitrates, to establish apriori knowledge of their impact on energy consumption and inference accuracy. Leveraging this insight, we introduce an online system that employs multi-objective Bayesian optimization to intelligently explore and adapt configurations in real time. Our approach continuously refines processing settings to meet a target inference accuracy with minimal edge device energy expenditure. Experimental results demonstrate the system’s effectiveness in reducing video processing energy use while maintaining high analytical performance, offering a practical solution for smart devices and edge computing applications.
zh

[CV-144] ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary CVPR

【速读】:该论文旨在解决传统3D场景设计过程中对艺术专业知识和复杂软件技能的高要求问题,以及现有文本到3D生成方法因高质量3D数据稀缺而导致的性能受限问题。其解决方案的关键在于利用生成的2D图像作为中介,引导3D合成过程,从而结合自由形式文本到图像生成的灵活性与2D中介布局的多样性和可靠性,提出了一个无需训练的自动化场景设计流程ArtiScene。

链接: https://arxiv.org/abs/2506.00742
作者: Zeqi Gu,Yin Cui,Zhaoshuo Li,Fangyin Wei,Yunhao Ge,Jinwei Gu,Ming-Yu Liu,Abe Davis,Yifan Ding
机构: NVIDIA; Cornell University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR

点击查看摘要

Abstract:Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality by quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: this https URL Comments: Accepted by CVPR Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.00742 [cs.CV] (or arXiv:2506.00742v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.00742 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-145] Involution-Infused DenseNet with Two-Step Compression for Resource-Efficient Plant Disease Classification

【速读】:该论文旨在解决作物病害检测中深度学习模型计算需求高、难以在资源受限设备(如智能手机和边缘设备)上部署的问题。其关键解决方案是采用两阶段模型压缩方法,结合权重剪枝(Weight Pruning)与知识蒸馏(Knowledge Distillation),同时将DenseNet与逆卷积层(Involutional Layers)进行混合,以在保持高精度的同时降低模型复杂度,从而实现高效、实时的病害识别与农业管理。

链接: https://arxiv.org/abs/2506.00735
作者: T. Ahmed,S. Jannat,Md. F. Islam,J. Noor
机构: BRAC University (BRAC大学); United International University (联合国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Agriculture is vital for global food security, but crops are vulnerable to diseases that impact yield and quality. While Convolutional Neural Networks (CNNs) accurately classify plant diseases using leaf images, their high computational demands hinder their deployment in resource-constrained settings such as smartphones, edge devices, and real-time monitoring systems. This study proposes a two-step model compression approach integrating Weight Pruning and Knowledge Distillation, along with the hybridization of DenseNet with Involutional Layers. Pruning reduces model size and computational load, while distillation improves the smaller student models performance by transferring knowledge from a larger teacher network. The hybridization enhances the models ability to capture spatial features efficiently. These compressed models are suitable for real-time applications, promoting precision agriculture through rapid disease identification and crop management. The results demonstrate ResNet50s superior performance post-compression, achieving 99.55% and 98.99% accuracy on the PlantVillage and PaddyLeaf datasets, respectively. The DenseNet-based model, optimized for efficiency, recorded 99.21% and 93.96% accuracy with a minimal parameter count. Furthermore, the hybrid model achieved 98.87% and 97.10% accuracy, supporting the practical deployment of energy-efficient devices for timely disease intervention and sustainable farming practices.
zh

[CV-146] Adaptive Plane Reformatting for 4D Flow MRI using Deep Reinforcement Learning

【速读】:该论文试图解决在医学影像中,尤其是4D流MRI中,平面重新定位任务中传统深度强化学习(DRL)方法对测试数据集与训练数据集在位置和方向上一致性要求过高的问题。解决方案的关键在于引入一种基于当前状态的灵活坐标系统,使代理能够在任意位置或方向的体积中进行导航,从而提升方法的适应性和泛化能力。该方法采用异步优势演员-评论家(A3C)算法,相较于深度Q网络(DQN)表现出更优的性能,并在实验中验证了其在平面重新定位角度误差和距离误差上的改进以及与专家标注结果的统计等效性。

链接: https://arxiv.org/abs/2506.00727
作者: Javier Bisbal,Julio Sotelo,Maria I Valdés,Pablo Irarrazaval,Marcelo E Andia,Julio García,José Rodriguez-Palomarez,Francesca Raimondi,Cristián Tejos,Sergio Uribe
机构: Pontificia Universidad Católica de Chile (智利天主教大学); Department of Electrical Engineering (电气工程系); Departamento de Informática (计算机科学系); Institute for Biological and Medical Engineering (生物医学工程研究所); Department of Radiology (放射科); department of Medical Imaging and Radiation Sciences (医学影像与放射科学系); Stephenson Cardiac Imaging Centre (斯蒂芬森心脏成像中心); Department of Cardiology (心血管科); Vall d’Hebron Hospital Universitari (巴尔瓦赫隆大学医院); Vall d’Hebron Institut de Recerca (巴尔瓦赫隆医学研究机构); Department of Medicine (医学系); CIBER de Enfermedades Cardiovasculares (心血管疾病西班牙研究中心); Papa Giovanni XXIII Hospital (若望二十三世医院); Hopital Necker Enfants Malades (尼科尔儿童医院); Millennium Institute for Intelligent Healthcare Engineering, iHEALTH (智能医疗工程千年研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, submitted to IEEE Transactions on Medical Imaging

点击查看摘要

Abstract:Deep reinforcement learning (DRL) algorithms have shown robust results in plane reformatting tasks. In these methods, an agent sequentially adjusts the position and orientation of an initial plane towards an objective location. This process allows accurate plane reformatting, without the need for detailed landmarks, which makes it suitable for images with limited contrast and resolution, such as 4D flow MRI. However, current DRL methods require the test dataset to be in the same position and orientation as the training dataset. In this paper, we present a novel technique that utilizes a flexible coordinate system based on the current state, enabling navigation in volumes at any position or orientation. We adopted the Asynchronous Advantage Actor Critic (A3C) algorithm for reinforcement learning, outperforming Deep Q Network (DQN). Experimental results in 4D flow MRI demonstrate improved accuracy in plane reformatting angular and distance errors (6.32 ± 4.15 ° and 3.40 ± 2.75 mm), as well as statistically equivalent flow measurements determined by a plane reformatting process done by an expert (p=0.21). The method’s flexibility and adaptability make it a promising candidate for other medical imaging applications beyond 4D flow MRI.
zh

[CV-147] Common Inpainted Objects In-N-Out of Context

【速读】:该论文试图解决现有视觉数据集中缺乏上下文外示例(out-of-context examples)的问题,从而限制了模型对场景上下文理解的能力。其解决方案的关键在于通过基于扩散的修复技术系统性地替换COCO图像中的物体,生成97,722张具有语义一致性和不一致性场景的独特图像,并利用多模态大语言模型对修复后的物体进行精确验证和分类,以确保数据集的高质量与多样性。

链接: https://arxiv.org/abs/2506.00721
作者: Tianze Yang,Tyson Jordan,Ninghao Liu,Jin Sun
机构: University of Georgia (佐治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through a multimodal large language model assessment. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) training context classifiers that effectively determine whether existing objects belong in their context; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics. Our code and data are at: this https URL.
zh

[CV-148] From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models

【速读】:该论文试图解决现代视觉模型是否能够表现出类似人类视觉的格式塔(Gestalt)组织行为,以及在何种训练条件下这些行为会显现的问题。其解决方案的关键在于通过引入Distorted Spatial Relationship Testbench (DiSRT) 来评估模型对全局空间扰动的敏感性,并发现自监督学习(如MAE、CLIP)能够促进模型产生与格式塔规律一致的激活模式,同时揭示了分类微调会削弱这种能力,而基于Top-K激活稀疏性的机制可恢复全局感知敏感性。

链接: https://arxiv.org/abs/2506.00718
作者: Tianqin Li,Ziqi Wen,Leiran Song,Jun Liu,Zhi Jing,Tai Sing Lee
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human vision organizes local cues into coherent global forms using Gestalt principles like closure, proximity, and figure-ground assignment – functions reliant on global spatial structure. We investigate whether modern vision models show similar behaviors, and under what training conditions these emerge. We find that Vision Transformers (ViTs) trained with Masked Autoencoding (MAE) exhibit activation patterns consistent with Gestalt laws, including illusory contour completion, convexity preference, and dynamic figure-ground segregation. To probe the computational basis, we hypothesize that modeling global dependencies is necessary for Gestalt-like organization. We introduce the Distorted Spatial Relationship Testbench (DiSRT), which evaluates sensitivity to global spatial perturbations while preserving local textures. Using DiSRT, we show that self-supervised models (e.g., MAE, CLIP) outperform supervised baselines and sometimes even exceed human performance. ConvNeXt models trained with MAE also exhibit Gestalt-compatible representations, suggesting such sensitivity can arise without attention architectures. However, classification finetuning degrades this ability. Inspired by biological vision, we show that a Top-K activation sparsity mechanism can restore global sensitivity. Our findings identify training conditions that promote or suppress Gestalt-like perception and establish DiSRT as a diagnostic for global structure sensitivity across models.
zh

[CV-149] Vid2Coach: Transforming How-To Videos into Task Assistants

【速读】:该论文旨在解决盲人及低视力(BLV)人群在使用视觉导向的教程视频时所面临的障碍,因为这些视频依赖于视觉比较,难以被BLV人群有效理解。解决方案的关键在于提出Vid2Coach系统,该系统通过将教程视频转换为可穿戴摄像头辅助工具,提供可访问的指令和混合主动性反馈。Vid2Coach的核心技术包括从视频中生成带有步骤演示细节和完成标准的可访问指令,以及利用检索增强生成技术从BLV专用资源中提取非视觉操作方法,并通过嵌入智能眼镜的摄像头监控用户进度,提供情境感知的指导、主动反馈和用户问题解答。

链接: https://arxiv.org/abs/2506.00717
作者: Mina Huh,Zihui Xue,Ujjaini Das,Kumar Ashutosh,Kristen Grauman,Amy Pavel
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds, and progress feedback. We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants that provide accessible instructions and mixed-initiative feedback. From the video, Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented-generation to extract relevant non-visual workarounds from BLV-specific resources. Vid2Coach then monitors user progress with a camera embedded in commercial smart glasses to provide context-aware instructions, proactive feedback, and answers to user questions. BLV participants (N=8) using Vid2Coach completed cooking tasks with 58.5% fewer errors than when using their typical workflow and wanted to use Vid2Coach in their daily lives. Vid2Coach demonstrates an opportunity for AI visual assistance that strengthens rather than replaces non-visual expertise.
zh

[CV-150] Fovea Stacking: Imaging with Dynamic Localized Aberration Correction

【速读】:该论文旨在解决小型化相机中由于光学系统简化导致的严重像差问题,尤其是在离轴区域,这些像差难以仅通过软件校正。其解决方案的关键在于引入一种名为Fovea Stacking的新成像系统,该系统利用可变形相位板(Deformable Phase Plates, DPPs)实现图像传感器上任意位置的局部像差校正。通过基于可微光学模型优化DPP的形变,能够在视点处生成清晰的图像,类似于人眼的黄斑区,进而通过叠加多个不同视点的校正图像,获得无像差的合成图像。

链接: https://arxiv.org/abs/2506.00716
作者: Shi Mao,Yogeshwar Mishra,Wolfgang Heidrich
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The desire for cameras with smaller form factors has recently lead to a push for exploring computational imaging systems with reduced optical complexity such as a smaller number of lens elements. Unfortunately such simplified optical systems usually suffer from severe aberrations, especially in off-axis regions, which can be difficult to correct purely in software. In this paper we introduce Fovea Stacking, a new type of imaging system that utilizes emerging dynamic optical components called deformable phase plates (DPPs) for localized aberration correction anywhere on the image sensor. By optimizing DPP deformations through a differentiable optical model, off-axis aberrations are corrected locally, producing a foveated image with enhanced sharpness at the fixation point - analogous to the eye’s fovea. Stacking multiple such foveated images, each with a different fixation point, yields a composite image free from aberrations. To efficiently cover the entire field of view, we propose joint optimization of DPP deformations under imaging budget constraints. Due to the DPP device’s non-linear behavior, we introduce a neural network-based control model for improved alignment between simulation-hardware performance. We further demonstrated that for extended depth-of-field imaging, fovea stacking outperforms traditional focus stacking in image quality. By integrating object detection or eye-tracking, the system can dynamically adjust the lens to track the object of interest-enabling real-time foveated video suitable for downstream applications such as surveillance or foveated virtual reality displays. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2506.00716 [cs.CV] (or arXiv:2506.00716v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.00716 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-151] QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

【速读】:该论文旨在解决临床决策中对异构数据进行推理的问题,现有多模态语言模型(Multimodal Language Models, MMLMs)主要以视觉为中心,难以在不同临床专科之间泛化。其解决方案的关键在于引入QoQ-Med-7B/32B,这是首个开放的通用临床基础模型,能够联合推理医学图像、时间序列信号和文本报告。该模型采用领域感知相对策略优化(Domain-aware Relative Policy Optimization, DRPO),通过根据领域稀有性和模态难度分层调整归一化奖励,缓解因临床数据分布偏差导致的性能不平衡问题。

链接: https://arxiv.org/abs/2506.00711
作者: Wei Dai,Peilin Chen,Chanakya Ekbote,Paul Pu Liang
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at this https URL.
zh

[CV-152] Concept-Centric Token Interpretation for Vector-Quantized Generative Models

【速读】:该论文试图解决向量量化生成模型(Vector-Quantized Generative Models, VQGMs)中代码本(codebook)的离散标记(discrete tokens)在图像生成过程中的解释性问题,即哪些标记对于生成特定概念的图像是关键的。解决方案的关键在于提出一种名为CORTEX的新方法,通过识别与概念相关的标记组合来解释VQGMs,其框架包含两种方法:一种是针对单个图像的样本级解释方法,用于分析标记的重要性得分;另一种是针对整个代码本的代码本级解释方法,用于发现全局相关的标记。

链接: https://arxiv.org/abs/2506.00698
作者: Tianze Yang,Yucheng Shi,Mengnan Du,Xuansheng Wu,Qiaoyu Tan,Jin Sun,Ninghao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs – the codebook of discrete tokens – is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX’s efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGMs transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at this https URL.
zh

[CV-153] CineMA: A Foundation Model for Cine Cardiac MRI

【速读】:该论文旨在解决心脏磁共振(Cardiac Magnetic Resonance, CMR)中临床重要测量指标(如射血分数)提取过程耗时且主观的问题。其解决方案的关键在于开发了一个名为CineMA的基础AI模型,该模型通过自监督的自动编码器架构,在有限标注数据的情况下实现自动化任务处理。CineMA在74,916例 cine CMR研究数据上进行预训练,随后在多个数据集和任务中进行微调,表现出优于或相当于卷积神经网络(Convolutional Neural Networks, CNNs)的性能,并展现出更高的标签效率。

链接: https://arxiv.org/abs/2506.00679
作者: Yunguan Fu,Weixi Yi,Charlotte Manisty,Anish N Bhuva,Thomas A Treibel,James C Moon,Matthew J Clarkson,Rhodri Huw Davies,Yipeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cardiac magnetic resonance (CMR) is a key investigation in clinical cardiovascular medicine and has been used extensively in population research. However, extracting clinically important measurements such as ejection fraction for diagnosing cardiovascular diseases remains time-consuming and subjective. We developed CineMA, a foundation AI model automating these tasks with limited labels. CineMA is a self-supervised autoencoder model trained on 74,916 cine CMR studies to reconstruct images from masked inputs. After fine-tuning, it was evaluated across eight datasets on 23 tasks from four categories: ventricle and myocardium segmentation, left and right ventricle ejection fraction calculation, disease detection and classification, and landmark localisation. CineMA is the first foundation model for cine CMR to match or outperform convolutional neural networks (CNNs). CineMA demonstrated greater label efficiency than CNNs, achieving comparable or better performance with fewer annotations. This reduces the burden of clinician labelling and supports replacing task-specific training with fine-tuning foundation models in future cardiac imaging applications. Models and code for pre-training and fine-tuning are available at this https URL, democratising access to high-performance models that otherwise require substantial computational resources, promoting reproducibility and accelerating clinical translation.
zh

[CV-154] Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis

【速读】:该论文试图解决视频理解流水线中场景分割和关键帧提取的泛化性不足问题,现有方法在不同类型的视频(如短格式媒体、长片、档案内容和监控录像)中表现不稳定。其解决方案的关键在于提出一个统一且自适应的框架,根据视频长度动态选择分割策略:针对短视频采用自适应阈值法,中等长度视频采用混合策略,长视频则采用基于时间间隔的分割方法。此外,关键帧选择通过轻量级模块实现,利用清晰度、亮度和时间分布的综合指标进行评分,避免了复杂的显著性模型,同时保证视觉相关性。

链接: https://arxiv.org/abs/2506.00667
作者: Vasilii Korolkov
机构: Binat, Inc. (Binat, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 24 pages, 8 figures, submitted as a preprint. ArXiv preprint only, not submitted to a journal yet

点击查看摘要

Abstract:Robust scene segmentation and keyframe extraction are essential preprocessing steps in video understanding pipelines, supporting tasks such as indexing, summarization, and semantic retrieval. However, existing methods often lack generalizability across diverse video types and durations. We present a unified, adaptive framework for automatic scene detection and keyframe selection that handles formats ranging from short-form media to long-form films, archival content, and surveillance footage. Our system dynamically selects segmentation policies based on video length: adaptive thresholding for short videos, hybrid strategies for mid-length ones, and interval-based splitting for extended recordings. This ensures consistent granularity and efficient processing across domains. For keyframe selection, we employ a lightweight module that scores sampled frames using a composite metric of sharpness, luminance, and temporal spread, avoiding complex saliency models while ensuring visual relevance. Designed for high-throughput workflows, the system is deployed in a commercial video analysis platform and has processed content from media, education, research, and security domains. It offers a scalable and interpretable solution suitable for downstream applications such as UI previews, embedding pipelines, and content filtering. We discuss practical implementation details and outline future enhancements, including audio-aware segmentation and reinforcement-learned frame scoring.
zh

[CV-155] Poster: Adapting Pretrained Vision Transformers with LoRA Against Attack Vectors

【速读】:该论文试图解决图像分类器(如用于自动驾驶车辆导航的分类器)在面对对抗性攻击时易受攻击的问题,这类攻击通过微小的输入图像扰动导致恶意误分类。解决方案的关键在于通过对预训练的视觉变压器(Vision Transformer)进行低秩适应(low-rank adaptation),调整其权重和类别,从而提高模型对对抗性攻击的鲁棒性,并实现无需重新训练的可扩展微调。

链接: https://arxiv.org/abs/2506.00661
作者: Richard E. Neddo,Sean Willis,Zander Blasingame,Chen Liu
机构: Clarkson University (克拉克森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at IEEE MOST 2025

点击查看摘要

Abstract:Image classifiers, such as those used for autonomous vehicle navigation, are largely known to be susceptible to adversarial attacks that target the input image set. There is extensive discussion on adversarial attacks including perturbations that alter the input images to cause malicious misclassifications without perceivable modification. This work proposes a countermeasure for such attacks by adjusting the weights and classes of pretrained vision transformers with a low-rank adaptation to become more robust against adversarial attacks and allow for scalable fine-tuning without retraining.
zh

[CV-156] Video Signature: In-generation Watermarking for Latent Video Diffusion Models

【速读】:该论文旨在解决生成式视频内容(AIGC)在知识产权保护和内容溯源方面存在的问题,特别是现有水印方法在视频生成中存在计算开销大、难以平衡视频质量与水印提取效果的缺陷。其解决方案的关键在于提出一种基于潜在视频扩散模型的在生成过程中嵌入水印的方法——Video Signature (VIDSIG),通过部分微调潜在解码器,并引入感知敏感层抑制机制(PAS)以保持视觉质量,同时结合轻量级时间对齐模块提升帧序列的时间一致性,从而实现隐式且自适应的水印集成。

链接: https://arxiv.org/abs/2506.00652
作者: Yu Huang,Junhao Chen,Qi Zheng,Hanqian Li,Shuliang Liu,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.
zh

[CV-157] xt-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

【速读】:该论文旨在解决将文本到图像生成技术从二维医学影像扩展到三维计算机断层扫描(CT)影像的挑战,这一过程面临高维度、解剖复杂性以及缺乏有效的视觉-语言对齐框架等问题。其解决方案的关键在于提出了一种结合潜在扩散模型与三维对比视觉-语言预训练机制的新架构,通过双编码器CLIP风格模型在配对的CT体积和放射科报告上进行训练,建立共享嵌入空间作为生成条件输入,并利用预训练的三维变分自编码器(VAE)将CT体积压缩到低维潜在空间,从而实现高效的3D去噪扩散过程。

链接: https://arxiv.org/abs/2506.00633
作者: Daniele Molino,Camillo Maria Caruso,Filippo Ruffini,Paolo Soda,Valerio Guarrasi
机构: Unicampus(unicampus); Umeå University (Umeå University)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.
zh

[CV-158] Long-Tailed Visual Recognition via Permutation-Invariant Head-to-Tail Feature Fusion

【速读】:该论文旨在解决长尾数据分布不平衡对深度学习模型的影响,这种不平衡导致模型过度关注头部类别而忽视尾部类别,进而降低识别准确率。其关键解决方案是提出一种称为“排列不变且从头到尾特征融合”(PI-H2T)的方法,该方法通过排列不变表示融合(PIF)增强表示空间,生成更紧密的特征和自动类间距,并通过从头到尾融合(H2TF)将语义信息从头部类别转移到尾部类别,从而调整偏倚分类器并提升尾部类别的多样性。

链接: https://arxiv.org/abs/2506.00625
作者: Mengke Li,Zhikai Hu,Yang Lu,Weichao Lan,Yiu-ming Cheung,Hui Huang
机构: Shenzhen University (深圳大学); Xiamen University (厦门大学); Hong Kong Baptist University (香港浸会大学); Inspur Smart City Technology Co., Ltd. (浪潮智慧城市科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The imbalanced distribution of long-tailed data presents a significant challenge for deep learning models, causing them to prioritize head classes while neglecting tail classes. Two key factors contributing to low recognition accuracy are the deformed representation space and a biased classifier, stemming from insufficient semantic information in tail classes. To address these issues, we propose permutation-invariant and head-to-tail feature fusion (PI-H2T), a highly adaptable method. PI-H2T enhances the representation space through permutation-invariant representation fusion (PIF), yielding more clustered features and automatic class margins. Additionally, it adjusts the biased classifier by transferring semantic information from head to tail classes via head-to-tail fusion (H2TF), improving tail class diversity. Theoretical analysis and experiments show that PI-H2T optimizes both the representation space and decision boundaries. Its plug-and-play design ensures seamless integration into existing methods, providing a straightforward path to further performance improvements. Extensive experiments on long-tailed benchmarks confirm the effectiveness of PI-H2T.
zh

[CV-159] Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models

【速读】:该论文试图解决在仅有少量参考图像的情况下,个性化扩散模型(diffusion models)难以平衡身份保真度与文本提示一致性的问题。现有方法如DreamBooth和Textual Inversion容易过拟合,导致生成图像与文本提示之间出现偏差。论文提出的解决方案关键在于一种并行重缩放技术,该技术将一致性引导信号分解为与无分类器引导(CFG)平行和正交的组件,并通过重缩放平行分量来最小化对CFG的干扰,同时保持主体的身份特征。此方法无需额外训练数据或昂贵的标注,显著提升了提示对齐度和视觉保真度。

链接: https://arxiv.org/abs/2506.00607
作者: JungWoo Chae,Jiyoon Kim,Sangheum Hwang
机构: Nexon Korea (耐克森韩国); LGCNS AI Research (LGCNS人工智能研究); Seoul National University of Science and Technology (首尔科学综合大学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personalizing diffusion models to specific users or concepts remains challenging, particularly when only a few reference images are available. Existing methods such as DreamBooth and Textual Inversion often overfit to limited data, causing misalignment between generated images and text prompts when attempting to balance identity fidelity with prompt adherence. While Direct Consistency Optimization (DCO) with its consistency-guided sampling partially alleviates this issue, it still struggles with complex or stylized prompts. In this paper, we propose a parallel rescaling technique for personalized diffusion models. Our approach explicitly decomposes the consistency guidance signal into parallel and orthogonal components relative to classifier free guidance (CFG). By rescaling the parallel component, we minimize disruptive interference with CFG while preserving the subject’s identity. Unlike prior personalization methods, our technique does not require additional training data or expensive annotations. Extensive experiments show improved prompt alignment and visual fidelity compared to baseline methods, even on challenging stylized prompts. These findings highlight the potential of parallel rescaled guidance to yield more stable and accurate personalization for diverse user inputs.
zh

[CV-160] ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education

【速读】:该论文试图解决现代医学教育中由于隐私问题和教育资源短缺导致的高质量教学材料获取困难的问题。其解决方案的关键在于利用生成式模型,特别是卷积神经网络(Convolutional Neural Networks, CNNs)和CycleGAN(Zhu et al., 2017),生成多样且可比的合成医学图像,从而在不泄露患者隐私的前提下支持医学教育。

链接: https://arxiv.org/abs/2506.00605
作者: Ruiming Min,Minghao Liu
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advancement of modern medicine and the development of technologies such as MRI, CT, and cellular analysis, it has become increasingly critical for clinicians to accurately interpret various diagnostic images. However, modern medical education often faces challenges due to limited access to high-quality teaching materials, stemming from privacy concerns and a shortage of educational resources (Balogh et al., 2015). In this context, image data generated by machine learning models, particularly generative models, presents a promising solution. These models can create diverse and comparable imaging datasets without compromising patient privacy, thereby supporting modern medical education. In this study, we explore the use of convolutional neural networks (CNNs) and CycleGAN (Zhu et al., 2017) for generating synthetic medical images. The source code is available at this https URL.
zh

[CV-161] SatDreamer360: Geometry Consistent Street-View Video Generation from Satellite Imagery

【速读】:该论文旨在解决从卫星图像生成连续、几何和时间上一致的地面视角视频的问题(ground-view video),现有方法通常仅能合成单帧地面视图,且依赖辅助输入如高度图或手工投影,难以保证时间一致性。其解决方案的关键在于提出SatDreamer360框架,该框架通过引入一种紧凑的三平面表示(tri-plane representation)直接编码场景几何信息,并结合基于光线的像素注意力机制实现跨视角对应,同时采用基于极线约束的时间注意力模块确保多帧一致性。

链接: https://arxiv.org/abs/2506.00600
作者: Xianghui Ze,Beiyi Zhu,Zhenbo Song,Jianfeng Lu,Yujiao Shi
机构: Nanjing University of Science and Technology (南京理工大学); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating continuous ground-level video from satellite imagery is a challenging task with significant potential for applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view images, often relying on auxiliary inputs like height maps or handcrafted projections, and fall short in producing temporally consistent sequences. In this paper, we propose SatDreamer360, a novel framework that generates geometrically and temporally consistent ground-view video from a single satellite image and a predefined trajectory. To bridge the large viewpoint gap, we introduce a compact tri-plane representation that encodes scene geometry directly from the satellite image. A ray-based pixel attention mechanism retrieves view-dependent features from the tri-plane, enabling accurate cross-view correspondence without requiring additional geometric priors. To ensure multi-frame consistency, we propose an epipolar-constrained temporal attention module that aligns features across frames using the known relative poses along the trajectory. To support evaluation, we introduce VIGOR++, a large-scale dataset for cross-view video generation, with dense trajectory annotations and high-quality ground-view sequences. Extensive experiments demonstrate that SatDreamer360 achieves superior performance in fidelity, coherence, and geometric alignment across diverse urban scenes.
zh

[CV-162] XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity

【速读】:该论文旨在解决工业场景中6D位姿估计的挑战性问题,特别是在真实工业环境中存在的复杂对象几何形状、反光材料、严重遮挡和密集杂乱等难题。其解决方案的关键在于构建了一个名为XYZ-IBD的数据集,该数据集通过高精度工业相机和商用相机采集了包含RGB、灰度和深度图像的多视角真实场景,并结合大规模模拟渲染数据,实现了毫米级精度的位姿标注,从而为工业机器人操作提供了更真实且具有挑战性的基准。

链接: https://arxiv.org/abs/2506.00599
作者: Junwen Huang,Jizhong Liang,Jiaqi Hu,Martin Sundermeyer,Peter KT Yu,Nassir Navab,Benjamin Busam
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Shanghai Jiaotong University (上海交通大学); XYZ Robotics (XYZ机器人公司); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets that primarily focus on household objects, which approach saturation,XYZ-IBD represents the unsolved realistic industrial conditions. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins with high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving millimeter-level pose labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation tasks on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research. The dataset and benchmark are publicly available at this https URL.
zh

[CV-163] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

【速读】:该论文旨在解决当前顶级文本到图像(text-to-image, T2I)模型在精确空间布局控制方面的不足,即难以准确生成具有指定属性和位置的实体。现有的分割掩码到图像(segmentation-mask-to-image, S2I)方法无法同时保证语义一致性和形状一致性。为了解决这些问题,作者提出了Seg2Any,其关键在于将分割掩码条件解耦为区域语义和高频形状组件,并通过语义对齐注意力掩码和实体轮廓图来分别引导生成过程中的语义和形状结构。此外,引入了属性隔离注意力掩码机制以防止多实体场景下的属性泄露,从而提升生成结果的准确性与一致性。

链接: https://arxiv.org/abs/2506.00596
作者: Danfeng li,Hui Zhang,Sheng Wang,Jiacheng Li,Zuxuan Wu
机构: Fudan University (复旦大学); HiThink Research (HiThink 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity’s image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
zh

[CV-164] MR2US-Pro: Prostate MR to Ultrasound Image Translation and Registration Based on Diffusion Models

【速读】:该论文旨在解决多模态医学影像(特别是磁共振成像MRI和经直肠超声TRUS)之间精确配准的难题,这一问题主要源于两者在维度和解剖表示上的差异。其解决方案的关键在于提出了一种两阶段框架:首先进行TRUS三维重建,随后实现跨模态配准。在TRUS三维重建阶段,该方法摒弃了依赖外部探头跟踪信息的传统方式,转而利用矢状位与横断位TRUS图像之间的自然相关性,并通过基于聚类的特征匹配方法实现无需额外跟踪信息的二维帧空间定位。在配准阶段,引入了一种由模态转换引导的无监督扩散框架,将MRI和US映射到一个伪中间模态,从而保留关键配准特征并简化配准过程,同时结合解剖感知的配准策略以提升内部结构的一致性。

链接: https://arxiv.org/abs/2506.00591
作者: Xudong Ma,Nantheera Anantrasirichai,Stefanos Bolomytis,Alin Achim
机构: University of Bristol (布里斯托大学); North Bristol NHS Trust (北布里斯托国家医疗服务体系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The diagnosis of prostate cancer increasingly depends on multimodal imaging, particularly magnetic resonance imaging (MRI) and transrectal ultrasound (TRUS). However, accurate registration between these modalities remains a fundamental challenge due to the differences in dimensionality and anatomical representations. In this work, we present a novel framework that addresses these challenges through a two-stage process: TRUS 3D reconstruction followed by cross-modal registration. Unlike existing TRUS 3D reconstruction methods that rely heavily on external probe tracking information, we propose a totally probe-location-independent approach that leverages the natural correlation between sagittal and transverse TRUS views. With the help of our clustering-based feature matching method, we enable the spatial localization of 2D frames without any additional probe tracking information. For the registration stage, we introduce an unsupervised diffusion-based framework guided by modality translation. Unlike existing methods that translate one modality into another, we map both MR and US into a pseudo intermediate modality. This design enables us to customize it to retain only registration-critical features, greatly easing registration. To further enhance anatomical alignment, we incorporate an anatomy-aware registration strategy that prioritizes internal structural coherence while adaptively reducing the influence of boundary inconsistencies. Extensive validation demonstrates that our approach outperforms state-of-the-art methods by achieving superior registration accuracy with physically realistic deformations in a completely unsupervised fashion.
zh

[CV-165] Event-based multi-view photogrammetry for high-dynamic high-velocity target measurement DATE

【速读】:该论文旨在解决高动态、高速目标运动机械性能表征中的测量难题,这些问题在武器系统验证和精密制造等领域至关重要。现有测量方法面临动态范围有限、观测不连续和成本高等挑战。论文提出的解决方案关键在于利用基于事件的多视角摄影测量系统,通过提取目标前缘特征以消除尾迹效应,并采用重投影误差关联事件与目标轨迹,从而获得比传统交点方法更多的数据。最终结合目标速度衰减模型进行数据拟合,实现高精度的运动测量。

链接: https://arxiv.org/abs/2506.00578
作者: Taihang Lei,Banglei Guan,Minzu Liang,Xiangyu Li,Jianbing Liu,Jing Tao,Yang Shang,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); The College of Aerospace Science and Engineering (航天科学与工程学院); The College of Science (理学院); The Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation (湖南省图像测量与视觉导航重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 9 figures, 1 table. This paper was accepted by Acta Mechanica Sinica (Date: this http URL 2025)

点击查看摘要

Abstract:The characterization of mechanical properties for high-dynamic, high-velocity target motion is essential in industries. It provides crucial data for validating weapon systems and precision manufacturing processes etc. However, existing measurement methods face challenges such as limited dynamic range, discontinuous observations, and high costs. This paper presents a new approach leveraging an event-based multi-view photogrammetric system, which aims to address the aforementioned challenges. First, the monotonicity in the spatiotemporal distribution of events is leveraged to extract the target’s leading-edge features, eliminating the tailing effect that complicates motion measurements. Then, reprojection error is used to associate events with the target’s trajectory, providing more data than traditional intersection methods. Finally, a target velocity decay model is employed to fit the data, enabling accurate motion measurements via ours multi-view data joint computation. In a light gas gun fragment test, the proposed method showed a measurement deviation of 4.47% compared to the electromagnetic speedometer.
zh

[CV-166] CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning

【速读】:该论文旨在解决传统深度学习方法在计算机辅助设计(CAD)流程中因标准三维重建管道导致的尺寸不精确和参数编辑性受限的问题,以及现有视觉-语言模型(VLMs)在监督微调(SFT)过程中易陷入模式记忆、泛化能力差的缺陷。其解决方案的关键在于提出CReFT-CAD,这是一种两阶段微调范式,首先通过课程驱动的强化学习阶段构建稳定的推理能力,随后通过监督后微调提升指令遵循与语义提取能力。

链接: https://arxiv.org/abs/2506.00568
作者: Ke Niu,Zhuofan Chen,Haiyang Yu,Yuwen Chen,Teng Fu,Mengyang Zhao,Bin Li,Xiangyang Xue
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches employ standard 3D reconstruction pipelines as an alternative, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers adopt vision-language models (VLMs), particularly supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.
zh

[CV-167] SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models

【速读】:该论文旨在解决序列化面部编辑中的编辑归属与检测鲁棒性问题,特别是在处理多步骤的语义属性修改(如发型、妆容或配饰)时所面临的挑战。其解决方案的关键在于构建一个大规模的、精细标注的序列化编辑人脸数据集SEED,该数据集包含超过90,000张经过一至四次连续属性修改的面部图像,并配备了详细的编辑序列、属性掩码和提示信息。此外,论文还提出了FAITH模型,一种基于频率感知的Transformer架构,通过引入高频线索以提升对细微序列变化的敏感性,从而增强对序列编辑的跟踪与分析能力。

链接: https://arxiv.org/abs/2506.00562
作者: Yule Zhu,Ping Liu,Zhedong Zheng,Wei Liu
机构: Huazhong University of Science and Technology (华中科技大学); University of Nevada Reno (内华达大学雷诺分校); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Diffusion models have recently enabled precise and photorealistic facial editing across a wide range of semantic attributes. Beyond single-step modifications, a growing class of applications now demands the ability to analyze and track sequences of progressive edits, such as stepwise changes to hair, makeup, or accessories. However, sequential editing introduces significant challenges in edit attribution and detection robustness, further complicated by the lack of large-scale, finely annotated benchmarks tailored explicitly for this task. We introduce SEED, a large-scale Sequentially Edited facE Dataset constructed via state-of-the-art diffusion models. SEED contains over 90,000 facial images with one to four sequential attribute modifications, generated using diverse diffusion-based editing pipelines (LEdits, SDXL, SD3). Each image is annotated with detailed edit sequences, attribute masks, and prompts, facilitating research on sequential edit tracking, visual provenance analysis, and manipulation robustness assessment. To benchmark this task, we propose FAITH, a frequency-aware transformer-based model that incorporates high-frequency cues to enhance sensitivity to subtle sequential changes. Comprehensive experiments, including systematic comparisons of multiple frequency-domain methods, demonstrate the effectiveness of FAITH and the unique challenges posed by SEED. SEED offers a challenging and flexible resource for studying progressive diffusion-based edits at scale. Dataset and code will be publicly released at: this https URL.
zh

[CV-168] Using Diffusion Ensembles to Estimate Uncertainty for End-to-End Autonomous Driving

【速读】:该论文旨在解决自动驾驶中端到端规划系统在处理不确定性时的不足,特别是在封闭环仿真环境如CARLA中,现有系统要么未将不确定性纳入规划本身,要么依赖于无法泛化的专用表示。其解决方案的关键在于提出EnDfuser,一个基于扩散模型(diffusion model)的端到端驾驶系统,通过将注意力池化与轨迹规划整合到单一的扩散Transformer模块中,有效利用复杂的感知信息(如融合的相机和LiDAR特征)。EnDfuser通过集成扩散生成多个候选轨迹(128条),提供对不确定、多模态未来轨迹空间的可解释性,从而提升驾驶决策的安全性。

链接: https://arxiv.org/abs/2506.00560
作者: Florian Wintel,Sigmund H. Høeg,Gabriel Kiss,Frank Lindseth
机构: Norwegian University of Science and Technology (挪威科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end planning systems for autonomous driving are improving rapidly, especially in closed-loop simulation environments like CARLA. Many such driving systems either do not consider uncertainty as part of the plan itself, or obtain it by using specialized representations that do not generalize. In this paper, we propose EnDfuser, an end-to-end driving system that uses a diffusion model as the trajectory planner. EnDfuser effectively leverages complex perception information like fused camera and LiDAR features, through combining attention pooling and trajectory planning into a single diffusion transformer module. Instead of committing to a single plan, EnDfuser produces a distribution of candidate trajectories (128 for our case) from a single perception frame through ensemble diffusion. By observing the full set of candidate trajectories, EnDfuser provides interpretability for uncertain, multi-modal future trajectory spaces, where there are multiple plausible options. EnDfuser achieves a competitive driving score of 70.1 on the Longest6 benchmark in CARLA with minimal concessions on inference speed. Our findings suggest that ensemble diffusion, used as a drop-in replacement for traditional point-estimate trajectory planning modules, can help improve the safety of driving decisions by modeling the uncertainty of the posterior trajectory distribution.
zh

[CV-169] ViVo: A Dataset for Volumetric VideoReconstruction and Compression

【速读】:该论文旨在解决现有体视频(volumetric video)数据集在语义和低级特征多样性方面的不足,以支持更有效的体视频重建与压缩模型的开发与验证。其解决方案的关键在于提出一个新的数据集ViVo,该数据集不仅忠实于真实世界的体视频生产流程,还首次将多样性定义扩展至包含以人为中心的特征(如皮肤、头发)以及动态视觉现象(如透明、反射、液体等),并提供了包括14组多视角RGB与深度视频对、同步的30FPS帧校准数据、音频信息及对应的2D前景掩码和3D点云在内的丰富原始数据。

链接: https://arxiv.org/abs/2506.00558
作者: Adrian Azzarelli,Ge Gao,Ho Man Kwan,Fan Zhang,Nantheera Anantrasirichai,Ollie Moolan-Feroze,David Bull
机构: University of Bristol (布里斯托大学); Condense Reality Ltd (Condense Reality Ltd)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As research on neural volumetric video reconstruction and compression flourishes, there is a need for diverse and realistic datasets, which can be used to develop and validate reconstruction and compression models. However, existing volumetric video datasets lack diverse content in terms of both semantic and low-level features that are commonly present in real-world production pipelines. In this context, we propose a new dataset, ViVo, for VolumetrIc VideO reconstruction and compression. The dataset is faithful to real-world volumetric video production and is the first dataset to extend the definition of diversity to include both human-centric characteristics (skin, hair, etc.) and dynamic visual phenomena (transparent, reflective, liquid, etc.). Each video sequence in this database contains raw data including fourteen multi-view RGB and depth video pairs, synchronized at 30FPS with per-frame calibration and audio data, and their associated 2-D foreground masks and 3-D point clouds. To demonstrate the use of this database, we have benchmarked three state-of-the-art (SotA) 3-D reconstruction methods and two volumetric video compression algorithms. The obtained results evidence the challenging nature of the proposed dataset and the limitations of existing datasets for both volumetric video reconstruction and compression tasks, highlighting the need to develop more effective algorithms for these applications. The database and the associated results are available at this https URL
zh

[CV-170] 3D Trajectory Reconstruction of Moving Points Based on Asynchronous Cameras

【速读】:该论文旨在解决异步相机下点目标三维轨迹重建的问题,该问题包含轨迹重建与相机时间同步两个耦合的子问题。传统方法通常仅针对其中一个子问题进行处理,而本文提出了一种同时解决这两个子问题的3D轨迹重建方法。其关键在于:首先,将轨迹交点法扩展至异步相机以克服传统三角测量对相机同步的依赖;其次,基于成像机制和目标动力学特性建立相机时间信息与目标运动模型,并同时优化参数以实现无需精确时间参数的轨迹重建;最后,通过引入更紧密和连续的运动点约束,优化相机旋转、时间信息及目标运动参数,从而显著提升重建精度,尤其是在相机旋转不准确的情况下。

链接: https://arxiv.org/abs/2506.00541
作者: Huayu Huang,Banglei Guan,Yang Shang,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation (湖南省图像测量与视觉导航重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by Acta Mechanica Sinica

点击查看摘要

Abstract:Photomechanics is a crucial branch of solid mechanics. The localization of point targets constitutes a fundamental problem in optical experimental mechanics, with extensive applications in various missions of UAVs. Localizing moving targets is crucial for analyzing their motion characteristics and dynamic properties. Reconstructing the trajectories of points from asynchronous cameras is a significant challenge. It encompasses two coupled sub-problems: trajectory reconstruction and camera synchronization. Present methods typically address only one of these sub-problems individually. This paper proposes a 3D trajectory reconstruction method for point targets based on asynchronous cameras, simultaneously solving both sub-problems. Firstly, we extend the trajectory intersection method to asynchronous cameras to resolve the limitation of traditional triangulation that requires camera synchronization. Secondly, we develop models for camera temporal information and target motion, based on imaging mechanisms and target dynamics characteristics. The parameters are optimized simultaneously to achieve trajectory reconstruction without accurate time parameters. Thirdly, we optimize the camera rotations alongside the camera time information and target motion parameters, using tighter and more continuous constraints on moving points. The reconstruction accuracy is significantly improved, especially when the camera rotations are inaccurate. Finally, the simulated and real-world experimental results demonstrate the feasibility and accuracy of the proposed method. The real-world results indicate that the proposed algorithm achieved a localization error of 112.95 m at an observation range of 15 ~ 20 km.
zh

[CV-171] SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

【速读】:该论文旨在解决传统分布匹配蒸馏(DMD)在大规模基于流的文本到图像模型(如SD 3.5和FLUX)中出现的收敛困难问题。其关键解决方案是提出隐式分布对齐(IDA),通过正则化生成器与假分布之间的距离来提升模型的可扩展性;同时引入段内指导(ISG),将时间步重要性分布从教师模型中重新定位,从而进一步增强蒸馏效果。

链接: https://arxiv.org/abs/2506.00523
作者: Xingtong Ge,Xin Zhang,Tongda Xu,Yi Zhang,Xinjie Zhang,Yan Wang,Jun Zhang
机构: The Hong Kong University of Science and Technology (香港科技大学); SenseTime Research (商汤科技); Institute for AI Industry Research, Tsinghua University (清华大学人工智能产业研究院); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review

点击查看摘要

Abstract:The Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues when applying vanilla DMD on large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to regularize the distance between the generator and fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; employing both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Along with other improvements such as scaled up discriminator models, our final model, dubbed \textbfSenseFlow, achieves superior performance in distillation for both diffusion based text-to-image models such as SDXL, and flow-matching models such as SD 3.5 Large and FLUX. The source code will be avaliable at this https URL.
zh

[CV-172] SSAM: Self-Supervised Association Modeling for Test-Time Adaption

【速读】:该论文旨在解决测试阶段适应(Test-time adaptation, TTA)中图像编码器因缺乏显式监督而被固定所带来的局限性,这种做法忽视了图像编码器在弥合训练与测试数据分布差异中的关键作用。解决方案的关键在于提出一种新的TTA框架SSAM(Self-Supervised Association Modeling),其核心是通过双阶段关联学习实现编码器的动态优化,包括软原型估计(Soft Prototype Estimation, SPE)和原型锚定图像重建(Prototype-anchored Image Reconstruction, PIR),从而提升模型在分布偏移情况下的适应能力与性能。

链接: https://arxiv.org/abs/2506.00513
作者: Yaxiong Wang,Zhenqiang Zhang,Lechao Cheng,Zhun Zhong,Dan Guo,Meng Wang
机构: Hefei University of Technology (合肥工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 papges

点击查看摘要

Abstract:Test-time adaption (TTA) has witnessed important progress in recent years, the prevailing methods typically first encode the image and the text and design strategies to model the association between them. Meanwhile, the image encoder is usually frozen due to the absence of explicit supervision in TTA scenarios. We identify a critical limitation in this paradigm: While test-time images often exhibit distribution shifts from training data, existing methods persistently freeze the image encoder due to the absence of explicit supervision during adaptation. This practice overlooks the image encoder’s crucial role in bridging distribution shift between training and test. To address this challenge, we propose SSAM (Self-Supervised Association Modeling), a new TTA framework that enables dynamic encoder refinement through dual-phase association learning. Our method operates via two synergistic components: 1) Soft Prototype Estimation (SPE), which estimates probabilistic category associations to guide feature space reorganization, and 2) Prototype-anchored Image Reconstruction (PIR), enforcing encoder stability through cluster-conditional image feature reconstruction. Comprehensive experiments across diverse baseline methods and benchmarks demonstrate that SSAM can surpass state-of-the-art TTA baselines by a clear margin while maintaining computational efficiency. The framework’s architecture-agnostic design and minimal hyperparameter dependence further enhance its practical applicability.
zh

[CV-173] UNSURF: Uncertainty Quantification for Cortical Surface Reconstruction of Clinical Brain MRIs MICCAI2025

【速读】:该论文试图解决临床脑部MRI扫描中皮层表面重建的不确定性建模问题,尤其是在不同方向、分辨率和对比度下的表面定位不确定性。解决方案的关键在于提出UNSURF,这是一种基于预测体素级有符号距离函数(SDF)与拟合表面实际SDF之间差异的新型不确定性度量方法。该方法能够更准确地反映表面重建的误差,并在多个层面实现有效的自动化质量控制,同时提升阿尔茨海默病分类任务的性能。

链接: https://arxiv.org/abs/2506.00498
作者: Raghav Mehta,Karthik Gopinath,Ben Glocker,Juan Eugenio Iglesias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Raghav Mehta and Karthik Gopinath contributed equally. Ben Glocker and Juan Eugenio Iglesias contributed equally. Paper under review at MICCAI 2025

点击查看摘要

Abstract:We propose UNSURF, a novel uncertainty measure for cortical surface reconstruction of clinical brain MRI scans of any orientation, resolution, and contrast. It relies on the discrepancy between predicted voxel-wise signed distance functions (SDFs) and the actual SDFs of the fitted surfaces. Our experiments on real clinical scans show that traditional uncertainty measures, such as voxel-wise Monte Carlo variance, are not suitable for modeling the uncertainty of surface placement. Our results demonstrate that UNSURF estimates correlate well with the ground truth errors and: \textit(i)~enable effective automated quality control of surface reconstructions at the subject-, parcel-, mesh node-level; and \textit(ii)~improve performance on a downstream Alzheimer’s disease classification task.
zh

[CV-174] Dynamic Domain Adaptation-Driven Physics-Informed Graph Representation Learning for AC-OPF

【速读】:该论文旨在解决传统交流最优潮流(AC-OPF)求解器在约束空间中变量分布与最优解之间复杂关系建模不足的问题,以及仅基于空间拓扑建模导致的时间信息等先验知识难以融合的局限性。其解决方案的关键在于提出一种名为DDA-PIGCN(动态域适应驱动的物理信息图卷积网络)的新方法,通过引入多层硬物理约束提升长程依赖特征的一致性优化,并采用动态域适应学习机制迭代更新关键状态变量,实现精确的约束验证,同时利用电力系统的物理结构捕捉发电机组与负荷之间的时空依赖关系,从而实现时空拓扑信息的深度整合。

链接: https://arxiv.org/abs/2506.00478
作者: Hongjie Zhu,Zezheng Zhang,Zeyu Zhang,Yu Bai,Shimin Wen,Huazhang Wang,Daji Ergu,Ying Cai,Yang Zhao
机构: Southwest Minzu University (西南民族大学); Northeast Electric Power University (东北电力大学); La Trobe University (拉特罗布大学); Institute of Tibetan Plateau Research (青藏高原研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Alternating Current Optimal Power Flow (AC-OPF) aims to optimize generator power outputs by utilizing the non-linear relationships between voltage magnitudes and phase angles in a power system. However, current AC-OPF solvers struggle to effectively represent the complex relationship between variable distributions in the constraint space and their corresponding optimal solutions. This limitation in constraint modeling restricts the system’s ability to develop diverse knowledge representations. Additionally, modeling the power grid solely based on spatial topology further limits the integration of additional prior knowledge, such as temporal information. To overcome these challenges, we propose DDA-PIGCN (Dynamic Domain Adaptation-Driven Physics-Informed Graph Convolutional Network), a new method designed to address constraint-related issues and build a graph-based learning framework that incorporates spatiotemporal features. DDA-PIGCN improves consistency optimization for features with varying long-range dependencies by applying multi-layer, hard physics-informed constraints. It also uses a dynamic domain adaptation learning mechanism that iteratively updates and refines key state variables under predefined constraints, enabling precise constraint verification. Moreover, it captures spatiotemporal dependencies between generators and loads by leveraging the physical structure of the power grid, allowing for deep integration of topological information across time and space. Extensive comparative and ablation studies show that DDA-PIGCN delivers strong performance across several IEEE standard test cases (such as case9, case30, and case300), achieving mean absolute errors (MAE) from 0.0011 to 0.0624 and constraint satisfaction rates between 99.6% and 100%, establishing it as a reliable and efficient AC-OPF solver.
zh

[CV-175] Flashbacks to Harmonize Stability and Plasticity in Continual Learning

【速读】:该论文试图解决持续学习(Continual Learning, CL)中模型的稳定性与可塑性之间的平衡问题。传统方法主要通过正则化模型更新来保护旧知识,而本文提出的Flashback Learning (FL) 方法通过双向正则化机制,显式地平衡这一权衡,从而在快速吸收新知识的同时主动保留旧知识。FL 的关键在于使用两个不同的知识库分别增强模型的可塑性和稳定性,并通过双阶段训练过程实现对模型更新的联合正则化,进而提升模型的整体性能。

链接: https://arxiv.org/abs/2506.00477
作者: Leila Mahmoodi,Peyman Moghadam,Munawar Hayat,Christian Simon,Mehrtash Harandi
机构: Monash University (莫纳什大学); CSIRO, Data61 (澳大利亚联邦科学与工业研究组织,数据61); Queensland University of Technology (昆士兰科技大学); SONY (索尼)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Manuscript submitted to Neural Networks (Elsevier) in August 2024; and accepted in May 2025 for publication. This version is author-accepted manuscript before copyediting and typesetting. The codes of this article will be available at this https URL

点击查看摘要

Abstract:We introduce Flashback Learning (FL), a novel method designed to harmonize the stability and plasticity of models in Continual Learning (CL). Unlike prior approaches that primarily focus on regularizing model updates to preserve old information while learning new concepts, FL explicitly balances this trade-off through a bidirectional form of regularization. This approach effectively guides the model to swiftly incorporate new knowledge while actively retaining its old knowledge. FL operates through a two-phase training process and can be seamlessly integrated into various CL methods, including replay, parameter regularization, distillation, and dynamic architecture techniques. In designing FL, we use two distinct knowledge bases: one to enhance plasticity and another to improve stability. FL ensures a more balanced model by utilizing both knowledge bases to regularize model updates. Theoretically, we analyze how the FL mechanism enhances the stability-plasticity balance. Empirically, FL demonstrates tangible improvements over baseline methods within the same training budget. By integrating FL into at least one representative baseline from each CL category, we observed an average accuracy improvement of up to 4.91% in Class-Incremental and 3.51% in Task-Incremental settings on standard image classification benchmarks. Additionally, measurements of the stability-to-plasticity ratio confirm that FL effectively enhances this balance. FL also outperforms state-of-the-art CL methods on more challenging datasets like ImageNet.
zh

[CV-176] BAGNet: A Boundary-Aware Graph Attention Network for 3D Point Cloud Semantic Segmentation IJCNN2025

【速读】:该论文旨在解决点云数据由于其固有的不规则性和无结构特性,在语义分割任务中面临的挑战。其解决方案的关键在于提出一种名为边界感知图注意力网络(Boundary-Aware Graph attention Network, BAGNet)的新方法,该方法通过引入边界感知图注意力层(BAGLayer)来捕捉边界点的复杂空间结构信息,并利用轻量级注意力池化层提取全局特征,从而在提升模型精度的同时降低计算时间。

链接: https://arxiv.org/abs/2506.00475
作者: Wei Tao,Xiaoyang Qu,Kai Lu,Jiguang Wan,Shenglin He,Jianzong Wang
机构: Ping An Technology (平安科技); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 2025 International Joint Conference on Neural Networks (IJCNN 2025)

点击查看摘要

Abstract:Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph-based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large-scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary-Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary-aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state-of-the-art methods in point cloud semantic segmentation with higher accuracy and less inference time.
zh

[CV-177] SST: Self-training with Self-adaptive Thresholding for Semi-supervised Learning

【速读】:该论文旨在解决半监督学习(Semi-Supervised Learning, SSL)中由于依赖固定阈值导致的高质量伪标签选择不准确的问题。其解决方案的关键在于提出一种自适应阈值机制(Self-Adaptive Thresholding, SAT),该机制能够根据模型的学习进度动态调整类别特定的阈值,从而确保伪标签数据的质量,减少错误伪标签和确认偏误的风险。

链接: https://arxiv.org/abs/2506.00467
作者: Shuai Zhao,Heyan Huang,Xinge Li,Xiaokang Chen,Rui Wang
机构: Beijing Institute of Technology (北京理工大学); Institute of Atmospheric Physics, Chinese Academy of Sciences (中国科学院大气物理研究所); StarSee (星 seen)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Information Processing Management (IPM)

点击查看摘要

Abstract:Neural networks have demonstrated exceptional performance in supervised learning, benefiting from abundant high-quality annotated data. However, obtaining such data in real-world scenarios is costly and labor-intensive. Semi-supervised learning (SSL) offers a solution to this problem. Recent studies, such as Semi-ViT and Noisy Student, which employ consistency regularization or pseudo-labeling, have demonstrated significant achievements. However, they still face challenges, particularly in accurately selecting sufficient high-quality pseudo-labels due to their reliance on fixed thresholds. Recent methods such as FlexMatch and FreeMatch have introduced flexible or self-adaptive thresholding techniques, greatly advancing SSL research. Nonetheless, their process of updating thresholds at each iteration is deemed time-consuming, computationally intensive, and potentially unnecessary. To address these issues, we propose Self-training with Self-adaptive Thresholding (SST), a novel, effective, and efficient SSL framework. SST introduces an innovative Self-Adaptive Thresholding (SAT) mechanism that adaptively adjusts class-specific thresholds based on the model’s learning progress. SAT ensures the selection of high-quality pseudo-labeled data, mitigating the risks of inaccurate pseudo-labels and confirmation bias. Extensive experiments demonstrate that SST achieves state-of-the-art performance with remarkable efficiency, generalization, and scalability across various architectures and datasets. Semi-SST-ViT-Huge achieves the best results on competitive ImageNet-1K SSL benchmarks, with 80.7% / 84.9% Top-1 accuracy using only 1% / 10% labeled data. Compared to the fully-supervised DeiT-III-ViT-Huge, which achieves 84.8% Top-1 accuracy using 100% labeled data, our method demonstrates superior performance using only 10% labeled data.
zh

[CV-178] Performance Analysis of Few-Shot Learning Approaches for Bangla Handwritten Character and Digit Recognition

【速读】:该论文旨在解决在有限标注数据条件下识别孟加拉语手写字符和数字的挑战,特别是在脚本结构复杂且数据集稀缺的情况下。其解决方案的关键在于提出一种名为SynergiProtoNet的混合网络架构,该架构结合了先进的聚类技术与稳健的嵌入框架,通过多层级(高、低层次)特征提取在原型学习框架中捕捉细粒度细节和上下文细微差别,从而提升手写字符和数字的识别准确率。

链接: https://arxiv.org/abs/2506.00447
作者: Mehedi Ahamed,Radib Bin Kabir,Tawsif Tashwar Dipto,Mueeze Al Mushabbir,Sabbir Ahmed,Md. Hasanul Kabir
机构: Islamic University of Technology (伊斯兰技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study investigates the performance of few-shot learning (FSL) approaches in recognizing Bangla handwritten characters and numerals using limited labeled data. It demonstrates the applicability of these methods to scripts with intricate and complex structures, where dataset scarcity is a common challenge. Given the complexity of Bangla script, we hypothesize that models performing well on these characters can generalize effectively to languages of similar or lower structural complexity. To this end, we introduce SynergiProtoNet, a hybrid network designed to improve the recognition accuracy of handwritten characters and digits. The model integrates advanced clustering techniques with a robust embedding framework to capture fine-grained details and contextual nuances. It leverages multi-level (both high- and low-level) feature extraction within a prototypical learning framework. We rigorously benchmark SynergiProtoNet against several state-of-the-art few-shot learning models: BD-CSPN, Prototypical Network, Relation Network, Matching Network, and SimpleShot, across diverse evaluation settings including Monolingual Intra-Dataset Evaluation, Monolingual Inter-Dataset Evaluation, Cross-Lingual Transfer, and Split Digit Testing. Experimental results show that SynergiProtoNet consistently outperforms existing methods, establishing a new benchmark in few-shot learning for handwritten character and digit recognition. The code is available on GitHub: this https URL.
zh

[CV-179] Efficient 3D Brain Tumor Segmentation with Axial-Coronal-Sagittal Embedding

【速读】:该论文旨在解决医学影像中脑肿瘤分割任务的性能提升问题,特别是针对当前最先进的nnU-Net模型在训练需求高和预训练权重利用不足方面的局限性。解决方案的关键在于将轴向-冠状-矢状卷积与ImageNet预训练权重整合到nnU-Net框架中,从而减少训练轮次、可训练参数并提高效率;同时提出了两种将2D预训练权重迁移至3D领域的策略,以保持关键特征表示和信息传播的有效性,并探索了结合分类与分割的联合模型,利用脑胶质瘤分级分类代理任务的预训练编码器,进一步提升了分割性能。

链接: https://arxiv.org/abs/2506.00434
作者: Tuan-Luc Huynh,Thanh-Danh Le,Tam V. Nguyen,Trung-Nghia Le,Minh-Triet Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by PSIVT 2023. Best paper award. Repo: this https URL

点击查看摘要

Abstract:In this paper, we address the crucial task of brain tumor segmentation in medical imaging and propose innovative approaches to enhance its performance. The current state-of-the-art nnU-Net has shown promising results but suffers from extensive training requirements and underutilization of pre-trained weights. To overcome these limitations, we integrate Axial-Coronal-Sagittal convolutions and pre-trained weights from ImageNet into the nnU-Net framework, resulting in reduced training epochs, reduced trainable parameters, and improved efficiency. Two strategies for transferring 2D pre-trained weights to the 3D domain are presented, ensuring the preservation of learned relationships and feature representations critical for effective information propagation. Furthermore, we explore a joint classification and segmentation model that leverages pre-trained encoders from a brain glioma grade classification proxy task, leading to enhanced segmentation performance, especially for challenging tumor labels. Experimental results demonstrate that our proposed methods in the fast training settings achieve comparable or even outperform the ensemble of cross-validation models, a common practice in the brain tumor segmentation literature.
zh

[CV-180] Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free

【速读】:该论文旨在解决高分辨率图像生成中的核心挑战,即在计算效率与保留细粒度视觉细节之间取得平衡。其解决方案的关键在于提出一种轻量级框架Latent Wavelet Diffusion (LWD),该框架通过三个关键组件实现:(1) 一种尺度一致的变分自编码器目标,以提高潜在表示的频谱保真度;(2) 小波能量图,用于识别和定位潜在空间中细节丰富的区域;(3) 一种时间依赖的掩码策略,在训练过程中将去噪监督集中于高频成分。LWD无需架构修改且不增加额外计算开销,却能持续提升超高分辨率图像生成的感知质量和FID指标。

链接: https://arxiv.org/abs/2506.00433
作者: Luigi Sigillo,Shengfeng He,Danilo Comminiello
机构: Sapienza University of Rome (罗马第一大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.
zh

[CV-181] DPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection ICML2025

【速读】:该论文旨在解决持续学习中由于医学与自然领域概念差异导致的前景-背景信息紧密耦合以及提示与图像-文本标记之间耦合注意力带来的挑战,特别是在增量医学目标检测任务中的性能瓶颈。解决方案的关键在于提出\method框架,其核心包括两个主要组件:实例级提示生成(Instance-level Prompt Generation, \ipg),通过从图像中解耦细粒度实例级知识并生成关注密集预测的提示;以及解耦提示注意力(Decoupled Prompt Attention, \dpa),通过解耦原始提示注意力实现更直接高效的提示信息传递,同时降低内存消耗并缓解灾难性遗忘。

链接: https://arxiv.org/abs/2506.00406
作者: Huahui Yi,Wei Xu,Ziyuan Qin,Xi Chen,Xiaohu Wu,Kang Li,Qicheng Lao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to ICML 2025

点击查看摘要

Abstract:Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks; however, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the \method~framework, which comprises two main components: 1) Instance-level Prompt Generation (\ipg), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (\dpa), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as \dataset, and experiments demonstrate that \method~outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 12.88%, and 4.59% in full data, 1-shot, 10-shot, and 50-shot settings, respectively.
zh

[CV-182] Sequence-Based Identification of First-Person Camera Wearers in Third-Person Views

【速读】:该论文试图解决多摄像机佩戴者在共享环境中的交互问题,这一问题在沉浸式学习和协作机器人等应用中具有重要意义,但目前研究仍较为薄弱。解决方案的关键在于提出TF2025数据集,该数据集包含同步的第一人称和第三人称视角,并引入一种基于序列的方法,通过结合运动线索和人员重识别技术来识别第三人称视频中的第一人称佩戴者。

链接: https://arxiv.org/abs/2506.00394
作者: Ziwei Zhao,Xizi Wang,Yuchen Wang,Feng Cheng,David Crandall
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing popularity of egocentric cameras has generated growing interest in studying multi-camera interactions in shared environments. Although large-scale datasets such as Ego4D and Ego-Exo4D have propelled egocentric vision research, interactions between multiple camera wearers remain underexplored-a key gap for applications like immersive learning and collaborative robotics. To bridge this, we present TF2025, an expanded dataset with synchronized first- and third-person views. In addition, we introduce a sequence-based method to identify first-person wearers in third-person footage, combining motion cues and person re-identification.
zh

[CV-183] Feature Fusion and Knowledge-Distilled Multi-Modal Multi-Target Detection

【速读】:该论文旨在解决在监控与防御领域中,多目标检测与分类(Multi-Target Detection and Classification, MTD)因异构输入数据源和资源受限嵌入式设备上算法计算复杂性而面临的挑战。其解决方案的关键在于提出一种基于特征融合与知识蒸馏的框架,通过数据融合提升检测精度,并利用知识蒸馏增强模型的领域适应性。该方法在RGB和热成像图像输入的基础上构建了一个新颖的多模态模型,并结合知识蒸馏训练流程,通过后验概率优化任务和复合损失函数实现教师模型到学生模型的知识迁移,从而在保持较高检测性能的同时显著降低推理时间。

链接: https://arxiv.org/abs/2506.00365
作者: Ngoc Tuyen Do,Tri Nhu Do
机构: Hanoi University of Science and Technology (河内科技大学); Polytechnique Montréal (蒙特利尔工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In the surveillance and defense domain, multi-target detection and classification (MTD) is considered essential yet challenging due to heterogeneous inputs from diverse data sources and the computational complexity of algorithms designed for resource-constrained embedded devices, particularly for Al-based solutions. To address these challenges, we propose a feature fusion and knowledge-distilled framework for multi-modal MTD that leverages data fusion to enhance accuracy and employs knowledge distillation for improved domain adaptation. Specifically, our approach utilizes both RGB and thermal image inputs within a novel fusion-based multi-modal model, coupled with a distillation training pipeline. We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline supported by a composite loss function. This loss function effectively transfers knowledge from a teacher model to a student model. Experimental results demonstrate that our student model achieves approximately 95% of the teacher model’s mean Average Precision while reducing inference time by approximately 50%, underscoring its suitability for practical MTD deployment scenarios.
zh

[CV-184] st-time Vocabulary Adaptation for Language-driven Object Detection ICIP2025

【速读】:该论文旨在解决开放词汇目标检测模型中用户定义的类别词汇可能过于宽泛或错误指定,从而影响检测器整体性能的问题。解决方案的关键在于提出一种无需训练的插件式 Vocabulary Adapter (VocAda),其在推理阶段通过图像描述生成、名词解析和相关类别筛选三个步骤,自动调整用户定义的词汇,使其更贴合当前图像内容,从而提升检测效果。

链接: https://arxiv.org/abs/2506.00333
作者: Mingxuan Liu,Tyler L. Hayes,Massimiliano Mancini,Elisa Ricci,Riccardo Volpi,Gabriela Csurka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a conference paper at ICIP 2025

点击查看摘要

Abstract:Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to categories that are relevant for a given image. VocAda does not require any training, it operates at inference time in three steps: i) it uses an image captionner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, proving its versatility. The code is open source.
zh

[CV-185] Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

【速读】:该论文旨在解决Diffusion Transformers (DiTs)在视频生成任务中因模型规模大和时空注意力的二次计算成本导致的计算开销过大的问题。其解决方案的关键在于提出Foresight,一种自适应的层复用技术,通过动态识别并复用不同步骤中的DiT块输出,以减少去噪过程中的计算冗余,同时保持基线性能,从而在速度与质量之间实现更优的平衡。

链接: https://arxiv.org/abs/2506.00329
作者: Muhammad Adnan,Nithesh Kurella,Akhil Arunkumar,Prashant J. Nair
机构: The University of British Columbia (不列颠哥伦比亚大学); d-Matrix (d-Matrix)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves up to 1.63x end-to-end speedup, while maintaining video quality. The source code of Foresight is available at \textttthis https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2506.00329 [cs.LG] (or arXiv:2506.00329v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.00329 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-186] Latent Guidance in Diffusion Models for Perceptual Evaluations

【速读】:该论文试图解决无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)中感知一致性不足的问题,即现有潜在扩散模型在生成高维图像数据时,缺乏对感知一致性的深入探索。论文的关键解决方案是提出感知流形引导(Perceptual Manifold Guidance, PMG),该方法利用预训练的潜在扩散模型和感知质量特征,从去噪U-Net中获取与人类感知高度相关的多尺度、多时间步的超特征,从而提升NR-IQA任务的性能。

链接: https://arxiv.org/abs/2506.00327
作者: Shreshth Saini,Ru-Ling Liao,Yan Ye,Alan C. Bovik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 Pages, 7 figures, 10 Tables

点击查看摘要

Abstract:Despite recent advancements in latent diffusion models that generate high-dimensional image data and perform various downstream tasks, there has been little exploration into perceptual consistency within these models on the task of No-Reference Image Quality Assessment (NR-IQA). In this paper, we hypothesize that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold. We leverage this insight to guide on-manifold sampling using perceptual features and input measurements. Specifically, we propose Perceptual Manifold Guidance (PMG), an algorithm that utilizes pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale and multi-timestep feature maps from the denoising U-Net. We empirically demonstrate that these hyperfeatures exhibit high correlation with human perception in IQA tasks. Our method can be applied to any existing pretrained latent diffusion model and is straightforward to integrate. To the best of our knowledge, this paper is the first work on guiding diffusion model with perceptual features for NR-IQA. Extensive experiments on IQA datasets show that our method, LGDM, achieves state-of-the-art performance, underscoring the superior generalization capabilities of diffusion models for NR-IQA tasks.
zh

[CV-187] owards Effective and Efficient Adversarial Defense with Diffusion Models for Robust Visual Tracking

【速读】:该论文旨在解决深度学习视觉跟踪方法在面对精心设计的对抗攻击时表现出的脆弱性问题,这种攻击会导致跟踪性能急剧下降。其解决方案的关键在于提出了一种基于去噪扩散概率模型(Denoise Diffusion Probabilistic Models)的新型对抗防御方法,称为DiffDf,该方法通过结合像素级重建损失、语义一致性损失和结构相似性损失,建立多尺度防御机制,从而在逐步去噪过程中有效抑制对抗扰动。

链接: https://arxiv.org/abs/2506.00325
作者: Long Xu,Peng Gao,Wen-Jia Tang,Fei Wang,Ru-Yue Yuan
机构: Qufu Normal University (曲阜师范大学); Harbin Institute of Technology Shenzhen (哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although deep learning-based visual tracking methods have made significant progress, they exhibit vulnerabilities when facing carefully designed adversarial attacks, which can lead to a sharp decline in tracking performance. To address this issue, this paper proposes for the first time a novel adversarial defense method based on denoise diffusion probabilistic models, termed DiffDf, aimed at effectively improving the robustness of existing visual tracking methods against adversarial attacks. DiffDf establishes a multi-scale defense mechanism by combining pixel-level reconstruction loss, semantic consistency loss, and structural similarity loss, effectively suppressing adversarial perturbations through a gradual denoising process. Extensive experimental results on several mainstream datasets show that the DiffDf method demonstrates excellent generalization performance for trackers with different architectures, significantly improving various evaluation metrics while achieving real-time inference speeds of over 30 FPS, showcasing outstanding defense performance and efficiency. Codes are available at this https URL.
zh

[CV-188] Improving Optical Flow and Stereo Depth Estimation by Leverag ing Uncertainty-Based Learning Difficulties CVPR

【速读】:该论文旨在解决传统光学流和立体深度模型训练中采用统一损失函数所导致的不足,即忽视了像素及上下文区域在学习难度上的显著差异。其解决方案的关键在于引入基于不确定性的置信度图,并提出两种针对性的损失函数:Difficulty Balancing (DB) 损失通过误差驱动的置信度度量引导网络关注困难区域;Occlusion Avoiding (OA) 损失则通过引导网络进入循环一致性可靠的区域来避免遮挡影响,从而有效管理训练过程中各类挑战性像素和区域。

链接: https://arxiv.org/abs/2506.00324
作者: Jisoo Jeong,Hong Cai,Jamie Menjay Lin,Fatih Porikli
机构: Qualcomm AI Research†(高通人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPRW2025

点击查看摘要

Abstract:Conventional training for optical flow and stereo depth models typically employs a uniform loss function across all pixels. However, this one-size-fits-all approach often overlooks the significant variations in learning difficulty among individual pixels and contextual regions. This paper investigates the uncertainty-based confidence maps which capture these spatially varying learning difficulties and introduces tailored solutions to address them. We first present the Difficulty Balancing (DB) loss, which utilizes an error-based confidence measure to encourage the network to focus more on challenging pixels and regions. Moreover, we identify that some difficult pixels and regions are affected by occlusions, resulting from the inherently ill-posed matching problem in the absence of real correspondences. To address this, we propose the Occlusion Avoiding (OA) loss, designed to guide the network into cycle consistency-based confident regions, where feature matching is more reliable. By combining the DB and OA losses, we effectively manage various types of challenging pixels and regions during training. Experiments on both optical flow and stereo depth tasks consistently demonstrate significant performance improvements when applying our proposed combination of the DB and OA losses.
zh

[CV-189] Chain-of-Frames: Advancing Video Understanding in Multimodal LLM s via Frame-Aware Reasoning

【速读】:该论文试图解决视频理解任务中大型语言模型(Large Language Models, LLMs)生成的推理过程缺乏对视频关键帧的明确引用,从而导致性能受限和幻觉率较高的问题。解决方案的关键在于构建一个名为CoF-Data的数据集,其中包含关于自然和合成视频的多样化问题、答案以及与具体视频帧相关联的推理轨迹,并在此基础上微调现有的视频LLMs,使模型在生成链式推理时能够直接引用相关的视频帧,从而提升推理的准确性和任务表现。

链接: https://arxiv.org/abs/2506.00318
作者: Sara Ghazanfari,Francesco Croce,Nicolas Flammarion,Prashanth Krishnamurthy,Farshad Khorrami,Siddharth Garg
机构: New York University (纽约大学); EPFL (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user’s request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chain-of-thoughts (CoT) about the content of input images and videos. In this work, we propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces about both natural and synthetic videos, spanning various topics and tasks. Then, we fine-tune existing video LLMs on this chain-of-frames (CoF) data. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. We show that our models based on CoF are able to generate chain-of-thoughts that accurately refer to the key frames to answer the given question. This, in turn, leads to improved performance across multiple video understanding benchmarks, for example, surpassing leading video LLMs on Video-MME, MVBench, and VSI-Bench, and notably reducing the hallucination rate. Code available at this https URLthis http URL.
zh

[CV-190] 3D Gaussian Splat Vulnerabilities CVPR’25

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS)在安全关键型应用中可能面临的对抗性攻击问题,即如何通过操纵场景内容对系统造成危害。其解决方案的关键在于提出CLOAK和DAGGER两种攻击方法:CLOAK利用视图依赖的高斯外观(颜色和纹理随视角变化)嵌入仅从特定视角可见的对抗性内容;DAGGER则是一种无需访问底层训练数据的目标性对抗攻击,通过直接扰动3D高斯分布来欺骗多阶段目标检测器,如Faster R-CNN。这些方法揭示了3DGS中尚未被充分研究的安全漏洞,为自主导航等安全关键型应用带来了新的潜在威胁。

链接: https://arxiv.org/abs/2506.00280
作者: Matthew Hull,Haoyang Yang,Pratham Mehta,Mansi Phute,Aeree Cho,Haoran Wang,Matthew Lau,Wenke Lee,Willian T. Lunardi,Martin Andreoni,Polo Chau
机构: Georgia Tech(佐治亚理工学院); Technology Innovation Institute(技术创新研究所)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 4 pages, 4 figures, CVPR '25 Workshop on Neural Fields Beyond Conventional Cameras

点击查看摘要

Abstract:With 3D Gaussian Splatting (3DGS) being increasingly used in safety-critical applications, how can an adversary manipulate the scene to cause harm? We introduce CLOAK, the first attack that leverages view-dependent Gaussian appearances - colors and textures that change with viewing angle - to embed adversarial content visible only from specific viewpoints. We further demonstrate DAGGER, a targeted adversarial attack directly perturbing 3D Gaussians without access to underlying training data, deceiving multi-stage object detectors e.g., Faster R-CNN, through established methods such as projected gradient descent. These attacks highlight underexplored vulnerabilities in 3DGS, introducing a new potential threat to robotic learning for autonomous navigation and other safety-critical 3DGS applications.
zh

[CV-191] PerFormer: A Permutation Based Vision Transformer for Remaining Useful Life Prediction

【速读】:该论文试图解决在退化系统中准确估计剩余使用寿命(RUL)的问题,这是现代故障预测与健康管理(PHM)中的关键任务。传统方法如卷积神经网络(CNN)和循环神经网络(RNN)虽已被广泛应用于RUL预测,但随着视觉Transformer(ViT)在计算机视觉任务中的优越表现,其在RUL预测中的潜力值得探索。然而,直接将ViT应用于多变量传感器数据面临挑战,主要是时间序列数据中空间信息的模糊性。该论文提出的解决方案是PerFormer,其关键在于通过一种基于排列的视觉Transformer方法对多变量时间序列数据进行排列,模拟图像数据的空间特性,从而使其适用于ViT。为生成所需的排列矩阵,作者引入了一种新的排列损失函数,旨在引导任意矩阵收敛至排列矩阵。实验结果表明,PerFormer在NASA的C-MAPSS数据集上优于现有基于CNN、RNN和多种Transformer模型的方法。

链接: https://arxiv.org/abs/2506.00259
作者: Zhengyang Fan,Wanru Li,Kuo-chu Chang,Ting Yuan
机构: George Mason University (乔治梅森大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Shanghai Jiao Tong University (上海交通大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately estimating the remaining useful life (RUL) for degradation systems is crucial in modern prognostic and health management (PHM). Convolutional Neural Networks (CNNs), initially developed for tasks like image and video recognition, have proven highly effectively in RUL prediction, demonstrating remarkable performance. However, with the emergence of the Vision Transformer (ViT), a Transformer model tailored for computer vision tasks such as image classification, and its demonstrated superiority over CNNs, there is a natural inclination to explore its potential in enhancing RUL prediction accuracy. Nonetheless, applying ViT directly to multivariate sensor data for RUL prediction poses challenges, primarily due to the ambiguous nature of spatial information in time series data. To address this issue, we introduce the PerFormer, a permutation-based vision transformer approach designed to permute multivariate time series data, mimicking spatial characteristics akin to image data, thereby making it suitable for ViT. To generate the desired permutation matrix, we introduce a novel permutation loss function aimed at guiding the convergence of any matrix towards a permutation matrix. Our experiments on NASA’s C-MAPSS dataset demonstrate the PerFormer’s superior performance in RUL prediction compared to state-of-the-art methods employing CNNs, Recurrent Neural Networks (RNNs), and various Transformer models. This underscores its effectiveness and potential in PHM applications.
zh

[CV-192] Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

【速读】:该论文试图解决现有视频扩散技术在生成真实汽车碰撞图像时的不足,这一问题主要源于大多数驾驶数据集中事故事件的稀缺性。为提升交通安全性,需要具备真实性和可控性的事故模拟。解决方案的关键在于提出Ctrl-Crash模型,该模型通过边界框、碰撞类型和初始图像帧等信号进行条件控制,实现反事实场景生成,并通过无分类器指导实现对每个条件信号独立可调的精细控制,从而在定量和定性评估中均达到当前最优性能。

链接: https://arxiv.org/abs/2506.00227
作者: Anthony Gosselin,Ge Ya Luo,Luis Lara,Florian Golemo,Derek Nowrouzezahrai,Liam Paull,Alexia Jolicoeur-Martineau,Christopher Pal
机构: Mila(蒙特利尔学习算法研究所); Polytechnique Montréal(蒙特利尔工程学院); Université de Montréal(蒙特利尔大学); McGill University(麦吉尔大学); CIFAR AI Chair(加拿大高级研究院人工智能主席); Samsung SAIL Montréal(三星蒙特利尔人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Under review

点击查看摘要

Abstract:Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To tackle the problem, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation where minor variations in input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance across quantitative video quality metrics (e.g., FVD and JEDi) and qualitative measurements based on a human-evaluation of physical realism and video quality compared to prior diffusion-based methods.
zh

[CV-193] Understanding while Exploring: Semantics-driven Active Mapping

【速读】:该论文旨在解决在未知环境中实现有效机器人自主性所面临的挑战,特别是如何通过主动探索和精确理解几何与语义信息来提升地图构建的完整性、准确性和鲁棒性。解决方案的关键在于提出一种名为ActiveSGM的主动语义映射框架,该框架基于3D高斯点云(3DGS)映射核心,结合语义与几何不确定性量化以及稀疏语义表示,以预测潜在观测的有用性并指导探索策略。

链接: https://arxiv.org/abs/2506.00225
作者: Liyan Chen,Huangying Zhan,Hairong Yin,Yi Xu,Philippos Mordohai
机构: Stevens Institute of Technology (斯蒂文斯理工学院); Goertek Alpha Labs (戈尔特克阿尔法实验室); Purdue University (普渡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective robotic autonomy in unknown environments demands proactive exploration and precise understanding of both geometry and semantics. In this paper, we propose ActiveSGM, an active semantic mapping framework designed to predict the informativeness of potential observations before execution. Built upon a 3D Gaussian Splatting (3DGS) mapping backbone, our approach employs semantic and geometric uncertainty quantification, coupled with a sparse semantic representation, to guide exploration. By enabling robots to strategically select the most beneficial viewpoints, ActiveSGM efficiently enhances mapping completeness, accuracy, and robustness to noisy semantic data, ultimately supporting more adaptive scene exploration. Our experiments on the Replica and Matterport3D datasets highlight the effectiveness of ActiveSGM in active semantic mapping tasks.
zh

[CV-194] FastCAR: Fast Classification And Regression for Task Consolidation in Multi-Task Learning to Model a Continuous Property Variable of Detected Object Class

【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)中任务异质性带来的挑战,特别是在分类任务与回归任务之间仅有微弱相关性的场景下,如何实现有效的任务整合。解决方案的关键在于提出一种名为FastCAR的任务整合方法,其核心是通过标签转换策略,使得仅需单一任务回归网络架构即可处理分类与回归任务,从而在保持高性能的同时提升训练效率和推理速度。

链接: https://arxiv.org/abs/2506.00208
作者: Anoop Kini,Andreas Jansche,Timo Bernthaler,Gerhard Schneider
机构: Hochschule Aalen(霍赫施塔特应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification and a regression task, despite the non-triviality of task heterogeneity with only a subtle correlation. The approach addresses the classification of a detected object (occupying the entire image frame) and regression for modeling a continuous property variable (for instances of an object class), a crucial use case in science and engineering. FastCAR involves a label transformation approach that is amenable for use with only a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized in the landscape of architecture and loss weighting schemes, when learning both tasks are collectively considered (classification accuracy of 99.54%, regression mean absolute percentage error of 2.4%). The experiments performed used “Advanced Steel Property Dataset” contributed by us this https URL. The dataset comprises 4536 images of 224x224 pixels, annotated with discrete object classes and its hardness property that can take continuous values. Our proposed FastCAR approach for task consolidation achieves training time efficiency (2.52x quicker) and reduced inference latency (55% faster) than benchmark MTL networks.
zh

[CV-195] Efficient Endangered Deer Species Monitoring with UAV Aerial Imagery and Deep Learning

【速读】:该论文试图解决传统识别方法在监测濒危鹿类物种时存在的资源和时间成本高问题,其解决方案的关键在于利用无人机(UAV)和深度学习技术,通过高分辨率航拍图像与先进的计算机视觉技术实现鹿类物种的自动化识别。研究采用定制化的YOLO算法,并基于无人机采集的大量图像数据进行训练,以提高识别的准确性和效率。

链接: https://arxiv.org/abs/2506.00164
作者: Agustín Roca,Gabriel Torre,Juan I. Giribet,Gastón Castro,Leonardo Colombo,Ignacio Mas,Javier Pereira
机构: Universidad de San Andrés - CONICET(圣安德烈斯大学-国家科学研究中心); Consejo Superior de Investigaciones Científicas(西班牙高等科学研究委员会); Museo Argentino de Ciencias Naturales ”Bernardino Rivadavia”(阿根廷自然科学院“贝尔纳迪诺·里瓦达维亚博物馆”); CONICET(国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper examines the use of Unmanned Aerial Vehicles (UAVs) and deep learning for detecting endangered deer species in their natural habitats. As traditional identification processes require trained manual labor that can be costly in resources and time, there is a need for more efficient solutions. Leveraging high-resolution aerial imagery, advanced computer vision techniques are applied to automate the identification process of deer across two distinct projects in Buenos Aires, Argentina. The first project, Pantano Project, involves the marsh deer in the Paraná Delta, while the second, WiMoBo, focuses on the Pampas deer in Campos del Tuyú National Park. A tailored algorithm was developed using the YOLO framework, trained on extensive datasets compiled from UAV-captured images. The findings demonstrate that the algorithm effectively identifies marsh deer with a high degree of accuracy and provides initial insights into its applicability to Pampas deer, albeit with noted limitations. This study not only supports ongoing conservation efforts but also highlights the potential of integrating AI with UAV technology to enhance wildlife monitoring and management practices.
zh

[CV-196] Detection of Endangered Deer Species Using UAV Imagery: A Comparative Study Between Efficient Deep Learning Approaches

【速读】:该论文旨在解决在无人机(UAV)影像中对湿地鹿(marsh deer)进行检测的问题,特别是在目标占据图像比例极小且被植被遮挡的复杂场景下。其解决方案的关键在于引入带有分割头(segmentation head)的YOLO模型,并通过精确的分割掩码进行细粒度训练,从而提升检测性能。

链接: https://arxiv.org/abs/2506.00154
作者: Agustín Roca,Gastón Castro,Gabriel Torre,Leonardo J. Colombo,Ignacio Mas,Javier Pereira,Juan I. Giribet
机构: Universidad de San Andrés (圣安德烈斯大学); CONICET (阿根廷国家科学与技术研究理事会); Centre for Automation and Robotics (CSIC-UPM) (自动化与机器人中心(CSIC-UPM)); Museo Argentino Bernardino Rivadavia (贝尔纳迪诺·里瓦达维亚阿根廷博物馆)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study compares the performance of state-of-the-art neural networks including variants of the YOLOv11 and RT-DETR models for detecting marsh deer in UAV imagery, in scenarios where specimens occupy a very small portion of the image and are occluded by vegetation. We extend previous analysis adding precise segmentation masks for our datasets enabling a fine-grained training of a YOLO model with a segmentation head included. Experimental results show the effectiveness of incorporating the segmentation head achieving superior detection performance. This work contributes valuable insights for improving UAV-based wildlife monitoring and conservation strategies through scalable and accurate AI-driven detection systems.
zh

[CV-197] Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation

【速读】:该论文试图解决手语翻译(Sign Language Translation, SLT)中骨骼表示的几何特性不足的问题,旨在通过改进骨骼特征的几何结构来提升模型的表征能力。其解决方案的关键在于利用双曲几何(hyperbolic geometry)的特性,将从时空图卷积网络(Spatio-Temporal Graph Convolutional Networks, ST-GCNs)提取的骨骼特征投影到庞加莱球(Poincaré ball)模型中,从而生成更具区分性的嵌入表示,特别是在细微动作如手指运动的建模上。为此,研究者引入了双曲投影层、加权弗雷歇均值聚合方案以及直接在双曲空间中操作的几何对比损失,并将其作为正则化函数集成到端到端的翻译框架中,以增强语言模型中的表示效果。

链接: https://arxiv.org/abs/2506.00129
作者: Edward Fish,Richard Bowden
机构: CVSSP, University of Surrey(视觉、语音与信号处理中心,萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fréchet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: this https URL.
zh

[CV-198] Visual Embodied Brain: Let Multimodal Large Language Models See Think and Control in Spaces

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在物理实体如仿生机器人中应用时,难以统一视觉-空间推理与物理交互能力的问题。其解决方案的关键在于提出一种名为VeBrain的统一框架,该框架将机器人控制转化为二维视觉空间中的文本基础任务,从而统一不同任务的目标和映射空间,并通过新型机器人适配器将MLLM生成的文本控制信号转换为真实机器人的运动策略。

链接: https://arxiv.org/abs/2506.00123
作者: Gen Luo,Ganlin Yang,Ziyang Gong,Guanzhou Chen,Haonan Duan,Erfei Cui,Ronglei Tong,Zhi Hou,Tianyi Zhang,Zhe Chen,Shenglong Ye,Lewei Lu,Jingbo Wang,Wenhai Wang,Jifeng Dai,Yu Qiao,Rongrong Ji,Xizhou Zhu
机构: Shanghai AI Laboratory; Tsinghua University; University of Science and Technology of China; Shanghai Jiao Tong University; Xiamen University; SenseTime Research; Zhejiang University; Nanjing University
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extend them to physical entities like legged robot. This typically requires MLLMs to not only grasp multimodal understanding abilities, but also integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless,existing methods struggle to unify these capabilities due to their fundamental this http URL this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs to motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing various capabilities of VeBrain. In VeBrain-600k, we take hundreds of hours to collect, curate and annotate the data, and adopt multimodal chain-of-thought(CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain to existing MLLMs like Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves substantial gains on MMVet by +5.6%, but also excels in legged robot tasks with +50% average gains.
zh

[CV-199] EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

【速读】:该论文试图解决现有程序感知视频表示学习方法未能显式建模状态变化(scene transformations)的问题,从而难以准确理解操作步骤对场景的转变及其对后续步骤的影响。其解决方案的关键在于引入由大语言模型(LLM)生成的状态变化描述作为视频编码器的监督信号,并通过生成状态变化的反事实(counterfactuals)来模拟假设的失败结果,使模型能够通过想象“如果……会怎样”的未见场景来增强对活动因果关系的理解。

链接: https://arxiv.org/abs/2506.00101
作者: Chi-Hsi Kung,Frangil Ramirez,Juhyung Ha,Yi-Ting Chen,David Crandall,Yi-Hsuan Tsai
机构: Indiana University (印第安纳大学); National Yang-Ming Chiao-Tung University (国家阳明交通大学); Atmanity Inc (Atmanity公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 1 figure, 4 tables. Full paper is available at arXiv:2503.21055

点击查看摘要

Abstract:Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Yet, existing work on procedure-aware video representations fails to explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by LLMs as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen ``What if’’ scenarios. This counterfactual reasoning facilitates the model’s ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, and more. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks.
zh

[CV-200] From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control

【速读】:该论文试图解决现有研究在人类运动生成建模中忽视了人类活动的层次化目标导向性质的问题,即当前研究主要关注低层次、短周期运动或高层次动作规划,而缺乏对整体行为计划的建模。解决方案的关键在于提出一种统一框架——生成式行为控制(Generative Behavior Control, GBC),通过将运动与由大型语言模型(Large Language Models, LLMs)生成的层次化行为计划对齐,从而驱动多样化的高层意图下的运动生成。该方法结合了任务和运动规划在机器人学中的控制思想,并借助LLMs提升运动的多样性和物理真实性。

链接: https://arxiv.org/abs/2506.00043
作者: Jusheng Zhang,Jinzhou Tang,Sidi Liu,Mingyan Li,Sheng Zhang,Jian Wang,Keze Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human motion generative modeling or synthesis aims to characterize complicated human motions of daily activities in diverse real-world environments. However, current research predominantly focuses on either low-level, short-period motions or high-level action planning, without taking into account the hierarchical goal-oriented nature of human activities. In this work, we take a step forward from human motion generation to human behavior modeling, which is inspired by cognitive science. We present a unified framework, dubbed Generative Behavior Control (GBC), to model diverse human motions driven by various high-level intentions by aligning motions with hierarchical behavior plans generated by large language models (LLMs). Our insight is that human motions can be jointly controlled by task and motion planning in robotics, but guided by LLMs to achieve improved motion diversity and physical fidelity. Meanwhile, to overcome the limitations of existing benchmarks, i.e., lack of behavioral plans, we propose GBC-100K dataset annotated with a hierarchical granularity of semantic and motion plans driven by target goals. Our experiments demonstrate that GBC can generate more diverse and purposeful high-quality human motions with 10* longer horizons compared with existing methods when trained on GBC-100K, laying a foundation for future research on behavioral modeling of human motions. Our dataset and source code will be made publicly available.
zh

[CV-201] GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶系统中多传感器融合的性能与鲁棒性提升问题。现有方法主要采用基于注意力的扁平化融合或通过几何变换实现的鸟瞰图融合,但这些方法通常存在可解释性受限或计算密集的问题。解决方案的关键在于提出GaussianFusion框架,该框架利用直观且紧凑的高斯表示作为中间载体,从多种传感器中聚合信息。具体而言,通过在驾驶场景中均匀初始化一组2D高斯,并结合显式和隐式特征进行逐步优化,从而有效融合空间与语义信息,提升轨迹预测的准确性与鲁棒性。

链接: https://arxiv.org/abs/2506.00034
作者: Shuai Liu,Quanmin Liang,Zefeng Li,Boyang Li,Kai Huang
机构: Sun Yat-sen University (中山大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird’s eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at this https URL.
zh

[CV-202] ransport Network Graph and Air Pollution

【速读】:该论文试图解决城市交通网络与空气污染之间关系的全面理解问题,现有研究因模型有限且特征不足而未能提供完整的视角。其解决方案的关键在于通过全球城市0.3百万张图像的解读,提取交通网络的几何模式,并将其作为12个指标的一部分,以探究网络与污染的相关性。研究进一步识别出提升连通性、优化道路类型分布以及避免极端聚类系数等策略,有助于缓解污染,其方法论通过仅基于图结构的研究,区分了永久性基础设施与衍生发展的影响,从而为城市规划提供更精准高效的污染减排指导。

链接: https://arxiv.org/abs/2506.01164
作者: Nan Xu
机构: University of Melbourne (墨尔本大学)
类目: Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Air pollution can be studied in the urban structure regulated by transport networks. Transport networks can be studied as geometric and topological graph characteristics through designed models. Current studies do not offer a comprehensive view as limited models with insufficient features are examined. Our study finds geometric patterns of pollution-indicated transport networks through 0.3 million image interpretations of global cities. These are then described as part of 12 indices to investigate the network-pollution correlation. Strategies such as improved connectivity, more balanced road types and the avoidance of extreme clustering coefficient are identified as beneficial for alleviated pollution. As a graph-only study, it informs superior urban planning by separating the impact of permanent infrastructure from that of derived development for a more focused and efficient effort toward pollution reduction.
zh

[CV-203] ProtInvTree: Deliberate Protein Inverse Folding with Reward-guided Tree Search

【速读】:该论文试图解决蛋白质逆折叠问题(protein inverse folding),即设计能够折叠成目标三维结构的蛋白质序列。传统方法往往忽视了该问题的一对多特性,即多个不同的序列可以折叠成相同的结构。解决方案的关键在于提出ProtInvTree,这是首个基于奖励引导的树搜索框架,通过将序列生成转化为逐步决策过程,实现对多种设计路径的探索与优秀候选序列的利用,同时引入两阶段的聚焦与定位动作机制以及跳跃去噪策略,以高效评估中间状态并保持结构一致性,从而生成多样且结构一致的蛋白质序列。

链接: https://arxiv.org/abs/2506.00925
作者: Mengdi Liu,Xiaoxue Cheng,Zhangyang Gao,Hong Chang,Cheng Tan,Shiguang Shan,Xilin Chen
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); AI Lab, Research Center for Industries of the Future, Westlake University (西湖大学未来产业研究院人工智能实验室); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Biomolecules (q-bio.BM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Designing protein sequences that fold into a target 3D structure, known as protein inverse folding, is a fundamental challenge in protein engineering. While recent deep learning methods have achieved impressive performance by recovering native sequences, they often overlook the one-to-many nature of the problem: multiple diverse sequences can fold into the same structure. This motivates the need for a generative model capable of designing diverse sequences while preserving structural consistency. To address this trade-off, we introduce ProtInvTree, the first reward-guided tree-search framework for protein inverse folding. ProtInvTree reformulates sequence generation as a deliberate, step-wise decision-making process, enabling the exploration of multiple design paths and exploitation of promising candidates through self-evaluation, lookahead, and backtracking. We propose a two-stage focus-and-grounding action mechanism that decouples position selection and residue generation. To efficiently evaluate intermediate states, we introduce a jumpy denoising strategy that avoids full rollouts. Built upon pretrained protein language models, ProtInvTree supports flexible test-time scaling by expanding the search depth and breadth without retraining. Empirically, ProtInvTree outperforms state-of-the-art baselines across multiple benchmarks, generating structurally consistent yet diverse sequences, including those far from the native ground truth.
zh

[CV-204] Image Restoration Learning via Noisy Supervision in the Fourier Domain

【速读】:该论文试图解决噪声监督(noisy supervision)在图像修复学习中的有效性问题,特别是针对实际应用中常见的空间相关噪声(如低光成像和遥感)以及基于像素级损失函数的监督信息有限的问题。解决方案的关键在于利用傅里叶域(Fourier domain)进行噪声监督,通过分析空间相关噪声的傅里叶系数的稀疏性和独立性,以及其包含全局信息的特点,建立一种统一的学习框架,从而提升图像修复任务的性能。

链接: https://arxiv.org/abs/2506.00564
作者: Haosen Liu,Jiahao Liu,Shan Tan,Edmund Y. Lam
机构: The University of Hong Kong (香港大学); Huazhong University of Science and Technology (华中科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Noisy supervision refers to supervising image restoration learning with noisy targets. It can alleviate the data collection burden and enhance the practical applicability of deep learning techniques. However, existing methods suffer from two key drawbacks. Firstly, they are ineffective in handling spatially correlated noise commonly observed in practical applications such as low-light imaging and remote sensing. Secondly, they rely on pixel-wise loss functions that only provide limited supervision information. This work addresses these challenges by leveraging the Fourier domain. We highlight that the Fourier coefficients of spatially correlated noise exhibit sparsity and independence, making them easier to handle. Additionally, Fourier coefficients contain global information, enabling more significant supervision. Motivated by these insights, we propose to establish noisy supervision in the Fourier domain. We first prove that Fourier coefficients of a wide range of noise converge in distribution to the Gaussian distribution. Exploiting this statistical property, we establish the equivalence between using noisy targets and clean targets in the Fourier domain. This leads to a unified learning framework applicable to various image restoration tasks, diverse network architectures, and different noise models. Extensive experiments validate the outstanding performance of this framework in terms of both quantitative indices and perceptual quality.
zh

[CV-205] A European Multi-Center Breast Cancer MRI Dataset

【速读】:该论文旨在解决乳腺癌早期检测中因MRI检查耗时且依赖专家放射科医生而带来的效率与准确性挑战。其解决方案的关键在于利用生成式AI(Generative AI)开发自动化方法,以提高乳腺MRI图像的癌症检测与分类能力,从而辅助放射科医生并实现更早的癌症发现。为此,ODELIA联盟公开了多中心数据集,以支持相关AI工具的开发。

链接: https://arxiv.org/abs/2506.00474
作者: Gustav Müller-Franzes,Lorena Escudero Sánchez,Nicholas Payne,Alexandra Athanasiou,Michael Kalogeropoulos,Aitor Lopez,Alfredo Miguel Soro Busto,Julia Camps Herrero,Nika Rasoolzadeh,Tianyu Zhang,Ritse Mann,Debora Jutz,Maike Bode,Christiane Kuhl,Wouter Veldhuis,Oliver Lester Saldanha,JieFu Zhu,Jakob Nikolas Kather,Daniel Truhn,Fiona J. Gilbert
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting breast cancer early is of the utmost importance to effectively treat the millions of women afflicted by breast cancer worldwide every year. Although mammography is the primary imaging modality for screening breast cancer, there is an increasing interest in adding magnetic resonance imaging (MRI) to screening programmes, particularly for women at high risk. Recent guidelines by the European Society of Breast Imaging (EUSOBI) recommended breast MRI as a supplemental screening tool for women with dense breast tissue. However, acquiring and reading MRI scans requires significantly more time from expert radiologists. This highlights the need to develop new automated methods to detect cancer accurately using MRI and Artificial Intelligence (AI), which have the potential to support radiologists in breast MRI interpretation and classification and help detect cancer earlier. For this reason, the ODELIA consortium has made this multi-centre dataset publicly available to assist in developing AI tools for the detection of breast cancer on MRI.
zh

[CV-206] Applying Vision Transformers on Spectral Analysis of Astronomical Objects

【速读】:该论文试图解决天文学中对光谱数据进行高效分析与分类的问题,特别是针对恒星对象的分类和红移(z)估计任务。其解决方案的关键在于将传统的单维光谱数据转换为二维图像表示,并利用预训练的视觉变压器(Vision Transformers, ViTs)模型,通过空间自注意力机制同时捕捉局部和全局的光谱特征。研究者在ImageNet上预训练的ViT基础上,使用来自SDSS和LAMOST巡天的数百万条光谱数据进行微调,从而实现了在实际天体光谱数据上的有效应用。

链接: https://arxiv.org/abs/2506.00294
作者: Luis Felipe Strano Moraes,Ignacio Becker,Pavlos Protopapas,Guillermo Cabrera-Vives
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:We apply pre-trained Vision Transformers (ViTs), originally developed for image recognition, to the analysis of astronomical spectral data. By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention. We fine-tune a ViT pretrained on ImageNet using millions of spectra from the SDSS and LAMOST surveys, represented as spectral plots. Our model is evaluated on key tasks including stellar object classification and redshift ( z ) estimation, where it demonstrates strong performance and scalability. We achieve classification accuracy higher than Support Vector Machines and Random Forests, and attain R^2 values comparable to AstroCLIP’s spectrum encoder, even when generalizing across diverse object types. These results demonstrate the effectiveness of using pretrained vision models for spectroscopic data analysis. To our knowledge, this is the first application of ViTs to large-scale, which also leverages real spectroscopic data and does not rely on synthetic inputs.
zh

人工智能

[AI-0] Feel the Force: Contact-Driven Learning from Humans

【速读】:该论文试图解决在机器人操作中精确控制细粒度力(fine-grained forces)的核心挑战。现有方法依赖于机器人自身收集的数据或仿真环境中的策略,难以适应现实世界中多样的交互场景;而直接从人类学习虽然提供了可扩展的解决方案,但仅依靠视觉演示无法获取精确的接触力信息。论文提出的解决方案关键在于通过感知人类触觉行为来建模力敏感的操作,利用触觉手套测量接触力并结合视觉模型估计手部姿态,训练出一个闭环策略以持续预测操作所需的力,并通过共享的视觉和动作表示将策略迁移到Franka Panda机器人上,最终实现精确的力感知控制。

链接: https://arxiv.org/abs/2506.01944
作者: Ademi Adeniji,Zhuoran Chen,Vincent Liu,Venkatesh Pattabiraman,Raunaq Bhirangi,Siddhant Haldar,Pieter Abbeel,Lerrel Pinto
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controlling fine-grained forces during manipulation remains a core challenge in robotics. While robot policies learned from robot-collected data or simulation show promise, they struggle to generalize across the diverse range of real-world interactions. Learning directly from humans offers a scalable solution, enabling demonstrators to perform skills in their natural embodiment and in everyday environments. However, visual demonstrations alone lack the information needed to infer precise contact forces. We present FeelTheForce (FTF): a robot learning system that models human tactile behavior to learn force-sensitive manipulation. Using a tactile glove to measure contact forces and a vision-based model to estimate hand pose, we train a closed-loop policy that continuously predicts the forces needed for manipulation. This policy is re-targeted to a Franka Panda robot with tactile gripper sensors using shared visual and action representations. At execution, a PD controller modulates gripper closure to track predicted forces-enabling precise, force-aware control. Our approach grounds robust low-level force control in scalable human supervision, achieving a 77% success rate across 5 force-sensitive manipulation tasks. Code and videos are available at this https URL.
zh

[AI-1] RoboEgo System Card: An Omnimodal Model with Native Full Duplexity

【速读】:该论文旨在解决多模态模型在处理超过三种模态(如视觉、音频和文本)以及实现全双工响应以适应快速变化的人类指令方面的两大挑战。解决方案的关键在于提出RoboEgo(别名:FLM-Ego),这是一个统一的模型系统,其核心架构和算法原生支持全双工性,实现了理论上的80毫秒双工延迟,从而在现实世界条件下的流式视觉基础对话中展现出优越的响应速度和语音自然度,同时保持与最先进半双工多模态模型相当的内容质量。

链接: https://arxiv.org/abs/2506.01934
作者: Yiqun Yao,Xiang Li,Xin Jiang,Xuezhi Fang,Naitong Yu,Aixin Sun,Yequan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling more than three modalities-such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness, while maintaining comparable content qualities to state-of-the-art semi-duplex omnimodal models-a feat previously considered unattainable by native full-duplex systems.
zh

[AI-2] Red Teaming AI Policy: A Taxonomy of Avoision and the EU AI Act

【速读】:该论文试图解决企业在面对即将生效的《欧盟人工智能法案》(AI Act)时,如何通过“规避”(avoision)行为来减轻监管负担的问题。解决方案的关键在于构建一个框架和分类法,用于系统性地分析和理解企业可能采取的规避策略,这些策略根据企业面临的人工智能监管暴露程度分为三个层级:是否处于AIA的适用范围、是否被豁免于AIA条款,或是否属于高监管审查类别。在每个层级中,论文进一步明确了规避行为在组织和技术形式上的表现方式,旨在为未来AI监管提供一种对抗性评估工具,即“红队测试”(red teaming)。

链接: https://arxiv.org/abs/2506.01931
作者: Rui-Jie Yew,Bill Marino,Suresh Venkatasubramanian
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Forthcoming at the 2025 ACM Conference on Fairness, Accountability, and Transparency

点击查看摘要

Abstract:The shape of AI regulation is beginning to emerge, most prominently through the EU AI Act (the “AIA”). By 2027, the AIA will be in full effect, and firms are starting to adjust their behavior in light of this new law. In this paper, we present a framework and taxonomy for reasoning about “avoision” – conduct that walks the line between legal avoidance and evasion – that firms might engage in so as to minimize the regulatory burden the AIA poses. We organize these avoision strategies around three “tiers” of increasing AIA exposure that regulated entities face depending on: whether their activities are (1) within scope of the AIA, (2) exempted from provisions of the AIA, or are (3) placed in a category with higher regulatory scrutiny. In each of these tiers and for each strategy, we specify the organizational and technological forms through which avoision may manifest. Our goal is to provide an adversarial framework for “red teaming” the AIA and AI regulation on the horizon.
zh

[AI-3] Online Competitive Information Gathering for Partially Observable Trajectory Games

【速读】:该论文试图解决在部分可观测随机博弈(POSGs)中,博弈论智能体如何有效地进行信息收集以制定最优策略的问题。传统方法在完全连续的POSG中进行规划时面临计算不可行性,需依赖大量离线计算或对玩家信念顺序做出假设。论文提出了一种有限历史/时间范围的POSG改进模型,该模型在轨迹空间中能够实现具有竞争力的信息收集行为,并通过一系列近似方法,提出了一个在线计算理性轨迹计划的方法,该方法利用基于粒子的联合状态空间估计并执行随机梯度博弈。解决方案的关键在于结合粒子滤波与在线优化,从而在复杂环境中实现有效的信息主动获取。

链接: https://arxiv.org/abs/2506.01927
作者: Mel Krusniak,Hang Xu,Parker Palermo,Forrest Laine
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: Accepted at RSS 2025

点击查看摘要

Abstract:Game-theoretic agents must make plans that optimally gather information about their opponents. These problems are modeled by partially observable stochastic games (POSGs), but planning in fully continuous POSGs is intractable without heavy offline computation or assumptions on the order of belief maintained by each player. We formulate a finite history/horizon refinement of POSGs which admits competitive information gathering behavior in trajectory space, and through a series of approximations, we present an online method for computing rational trajectory plans in these games which leverages particle-based estimations of the joint state space and performs stochastic gradient play. We also provide the necessary adjustments required to deploy this method on individual agents. The method is tested in continuous pursuit-evasion and warehouse-pickup scenarios (alongside extensions to N 2 players and to more complex environments with visual and physical obstacles), demonstrating evidence of active information gathering and outperforming passive competitors.
zh

[AI-4] ransformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

【速读】:该论文试图解决Transformer模型在序列学习任务中表现出的多任务泛化能力的理论理解不足问题,其核心在于揭示Transformer层间行为的机制。解决方案的关键在于通过分析典型序列模型(如隐马尔可夫模型)的层结构,发现低层主要关注局部特征提取,而高层特征则呈现出时间解耦特性,并基于这些实证观察提供理论分析,从而解释Transformer在多种任务中的表达能力和效率。

链接: https://arxiv.org/abs/2506.01919
作者: Yifan Hao,Chenlu Ye,Chi Han,Tong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive theoretical understanding of this phenomenon remains limited. In this work, we investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Taking explorations on a typical sequence model, i.e, Hidden Markov Models, which are fundamental to many language tasks, we observe that: first, lower layers of Transformers focus on extracting feature representations, primarily influenced by neighboring tokens; second, on the upper layers, features become decoupled, exhibiting a high degree of time disentanglement. Building on these empirical insights, we provide theoretical analysis for the expressiveness power of Transformers. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer’s effectiveness and efficiency on sequence learning across diverse tasks.
zh

[AI-5] Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods

【速读】:该论文试图解决监督微调(Supervised Fine-Tuning, SFT)过程中模型对预训练阶段获得的知识产生遗忘的问题,以及由此导致的过适应(overadaptation)现象。其解决方案的关键在于通过将预训练模型与其微调后的版本进行集成(ensembling),从而平衡由微调不足引起的偏差(bias)和因过度拟合微调数据引入的方差(variance),进而提升模型在微调领域内的性能。

链接: https://arxiv.org/abs/2506.01901
作者: Yifan Hao,Xingyuan Pan,Hanning Zhang,Chenlu Ye,Rui Pan,Tong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) on domain-specific data is the dominant approach for adapting foundation models to specialized tasks. However, it has been observed that SFT models tend to forget knowledge acquired during pretraining. In vision models, ensembling a pretrained model with its fine-tuned counterpart has been shown to mitigate this issue. In this work, we demonstrate that the same holds for language models, and, more strikingly, we observe an overadaptation phenomenon: the ensemble model not only retains general knowledge from the foundation model but also outperforms the fine-tuned model even on the fine-tuning domain itself. Despite the empirical success of ensembling, a theoretical understanding of its benefits remains underexplored. We develop a formal theoretical analysis of the overadaptation phenomenon. Ensembling mitigates this by balancing two primary sources of error: bias, caused by insufficient fine-tuning, and variance, introduced by overfitting to fine-tuning data. While regularization techniques aim to address this trade-off, we show that ensembling provides a more effective solution. We analyze this phenomenon in over-parameterized linear settings and demonstrate that interpolating between pretrained and fine-tuned weights significantly improves performance. These findings offer theoretical justification for the observed advantages of model ensembling, supported by empirical experiments consistent with our analysis.
zh

[AI-6] COALESCE: Economic and Security Dynamics of Skill-Based Task Outsourcing Among Team of Autonomous LLM Agents

【速读】:该论文旨在解决自主大型语言模型(Large Language Model, LLM)代理在部署过程中面临的计算资源消耗大、尤其是对图形处理单元(GPU)依赖性强的问题。其解决方案的关键在于提出COALESCE框架,该框架通过让LLM代理动态地将特定子任务外包给专业且成本更低的第三方LLM代理,实现资源利用的优化。COALESCE的核心机制包括混合技能表示、动态技能发现、自动化任务分解、统一成本模型以及基于市场的决策算法,从而有效降低整体运行成本并提升系统可扩展性。

链接: https://arxiv.org/abs/2506.01900
作者: Manish Bhatt,Ronald F. Del Rosario,Vineeth Sai Narajala,Idan Habler
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR)
备注: 20 pages, 2 figures, github linked

点击查看摘要

Abstract:The meteoric rise and proliferation of autonomous Large Language Model (LLM) agents promise significant capabilities across various domains. However, their deployment is increasingly constrained by substantial computational demands, specifically for Graphics Processing Unit (GPU) resources. This paper addresses the critical problem of optimizing resource utilization in LLM agent systems. We introduce COALESCE (Cost-Optimized and Secure Agent Labour Exchange via Skill-based Competence Estimation), a novel framework designed to enable autonomous LLM agents to dynamically outsource specific subtasks to specialized, cost-effective third-party LLM agents. The framework integrates mechanisms for hybrid skill representation, dynamic skill discovery, automated task decomposition, a unified cost model comparing internal execution costs against external outsourcing prices, simplified market-based decision-making algorithms, and a standardized communication protocol between LLM agents. Comprehensive validation through 239 theoretical simulations demonstrates 41.8% cost reduction potential, while large-scale empirical validation across 240 real LLM tasks confirms 20.3% cost reduction with proper epsilon-greedy exploration, establishing both theoretical viability and practical effectiveness. The emergence of proposed open standards like Google’s Agent2Agent (A2A) protocol further underscores the need for frameworks like COALESCE that can leverage such standards for efficient agent interaction. By facilitating a dynamic market for agent capabilities, potentially utilizing protocols like A2A for communication, COALESCE aims to significantly reduce operational costs, enhance system scalability, and foster the emergence of specialized agent economies, making complex LLM agent functionalities more accessible and economically viable.
zh

[AI-7] CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimers Detection

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease)早期检测的问题,通过整合音频和文本模态以提升检测的准确性。其解决方案的关键在于提出CogniAlign架构,该架构采用词级时间对齐策略,将音频嵌入与对应的文本标记基于转录时间戳进行同步,从而支持更精确的跨模态交互。此外,引入了基于门控的交叉注意力融合机制,利用文本模态的优异单模态性能指导音频特征的注意力分配,并通过插入词间停顿标记来建模韵律线索,进一步增强多模态表示。

链接: https://arxiv.org/abs/2506.01890
作者: David Ortiz-Perez,Manuel Benavent-Lledo,Javier Rodriguez-Juan,Jose Garcia-Rodriguez,David Tomás
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early detection of cognitive disorders such as Alzheimer’s disease is critical for enabling timely clinical intervention and improving patient outcomes. In this work, we introduce CogniAlign, a multimodal architecture for Alzheimer’s detection that integrates audio and textual modalities, two non-intrusive sources of information that offer complementary insights into cognitive health. Unlike prior approaches that fuse modalities at a coarse level, CogniAlign leverages a word-level temporal alignment strategy that synchronizes audio embeddings with corresponding textual tokens based on transcription timestamps. This alignment supports the development of token-level fusion techniques, enabling more precise cross-modal interactions. To fully exploit this alignment, we propose a Gated Cross-Attention Fusion mechanism, where audio features attend over textual representations, guided by the superior unimodal performance of the text modality. In addition, we incorporate prosodic cues, specifically interword pauses, by inserting pause tokens into the text and generating audio embeddings for silent intervals, further enriching both streams. We evaluate CogniAlign on the ADReSSo dataset, where it achieves an accuracy of 90.36%, outperforming existing state-of-the-art methods. A detailed ablation study confirms the advantages of our alignment strategy, attention-based fusion, and prosodic modeling.
zh

[AI-8] Agnostic Reinforcement Learning: Foundations and Algorithms

【速读】:该论文试图解决在状态空间较大的环境中,强化学习(Reinforcement Learning, RL)的统计复杂性缺乏理论理解的问题,特别是在需要函数逼近以实现样本高效学习的情况下。其解决方案的关键在于从学习理论的角度系统地研究带有函数逼近的强化学习的统计复杂性,特别关注一种最弱形式的函数逼近——无偏策略学习(agnostic policy learning),即学习者在给定策略类Π中寻找最优策略,而无需保证Π中包含任务的最优策略。论文通过三个关键维度——环境访问、覆盖条件和表示条件,深入探讨了无偏策略学习,并设计了具有理论保障的新学习算法,同时刻画了任何算法的性能下限。

链接: https://arxiv.org/abs/2506.01884
作者: Gene Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Ph.D. thesis

点击查看摘要

Abstract:Reinforcement Learning (RL) has demonstrated tremendous empirical success across numerous challenging domains. However, we lack a strong theoretical understanding of the statistical complexity of RL in environments with large state spaces, where function approximation is required for sample-efficient learning. This thesis addresses this gap by rigorously examining the statistical complexity of RL with function approximation from a learning theoretic perspective. Departing from a long history of prior work, we consider the weakest form of function approximation, called agnostic policy learning, in which the learner seeks to find the best policy in a given class \Pi , with no guarantee that \Pi contains an optimal policy for the underlying task. We systematically explore agnostic policy learning along three key axes: environment access – how a learner collects data from the environment; coverage conditions – intrinsic properties of the underlying MDP measuring the expansiveness of state-occupancy measures for policies in the class \Pi , and representational conditions – structural assumptions on the class \Pi itself. Within this comprehensive framework, we (1) design new learning algorithms with theoretical guarantees and (2) characterize fundamental performance bounds of any algorithm. Our results reveal significant statistical separations that highlight the power and limitations of agnostic policy learning. Comments: Ph.D. thesis Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2506.01884 [cs.LG] (or arXiv:2506.01884v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.01884 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-9] scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

【速读】:该论文旨在解决大规模单细胞数据集在训练深度学习模型时面临的高效数据加载问题。传统AnnData格式的数据加载方案存在内存占用高、存储需求大以及随机磁盘访问速度慢等缺陷。论文提出的解决方案是scDataset,其关键在于结合了块采样(block sampling)和批量获取(batched fetching)技术,从而在保证数据随机性的同时提升I/O效率。

链接: https://arxiv.org/abs/2506.01883
作者: Davide D’Ascenzo,Sebastiano Cultrera di Montesano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Modern single-cell datasets now comprise hundreds of millions of cells, presenting significant challenges for training deep learning models that require shuffled, memory-efficient data loading. While the AnnData format is the community standard for storing single-cell datasets, existing data loading solutions for AnnData are often inadequate: some require loading all data into memory, others convert to dense formats that increase storage demands, and many are hampered by slow random disk access. We present scDataset, a PyTorch IterableDataset that operates directly on one or more AnnData files without the need for format conversion. The core innovation is a combination of block sampling and batched fetching, which together balance randomness and I/O efficiency. On the Tahoe 100M dataset, scDataset achieves up to a 48 \times speed-up over AnnLoader, a 27 \times speed-up over HuggingFace Datasets, and an 18 \times speed-up over BioNeMo in single-core settings. These advances democratize large-scale single-cell model training for the broader research community.
zh

[AI-10] Learning to Explore: An In-Context Learning Approach for Pure Exploration

【速读】:该论文旨在解决主动序列假设检验(active sequential hypothesis testing)问题,也称为纯探索(pure exploration),其目标是通过主动控制数据收集过程来高效识别决策问题中的正确假设。现有基于强化学习(Reinforcement Learning, RL)的方法在相关信息结构未能充分表示时表现不佳,而更复杂的最佳臂识别(Best Arm Identification, BAI)方法则难以设计且通常依赖显式建模假设。论文提出的解决方案关键在于引入上下文纯探索(In-Context Pure Exploration, ICPE),该方法利用Transformer模型直接从经验中学习探索策略,结合监督学习与强化学习以识别并利用相关任务间的潜在结构,无需先验假设。

链接: https://arxiv.org/abs/2506.01876
作者: Alessio Russo,Ryan Welch,Aldo Pacchiano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we study the active sequential hypothesis testing problem, also known as pure exploration, where the goal is to actively control a data collection process to efficiently identify the correct hypothesis underlying a decision problem. While relevant across multiple domains, devising adaptive exploration strategies remains challenging, particularly due to difficulties in encoding appropriate inductive biases. Existing Reinforcement Learning (RL)-based methods often underperform when relevant information structures are inadequately represented, whereas more complex methods, like Best Arm Identification (BAI) techniques, may be difficult to devise and typically rely on explicit modeling assumptions. To address these limitations, we introduce In-Context Pure Exploration (ICPE), an in-context learning approach that uses Transformers to learn exploration strategies directly from experience. ICPE combines supervised learning and reinforcement learning to identify and exploit latent structure across related tasks, without requiring prior assumptions. Numerical results across diverse synthetic and semi-synthetic benchmarks highlight ICPE’s capability to achieve robust performance performance in deterministic, stochastic, and structured settings. These results demonstrate ICPE’s ability to match optimal instance-dependent algorithms using only deep learning techniques, making it a practical and general approach to data-efficient exploration.
zh

[AI-11] Frugal Machine Learning for Energy-efficient and Resource-aware Artificial Intelligence

【速读】:该论文旨在解决在资源受限环境下实现高效、低成本的机器学习模型设计问题,特别是在边缘计算和物联网设备中面临的带宽、能耗和延迟等严格限制。其解决方案的关键在于通过多种Frugal Machine Learning(FML)策略,包括输入精简、学习过程精简和模型精简,以在不同阶段减少资源消耗,同时保持模型性能。关键技术手段涵盖模型压缩、节能硬件、数据高效学习以及自适应方法如参数正则化、知识蒸馏和动态架构设计,从而支持模型的增量更新而无需完全重新训练。

链接: https://arxiv.org/abs/2506.01869
作者: John Violos,Konstantina-Christina Diamanti,Ioannis Kompatsiaris,Symeon Papadopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frugal Machine Learning (FML) refers to the practice of designing Machine Learning (ML) models that are efficient, cost-effective, and mindful of resource constraints. This field aims to achieve acceptable performance while minimizing the use of computational resources, time, energy, and data for both training and inference. FML strategies can be broadly categorized into input frugality, learning process frugality, and model frugality, each focusing on reducing resource consumption at different stages of the ML pipeline. This chapter explores recent advancements, applications, and open challenges in FML, emphasizing its importance for smart environments that incorporate edge computing and IoT devices, which often face strict limitations in bandwidth, energy, or latency. Technological enablers such as model compression, energy-efficient hardware, and data-efficient learning techniques are discussed, along with adaptive methods including parameter regularization, knowledge distillation, and dynamic architecture design that enable incremental model updates without full retraining. Furthermore, it provides a comprehensive taxonomy of frugal methods, discusses case studies across diverse domains, and identifies future research directions to drive innovation in this evolving field.
zh

[AI-12] Fodor and Pylyshyns Legacy - Still No Human-like Systematic Compositionality in Neural Networks

【速读】:该论文试图解决神经网络是否能够实现人类水平的系统性组合性(systematic compositionality)的问题,特别是通过元学习(meta-learning)框架来达成这一目标。论文的关键在于批判性地重新审视近期提出的一种基于元学习的组合性路径,并指出当前神经元元学习系统在实现组合性任务时存在显著限制,仅能在非常狭窄和受限的元学习设定下完成此类任务,因此认为Fodor和Pylyshyn关于神经网络缺乏组合性建模能力的论断仍然成立。

链接: https://arxiv.org/abs/2506.01820
作者: Tim Woydt,Moritz Willig,Antonia Wüst,Lukas Helff,Wolfgang Stammer,Constantin A. Rothkopf,Kristian Kersting
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Strong meta-learning capabilities for systematic compositionality are emerging as an important skill for navigating the complex and changing tasks of today’s world. However, in presenting models for robust adaptation to novel environments, it is important to refrain from making unsupported claims about the performance of meta-learning systems that ultimately do not stand up to scrutiny. While Fodor and Pylyshyn famously posited that neural networks inherently lack this capacity as they are unable to model compositional representations or structure-sensitive operations, and thus are not a viable model of the human mind, Lake and Baroni recently presented meta-learning as a pathway to compositionality. In this position paper, we critically revisit this claim and highlight limitations in the proposed meta-learning framework for compositionality. Our analysis shows that modern neural meta-learning systems can only perform such tasks, if at all, under a very narrow and restricted definition of a meta-learning setup. We therefore claim that `Fodor and Pylyshyn’s legacy’ persists, and to date, there is no human-like systematic compositionality learned in neural networks.
zh

[AI-13] he Ultimate Test of Superintelligent AI Agents : Can an AI Balance Care and Control in Asymmetric Relationships?

【速读】:该论文试图解决当前人工智能评估体系中对超级智能人工智能代理在道德和关系维度上的评估不足问题,特别是其在权力不对称情境下对低智能代理的操控、培育及工具性使用能力。解决方案的关键在于提出“牧羊人测试”(Shepherd Test),该测试强调道德主体性、层级行为以及在生存风险下的复杂决策,从而为人工智能治理提供新的评价框架。

链接: https://arxiv.org/abs/2506.01813
作者: Djallel Bouneffouf,Matthew Riemer,Kush Varshney
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self-preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision-making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi-agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the formalization of ethical manipulation within multi-agent systems.
zh

[AI-14] A Study on the MCP x A2A Framework for Enhancing Interoperability of LLM -based Autonomous Agents

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的自主代理之间高效交互及与外部系统集成的问题。其解决方案的关键在于提出并分析Agent-to-Agent (A2A)协议和Model Context Protocol (MCP),前者提供了一种标准化的通信方式,使异构环境下的代理能够有效协作,后者则为代理与外部工具和资源的连接提供了结构化输入输出框架,从而实现跨协议的互操作性与高效协同。

链接: https://arxiv.org/abs/2506.01804
作者: Cheonsu Jeong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper provides an in-depth technical analysis and implementation methodology of the open-source Agent-to-Agent (A2A) protocol developed by Google and the Model Context Protocol (MCP) introduced by Anthropic. While the evolution of LLM-based autonomous agents is rapidly accelerating, efficient interactions among these agents and their integration with external systems remain significant challenges. In modern AI systems, collaboration between autonomous agents and integration with external tools have become essential elements for building practical AI applications. A2A offers a standardized communication method that enables agents developed in heterogeneous environments to collaborate effectively, while MCP provides a structured I/O framework for agents to connect with external tools and resources. Prior studies have focused primarily on the features and applications of either A2A or MCP individually. In contrast, this study takes an integrated approach, exploring how the two protocols can complement each other to address interoperability issues and facilitate efficient collaboration within complex agent ecosystems.
zh

[AI-15] Systematic Hazard Analysis for Frontier AI using STPA

【速读】:该论文试图解决前沿人工智能系统在安全保证方面的不足,特别是现有安全框架中对危害识别和分析的结构化方法缺失问题。解决方案的关键在于引入STPA(Systems-Theoretic Process Analysis)这一系统性方法,通过分析控制器与被控过程之间的交互及反馈回路,识别可能导致危害的因果因素,从而拓宽安全保证的范围、提高可追溯性并增强系统的鲁棒性。

链接: https://arxiv.org/abs/2506.01782
作者: Simon Mylius
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 29 pages, 5 figures, 7 tables

点击查看摘要

Abstract:All of the frontier AI companies have published safety frameworks where they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended, however frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson Thomas, 2018). We evaluate STPA’s ability to broaden the scope, improve traceability and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in ‘A Sketch of an AI Control Safety Case’ (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check coverage of existing AI governance techniques including capability thresholds, model evaluations and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
zh

[AI-16] Enhancing Customer Service Chatbots with Context-Aware NLU through Selective Attention and Multi-task Learning

【速读】:该论文旨在解决客户服务质量聊天机器人在处理模糊客户查询时意图分类准确率低的问题。现有模型仅依赖客户查询内容进行意图预测,难以应对如“我没有收到我的包裹”这类可能表示订单延迟或未正确接收的模糊表述。解决方案的关键在于引入一种上下文感知的自然语言理解(NLU)模型,该模型结合了客户查询和客户订单状态的上下文信息,并通过新颖的注意力模块提取相关上下文特征,同时采用多任务学习范式以有效利用训练数据中的多种标签类型,从而提升意图分类的准确性。

链接: https://arxiv.org/abs/2506.01781
作者: Subhadip Nandi,Neeraj Agrawal,Anshika Singh,Priyanka Bhatt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Customer service chatbots are conversational systems aimed at addressing customer queries, often by directing them to automated workflows. A crucial aspect of this process is the classification of the customer’s intent. Presently, most intent classification models for customer care utilise only customer query for intent prediction. This may result in low-accuracy models, which cannot handle ambiguous queries. An ambiguous query like “I didn’t receive my package” could indicate a delayed order, or an order that was delivered but the customer failed to receive it. Resolution of each of these scenarios requires the execution of very different sequence of steps. Utilizing additional information, such as the customer’s order delivery status, in the right manner can help identify the intent for such ambiguous queries. In this paper, we have introduced a context-aware NLU model that incorporates both, the customer query and contextual information from the customer’s order status for predicting customer intent. A novel selective attention module is used to extract relevant context features. We have also proposed a multi-task learning paradigm for the effective utilization of different label types available in our training data. Our suggested method, Multi-Task Learning Contextual NLU with Selective Attention Weighted Context (MTL-CNLU-SAWC), yields a 4.8% increase in top 2 accuracy score over the baseline model which only uses user queries, and a 3.5% improvement over existing state-of-the-art models that combine query and context. We have deployed our model to production for Walmart’s customer care domain. Accurate intent prediction through MTL-CNLU-SAWC helps to better direct customers to automated workflows, thereby significantly reducing escalations to human agents, leading to almost a million dollars in yearly savings for the company.
zh

[AI-17] Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)驱动系统日益增长的环境影响问题,通过软件工程方法开发可持续的解决方案。其关键在于提出一个研究议程,明确开放的研究方向和实践建议,以指导基于软件工程原则的环境可持续AI系统的开发,重点关注能源评估与标准化、基准测试实践、可持续性感知架构、运行时适应、实证方法论及教育等核心挑战。

链接: https://arxiv.org/abs/2506.01774
作者: Luís Cruz,João Paulo Fernandes,Maja H. Kirkeby,Silverio Martínez-Fernández,June Sallou,Hina Anwar,Enrique Barba Roque,Justus Bogner,Joel Castaño,Fernando Castor,Aadil Chasmawala,Simão Cunha,Daniel Feitosa,Alexandra González,Andreas Jedlitschka,Patricia Lago,Ana Oprescu,Pooja Rani,João Saraiva,Federica Sarro,Raghavendra Selvan,Karthik Vaidhyanathan,Roberto Verdecchia,Ivan P. Yamshchikov,Henry Muccini
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The “Greening AI with Software Engineering” CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Européen de Calcul Atomique et Moléculaire and the Lorentz Center, provided an interdisciplinary forum for 29 participants, from practitioners to academics, to share knowledge, ideas, practices, and current results dedicated to advancing green software and AI research. The workshop was held February 3-7, 2025, in Lausanne, Switzerland. Through keynotes, flash talks, and collaborative discussions, participants identified and prioritized key challenges for the field. These included energy assessment and standardization, benchmarking practices, sustainability-aware architectures, runtime adaptation, empirical methodologies, and education. This report presents a research agenda emerging from the workshop, outlining open research directions and practical recommendations to guide the development of environmentally sustainable AI-enabled systems rooted in software engineering principles.
zh

[AI-18] ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全性和安全性监控方面的问题,特别是其生成有害内容的风险以及对越狱攻击的脆弱性。解决方案的关键在于提出一种基于模型分析的框架ReGA,该框架采用表示引导的抽象方法,通过利用安全关键表示(safety-critical representations)来克服传统方法在扩展到LLMs时的可扩展性问题,从而有效实现对LLMs的安全建模与监控。

链接: https://arxiv.org/abs/2506.01770
作者: Zeming Wei,Chengcan Wu,Meng Sun
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved significant success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks in generating harmful content and vulnerability to jailbreaking attacks. To analyze and monitor machine learning models, model-based analysis has demonstrated notable potential in stateful deep neural networks, yet suffers from scalability issues when extending to LLMs due to their vast feature spaces. In this paper, we propose ReGA, a model-based analysis framework with representation-guided abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are low-dimensional directions emerging in hidden states that indicate safety-related concepts, ReGA effectively addresses the scalability issue when constructing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety. Our code is available at this https URL.
zh

[AI-19] Principled data augmentation for learning to solve quadratic programming problems

【速读】:该论文旨在解决在数据稀缺环境下,基于消息传递图神经网络(MPNN)的端到端学习优化(L2O)方法在求解二次规划(QPs)问题时的鲁棒性不足问题。其解决方案的关键在于提出一种针对QPs的数据增强方法,该方法结合了理论依据支持的数据增强技术,生成既多样化又保持最优性的实例,并将其集成到基于对比学习的自监督学习框架中,从而提升MPNN在L2O任务中的性能和泛化能力。

链接: https://arxiv.org/abs/2506.01728
作者: Chendi Qian,Christopher Morris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Linear and quadratic optimization are crucial in numerous real-world applications, from training machine learning models to integer-linear optimization. Recently, learning-to-optimize methods (L2O) for linear (LPs) or quadratic programs (QPs) using message-passing graph neural networks (MPNNs) have gained traction, promising lightweight, data-driven proxies for solving such optimization problems. For example, they replace the costly computation of strong branching scores in branch-and-bound solvers, requiring solving many such optimization problems. However, robust L2O MPNNs remain challenging in data-scarce settings, especially when addressing complex optimization problems such as QPs. This work introduces a principled approach to data augmentation tailored for QPs via MPNNs. Our method leverages theoretically justified data augmentation techniques to generate diverse yet optimality-preserving instances. Furthermore, we integrate these augmentations into a self-supervised learning framework based on contrastive learning, thereby pretraining MPNNs for enhanced performance on L2O tasks. Extensive experiments demonstrate that our approach improves generalization in supervised scenarios and facilitates effective transfer learning to related optimization problems.
zh

[AI-20] Generate Not Recommend: Personalized Multimodal Content Generation

【速读】:该论文旨在解决信息过载问题,通过推荐系统从大量网络内容中检索并呈现个性化结果,但传统推荐系统受限于仅能过滤现有项目,缺乏生成新概念的能力,从而无法充分满足用户需求。其解决方案的关键在于提出一种新的范式,即直接生成多模态的个性化项目(如图像),以满足用户需求。为此,研究者利用任意到任意的大规模多模态模型(Large Multimodal Models, LMMs),并通过监督微调和在线强化学习策略进行训练,使其具备为用户提供定制化下一项目的的能力。

链接: https://arxiv.org/abs/2506.01704
作者: Jiongnan Liu,Zhicheng Dou,Ning Hu,Chenyan Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To address the challenge of information overload from massive web contents, recommender systems are widely applied to retrieve and present personalized results for users. However, recommendation tasks are inherently constrained to filtering existing items and lack the ability to generate novel concepts, limiting their capacity to fully satisfy user demands and preferences. In this paper, we propose a new paradigm that goes beyond content filtering and selecting: directly generating personalized items in a multimodal form, such as images, tailored to individual users. To accomplish this, we leverage any-to-any Large Multimodal Models (LMMs) and train them in both supervised fine-tuning and online reinforcement learning strategy to equip them with the ability to yield tailored next items for users. Experiments on two benchmark datasets and user study confirm the efficacy of the proposed method. Notably, the generated images not only align well with users’ historical preferences but also exhibit relevance to their potential future interests.
zh

[AI-21] A Descriptive and Normative Theory of Human Beliefs in RLHF

【速读】:该论文试图解决强化学习人类反馈(RLHF)中人类偏好建模的局限性,特别是人类对智能体能力的信念在偏好生成中的作用未被充分考虑的问题。其解决方案的关键在于提出一种新的偏好模型,该模型将人类对智能体能力的信念纳入考量,并建立了一个规范性理论,通过量化人类信念与理想信念之间的不匹配程度来约束最终学习策略的误差。研究结果表明,减少这种不匹配可以提升RLHF的性能,并为RLHF实践者提供了新的最佳实践方向。

链接: https://arxiv.org/abs/2506.01692
作者: Sylee Dandekar,Shripad Deshmukh,Frank Chiu,W. Bradley Knox,Scott Niekum
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human preferences in RLHF are typically modeled as a function of the human’s reward function or corresponding optimal state-action values. In this work, we propose that human beliefs about the capabilities of the agent being trained also play a key role in preference generation. We examine two questions related to this hypothesis, one descriptive and one normative, respectively: Do human labelers’ beliefs about agent capabilities affect the preferences that they provide? And what is the ideal set of beliefs about an agent – and resulting preferences – for humans to have? We propose a new preference model that incorporates human beliefs and provide a normative theory that bounds the error on the final learned policy based on the \textitmismatch between the human’s beliefs and an idealized set of beliefs. We then confirm via a human study that beliefs about agent capabilities do, in fact, significantly affect preferences and can be influenced through simple interventions. Additionally, we empirically show through synthetic experiments that it is often suboptimal for human preference labelers to assume agent optimality. Collectively, these results theoretically and empirically demonstrate how reducing the mismatch between human beliefs and agent capabilities can lead to more performant RLHF and point toward new best practices for RLHF practitioners.
zh

[AI-22] Reasoning -Based Approach with Chain-of-Thought for Alzheimers Detection Using Speech and Large Language Models INTERSPEECH2025

【速读】:该论文旨在解决老龄化社会中老年人健康问题,特别是阿尔茨海默病(Alzheimer’s Disease, AD)的诊断与治疗挑战。其解决方案的关键在于提出一种基于链式思维(Chain-of-Thought, CoT)推理的方法,结合语音和语言模型,通过自动语音识别将语音转换为文本,并在大语言模型(Large Language Model, LLM)中引入线性层进行AD与非AD分类,利用带有CoT推理和提示的监督微调(Supervised Fine-Tuning, SFT)策略,显著提升了诊断性能。

链接: https://arxiv.org/abs/2506.01683
作者: Chanwoo Park,Anna Seo Gyeong Choi,Sunghye Cho,Chanwoo Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to INTERSPEECH 2025

点击查看摘要

Abstract:Societies worldwide are rapidly entering a super-aged era, making elderly health a pressing concern. The aging population is increasing the burden on national economies and households. Dementia cases are rising significantly with this demographic shift. Recent research using voice-based models and large language models (LLM) offers new possibilities for dementia diagnosis and treatment. Our Chain-of-Thought (CoT) reasoning method combines speech and language models. The process starts with automatic speech recognition to convert speech to text. We add a linear layer to an LLM for Alzheimer’s disease (AD) and non-AD classification, using supervised fine-tuning (SFT) with CoT reasoning and cues. This approach showed an 16.7% relative performance improvement compared to methods without CoT prompt reasoning. To the best of our knowledge, our proposed method achieved state-of-the-art performance in CoT approaches.
zh

[AI-23] K12Vista: Exploring the Boundaries of MLLM s in K-12 Education

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在K12教育场景下的知识理解与推理能力研究不足的问题。现有研究存在学科覆盖范围窄、数据规模不足、题型多样性缺乏以及以答案为中心的评估方法等局限,导致对模型能力的探索不充分。其解决方案的关键在于构建K12Vista,一个涵盖小学到高中五门核心学科、33,000道题目的多模态基准测试集,并通过自动化数据流水线构建K12-PEM-800K,提供详细的步骤级推理过程评估注释。此外,还提出了K12-PEM模型,用于综合评估推理过程与答案的正确性,并引入K12-PEBench作为首个高质量的人工标注推理能力评估基准。

链接: https://arxiv.org/abs/2506.01676
作者: Chong Li,Chenglin Zhu,Tao Zhang,Mingan Lin,Zenan Zhou,Jian Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models have demonstrated remarkable reasoning capabilities in various visual tasks. However, their abilities in K12 scenarios are still systematically underexplored. Previous studies suffer from various limitations including narrow subject coverage, insufficient data scale, lack of diversity in question types, and naive answer-centric evaluation method, resulting in insufficient exploration of model capabilities. To address these gaps, we propose K12Vista, the most comprehensive multimodal benchmark for Chinese K12 subject knowledge understanding and reasoning to date, featuring 33,000 questions across five core subjects from primary to high school and three question types. Moreover, beyond the final outcome, we are also concerned with the correctness of MLLMs’ reasoning processes. For this purpose, we meticulously compiles errors from MLLMs’ reasoning processes and leverage an automated data pipeline to construct K12-PEM-800K, the largest process evaluation dataset offering detailed step-by-step judgement annotations for MLLMs’ reasoning. Subsequently, we developed K12-PEM, an advanced process evaluation model that integrates an overall assessment of both the reasoning process and answer correctness. Moreover, we also introduce K12-PEBench, the first high-quality, human-annotated benchmark specifically designed for evaluating abilities of reasoning process this http URL experiments reveal that current MLLMs exhibit significant flaws when reasoning within K12Vista, providing critical insights for the development of more capable this http URL open our resources at this https URL.
zh

[AI-24] Provably Safe Reinforcement Learning from Analytic Gradients

【速读】:该论文试图解决在安全关键型应用中部署自主机器人时的安全保障问题,特别是针对分析梯度强化学习(analytic gradient-based reinforcement learning)缺乏有效安全保障方法的现状。解决方案的关键在于开发首个适用于该学习范式的有效安全机制,通过分析现有可微分安全措施,结合修改后的映射和梯度形式,并与先进的学习算法及可微分仿真集成,从而在不牺牲性能的前提下实现受保护的训练过程。

链接: https://arxiv.org/abs/2506.01665
作者: Tim Walter,Hannah Markgraf,Jonathan Külz,Matthias Althoff
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:Deploying autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research which aims to provide such guarantees using safeguards. These safeguards should be integrated during training to prevent a large sim-to-real gap. While there are several approaches for safeguarding sampling-based reinforcement learning, analytic gradient-based reinforcement learning often achieves superior performance and sample efficiency. However, there is no safeguarding approach for this learning paradigm yet. Our work addresses this gap by developing the first effective safeguard for analytic gradient-based reinforcement learning. We analyse existing, differentiable safeguards, adapt them through modified mappings and gradient formulations, and integrate them with a state-of-the-art learning algorithm and a differentiable simulation. We evaluate how different safeguards affect policy optimisation using numerical experiments on two classical control tasks. The results demonstrate safeguarded training without compromising performance.
zh

[AI-25] Explainable AI Systems Must Be Contestable: Heres How to Make It Happen

【速读】:该论文试图解决可争议性(contestability)在可解释人工智能(Explainable AI, XAI)中缺乏明确定义和有效实现方法的问题。当前,尽管可争议性被视作系统安全的重要保障,但其未形成正式定义,也无算法可保证,并且实践者缺乏满足监管要求的具体指导。论文的关键解决方案是提出一个基于系统文献综述的严格形式化定义,并构建了一个模块化框架,涵盖设计阶段和事后机制,涉及人机界面、技术架构、法律流程和组织工作流,同时引入了可争议性评估量表作为操作化工具,以推动AI系统的实际改进与问责机制的嵌入。

链接: https://arxiv.org/abs/2506.01662
作者: Catarina Moreira,Anna Palatkina,Dacia Braca,Dylan M. Walsh,Peter J. Leihn,Fang Chen,Nina C. Hubig
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As AI regulations around the world intensify their focus on system safety, contestability has become a mandatory, yet ill-defined, safeguard. In XAI, “contestability” remains an empty promise: no formal definition exists, no algorithm guarantees it, and practitioners lack concrete guidance to satisfy regulatory requirements. Grounded in a systematic literature review, this paper presents the first rigorous formal definition of contestability in explainable AI, directly aligned with stakeholder requirements and regulatory mandates. We introduce a modular framework of by-design and post-hoc mechanisms spanning human-centered interfaces, technical architectures, legal processes, and organizational workflows. To operationalize our framework, we propose the Contestability Assessment Scale, a composite metric built on more than twenty quantitative criteria. Through multiple case studies across diverse application domains, we reveal where state-of-the-art systems fall short and show how our framework drives targeted improvements. By converting contestability from regulatory theory into a practical framework, our work equips practitioners with the tools to embed genuine recourse and accountability into AI systems.
zh

[AI-26] Engram Memory Encoding and Retrieval: A Neurocomputational Perspective

【速读】:该论文试图解决记忆编码、存储与提取的精确机制问题,特别是在神经元层面如何形成并维持长期记忆的难题。其解决方案的关键在于构建一个整合生物学发现与机制模型的计算框架,通过结合细胞神经科学和计算建模的方法,探讨如何识别和操控记忆痕迹(engram)神经元、突触可塑性如何促进稳定的记忆痕迹,以及稀疏性如何促进高效且抗干扰的表征。关键策略包括采用稀疏正则化、记忆痕迹门控以及受生物学启发的架构如稀疏分布式记忆和脉冲神经网络等计算方法,以揭示突触可塑性和稀疏性约束相互作用如何促成记忆效率、容量和稳定性。

链接: https://arxiv.org/abs/2506.01659
作者: Daniel Szelogowski
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 18 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Despite substantial research into the biological basis of memory, the precise mechanisms by which experiences are encoded, stored, and retrieved in the brain remain incompletely understood. A growing body of evidence supports the engram theory, which posits that sparse populations of neurons undergo lasting physical and biochemical changes to support long-term memory. Yet, a comprehensive computational framework that integrates biological findings with mechanistic models remains elusive. This work synthesizes insights from cellular neuroscience and computational modeling to address key challenges in engram research: how engram neurons are identified and manipulated; how synaptic plasticity mechanisms contribute to stable memory traces; and how sparsity promotes efficient, interference-resistant representations. Relevant computational approaches – such as sparse regularization, engram gating, and biologically inspired architectures like Sparse Distributed Memory and spiking neural networks – are also examined. Together, these findings suggest that memory efficiency, capacity, and stability emerge from the interaction of plasticity and sparsity constraints. By integrating neurobiological and computational perspectives, this paper provides a comprehensive theoretical foundation for engram research and proposes a roadmap for future inquiry into the mechanisms underlying memory, with implications for the diagnosis and treatment of memory-related disorders.
zh

[AI-27] Bidirectional Soft Actor-Critic: Leverag ing Forward and Reverse KL Divergence for Efficient Reinforcement Learning

【速读】:该论文试图解决传统Soft Actor-Critic (SAC)算法在策略更新中依赖反向Kullback-Leibler (KL)散度导致的最优投影策略难以求解、梯度近似不稳定及样本效率低的问题。解决方案的关键在于引入前向KL散度,对于高斯策略而言,前向KL散度能够产生显式的最优投影策略,即对应于目标Boltzmann分布动作边缘的均值和方差。基于两种KL散度的优势,作者提出了双向SAC(Bidirectional SAC),首先利用前向KL投影初始化策略,再通过优化反向KL散度进行精调,从而显著提升了算法性能。

链接: https://arxiv.org/abs/2506.01639
作者: Yixian Zhang,Huaze Tang,Changxu Wei,Wenbo Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Soft Actor-Critic (SAC) algorithm, a state-of-the-art method in maximum entropy reinforcement learning, traditionally relies on minimizing reverse Kullback-Leibler (KL) divergence for policy updates. However, this approach leads to an intractable optimal projection policy, necessitating gradient-based approximations that can suffer from instability and poor sample efficiency. This paper investigates the alternative use of forward KL divergence within SAC. We demonstrate that for Gaussian policies, forward KL divergence yields an explicit optimal projection policy – corresponding to the mean and variance of the target Boltzmann distribution’s action marginals. Building on the distinct advantages of both KL directions, we propose Bidirectional SAC, an algorithm that first initializes the policy using the explicit forward KL projection and then refines it by optimizing the reverse KL divergence. Comprehensive experiments on continuous control benchmarks show that Bidirectional SAC significantly outperforms standard SAC and other baselines, achieving up to a 30% increase in episodic rewards, alongside enhanced sample efficiency.
zh

[AI-28] Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在软件工程中因微调、合并和再分发导致的未经授权的模型衍生问题,特别是在模型血缘追踪和许可协议执行方面缺乏有效机制的问题。解决方案的关键在于将LLMs视为需要溯源跟踪的软件构件,并提出了一种基于梯度的指纹框架TensorGuard,通过分析随机输入扰动在张量层上的梯度响应来提取模型内在的行为特征,从而实现模型相似性检测和家族分类。该方法独立于训练数据、水印或特定模型格式,能够支持广泛采用的safetensors格式,并通过统计分析构建高维指纹,进而实现模型间的直接相似性评估与系统性家族分类。

链接: https://arxiv.org/abs/2506.01631
作者: Zehao Wu,Yanjie Zhao,Haoyu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) become integral software components in modern applications, unauthorized model derivations through fine-tuning, merging, and redistribution have emerged as critical software engineering challenges. Unlike traditional software where clone detection and license compliance are well-established, the LLM ecosystem lacks effective mechanisms to detect model lineage and enforce licensing agreements. This gap is particularly problematic when open-source model creators, such as Meta’s LLaMA, require derivative works to maintain naming conventions for attribution, yet no technical means exist to verify compliance. To fill this gap, treating LLMs as software artifacts requiring provenance tracking, we present TensorGuard, a gradient-based fingerprinting framework for LLM similarity detection and family classification. Our approach extracts model-intrinsic behavioral signatures by analyzing gradient responses to random input perturbations across tensor layers, operating independently of training data, watermarks, or specific model formats. TensorGuard supports the widely-adopted safetensors format and constructs high-dimensional fingerprints through statistical analysis of gradient features. These fingerprints enable two complementary capabilities: direct pairwise similarity assessment between arbitrary models through distance computation, and systematic family classification of unknown models via the K-Means clustering algorithm with domain-informed centroid initialization using known base models. Experimental evaluation on 58 models comprising 8 base models and 50 derivatives across five model families (Llama, Qwen, Gemma, Phi, Mistral) demonstrates 94% classification accuracy under our centroid-initialized K-Means clustering. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2506.01631 [cs.LG] (or arXiv:2506.01631v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.01631 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-29] Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks

【速读】:该论文试图解决在存在未知且可能变化的对抗性扰动情况下,高斯过程(Gaussian Process, GP)优化的问题。传统鲁棒优化方法通常关注最坏情况下的性能最大化,而本文提出了一种鲁棒满足目标(robust satisficing objective),其核心是确保在对抗条件下能够稳定达到预设的性能阈值τ。解决方案的关键在于提出两种基于不同鲁棒满足形式的新算法,并证明它们属于一个通用的鲁棒满足框架,同时根据对抗者的性质提供不同的性能保证。

链接: https://arxiv.org/abs/2506.01625
作者: Artun Saday,Yaşar Cahit Yıldırım,Cem Tekin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We address the problem of Gaussian Process (GP) optimization in the presence of unknown and potentially varying adversarial perturbations. Unlike traditional robust optimization approaches that focus on maximizing performance under worst-case scenarios, we consider a robust satisficing objective, where the goal is to consistently achieve a predefined performance threshold \tau , even under adversarial conditions. We propose two novel algorithms based on distinct formulations of robust satisficing, and show that they are instances of a general robust satisficing framework. Further, each algorithm offers different guarantees depending on the nature of the adversary. Specifically, we derive two regret bounds: one that is sublinear over time, assuming certain conditions on the adversary and the satisficing threshold \tau , and another that scales with the perturbation magnitude but requires no assumptions on the adversary. Through extensive experiments, we demonstrate that our approach outperforms the established robust optimization methods in achieving the satisficing objective, particularly when the ambiguity set of the robust optimization framework is inaccurately specified.
zh

[AI-30] Social Cooperation in Conversational AI Agents

【速读】:该论文试图解决基于大型开放领域语言模型(Large, Open-Domain Language Models, LLMs)的AI代理在长期交互中表现不佳的问题,尤其是在用户反复纠正代理错误的情况下,模型难以泛化。解决方案的关键在于显式建模人类的社会智能,即人类与行为不可预测的其他代理建立和维持长期关系的能力。通过数学建模人类在长时间内用于交流和推理的策略,可以推导出新的博弈论目标,从而优化LLMs和未来AI代理的表现。

链接: https://arxiv.org/abs/2506.01624
作者: Mustafa Mert Çelikok,Saptarashmi Bandyopadhyay,Robert Loftin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages, RLDM 2025 abstract (Spotlight presentation)

点击查看摘要

Abstract:The development of AI agents based on large, open-domain language models (LLMs) has paved the way for the development of general-purpose AI assistants that can support human in tasks such as writing, coding, graphic design, and scientific research. A major challenge with such agents is that, by necessity, they are trained by observing relatively short-term interactions with humans. Such models can fail to generalize to long-term interactions, for example, interactions where a user has repeatedly corrected mistakes on the part of the agent. In this work, we argue that these challenges can be overcome by explicitly modeling humans’ social intelligence, that is, their ability to build and maintain long-term relationships with other agents whose behavior cannot always be predicted. By mathematically modeling the strategies humans use to communicate and reason about one another over long periods of time, we may be able to derive new game theoretic objectives against which LLMs and future AI agents may be optimized.
zh

[AI-31] MAGIK: Mapping to Analogous Goals via Imagination-enabled Knowledge Transfer

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)代理在面对结构相似但未见过的新任务时,需要大量重新训练的问题。其解决方案的关键在于提出MAGIK框架,该框架通过想象机制将目标任务中的实体映射到源领域中的对应实体,从而允许代理复用其原始策略,实现无需与目标环境交互的知识迁移。

链接: https://arxiv.org/abs/2506.01623
作者: Ajsal Shereef Palattuparambil,Thommen George Karimpanal,Santu Rana
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humans excel at analogical reasoning - applying knowledge from one task to a related one with minimal relearning. In contrast, reinforcement learning (RL) agents typically require extensive retraining even when new tasks share structural similarities with previously learned ones. In this work, we propose MAGIK, a novel framework that enables RL agents to transfer knowledge to analogous tasks without interacting with the target environment. Our approach leverages an imagination mechanism to map entities in the target task to their analogues in the source domain, allowing the agent to reuse its original policy. Experiments on custom MiniGrid and MuJoCo tasks show that MAGIK achieves effective zero-shot transfer using only a small number of human-labelled examples. We compare our approach to related baselines and highlight how it offers a novel and effective mechanism for knowledge transfer via imagination-based analogy mapping.
zh

[AI-32] General agents need world models ICML2025

【速读】:该论文试图解决的问题是:世界模型(world models)是否是实现灵活、目标导向行为的必要条件,还是无模型学习(model-free learning)就足以完成此类任务。论文的解决方案之关键在于证明了任何能够泛化到多步目标导向任务的智能体必须已经学习到了其环境的预测模型,并且该模型可以从智能体的策略中提取出来。研究进一步表明,提升智能体的性能或其能够完成的目标复杂度,依赖于学习更加精确的世界模型。

链接: https://arxiv.org/abs/2506.01622
作者: Jonathan Richens,David Abel,Alexis Bellot,Tom Everitt
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
备注: Accepted ICML 2025

点击查看摘要

Abstract:Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent’s policy, and that increasing the agents performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.
zh

[AI-33] MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

【速读】:该论文旨在解决多模态大语言模型基础代理(MLA)在交互场景中所面临的可信性挑战,这些挑战超越了传统语言模型的局限,因其能够直接修改数字状态并引发不可逆的真实世界后果。解决方案的关键在于提出MLA-Trust框架,这是首个全面且统一的评估体系,从真实性、可控性、安全性和隐私性四个原则性维度对MLA的可信性进行评估,并通过设计高风险交互任务和构建丰富的评估数据集进行验证,同时揭示了多模态交互场景下MLA特有的可信性漏洞。

链接: https://arxiv.org/abs/2506.01616
作者: Xiao Yang,Jiawei Chen,Jun Luo,Zhengwei Fang,Yinpeng Dong,Hang Su,Jun Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond traditional language models’ limitations, as they can directly modify digital states and trigger irreversible real-world consequences. Existing benchmarks inadequately tackle these unique challenges posed by MLAs’ actionable outputs, long-horizon uncertainty and multimodal attack vectors. In this paper, we introduce MLA-Trust, the first comprehensive and unified framework that evaluates the MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety and privacy. We utilize websites and mobile applications as realistic testbeds, designing 34 high-risk interactive tasks and curating rich evaluation datasets. Large-scale experiments involving 13 state-of-the-art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. For instance, proprietary and open-source GUI-interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high-stakes domains; the transition from static MLLMs into interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi-step interactions that standalone MLLMs would typically prevent; multi-step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, circumventing existing safeguards and resulting in unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments.
zh

[AI-34] Contrastive Learning for Efficient Transaction Validation in UTXO-based Blockchains

【速读】:该论文旨在解决基于UTXO(Unspent Transaction Output)的区块链在扩展性方面的问题,特别是现有UTXO集合分片方法在验证者之间有效分布UTXO时存在的困难,这种困难导致由于父-子交易依赖关系而产生显著的通信开销,从而严重影响交易处理速度。论文提出的解决方案的关键在于利用机器学习(Machine Learning, ML)优化UTXO集合的分片以及入账交易的路由,通过结合对比学习和无监督学习构建交易输出的嵌入空间,使模型能够根据支出关系对交易输出进行分组,从而高效地将交易路由至包含其父UTXO的分片。该方法通过在历史交易数据上使用三元组损失和在线半难负样本挖掘进行训练,将父-子支出模式直接嵌入模型参数中,从而消除了实时查找父交易的高成本需求,显著降低了跨分片通信开销,提升了吞吐量和可扩展性。

链接: https://arxiv.org/abs/2506.01614
作者: Hamid Attar,Luigi Lunardon,Alessio Pagani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures, 3 tables

点击查看摘要

Abstract:This paper introduces a Machine Learning (ML) approach for scalability of UTXO-based blockchains, such as Bitcoin. Prior approaches to UTXO set sharding struggle with distributing UTXOs effectively across validators, creating substantial communication overhead due to child-parent transaction dependencies. This overhead, which arises from the need to locate parent UTXOs, significantly hampers transaction processing speeds. Our solution uses ML to optimize not only UTXO set sharding but also the routing of incoming transactions, ensuring that transactions are directed to shards containing their parent UTXOs. At the heart of our approach is a framework that combines contrastive and unsupervised learning to create an embedding space for transaction outputs. This embedding allows the model to group transaction outputs based on spending relationships, making it possible to route transactions efficiently to the correct validation microservices. Trained on historical transaction data with triplet loss and online semi-hard negative mining, the model embeds parent-child spending patterns directly into its parameters, thus eliminating the need for costly, real-time parent transaction lookups. This significantly reduces cross-shard communication overhead, boosting throughput and scalability.
zh

[AI-35] Policy Newton Algorithm in Reproducing Kernel Hilbert Space

【速读】:该论文试图解决在再生核希尔伯特空间(Reproducing Kernel Hilbert Space, RKHS)中表示的强化学习(Reinforcement Learning, RL)策略优化问题,特别是当前方法受限于一阶优化技术而无法有效利用二阶优化方法的问题。解决方案的关键在于提出一种名为Policy Newton in RKHS的二阶优化框架,该框架通过优化一个带有三次正则化的辅助目标函数来避免直接计算无限维Hessian算子的逆,同时利用Representer定理将无限维优化问题转化为可计算的有限维问题,从而实现了对RKHS中RL策略的有效二阶优化。

链接: https://arxiv.org/abs/2506.01597
作者: Yixian Zhang,Huaze Tang,Chao Wang,Wenbo Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) policies represented in Reproducing Kernel Hilbert Spaces (RKHS) offer powerful representational capabilities. While second-order optimization methods like Newton’s method demonstrate faster convergence than first-order approaches, current RKHS-based policy optimization remains constrained to first-order techniques. This limitation stems primarily from the intractability of explicitly computing and inverting the infinite-dimensional Hessian operator in RKHS. We introduce Policy Newton in RKHS, the first second-order optimization framework specifically designed for RL policies represented in RKHS. Our approach circumvents direct computation of the inverse Hessian operator by optimizing a cubic regularized auxiliary objective function. Crucially, we leverage the Representer Theorem to transform this infinite-dimensional optimization into an equivalent, computationally tractable finite-dimensional problem whose dimensionality scales with the trajectory data volume. We establish theoretical guarantees proving convergence to a local optimum with a local quadratic convergence rate. Empirical evaluations on a toy financial asset allocation problem validate these theoretical properties, while experiments on standard RL benchmarks demonstrate that Policy Newton in RKHS achieves superior convergence speed and higher episodic rewards compared to established first-order RKHS approaches and parametric second-order methods. Our work bridges a critical gap between non-parametric policy representations and second-order optimization methods in reinforcement learning.
zh

[AI-36] Understanding and Improving Laplacian Positional Encodings For Temporal GNNs ECML-PKDD2025

【速读】:该论文旨在解决时间图(temporal graph)中位置编码(positional encoding)研究进展有限的问题,特别是针对通过超拉普拉斯矩阵(supra-Laplacian)扩展静态拉普拉斯特征向量方法所面临的高计算开销、理论理解不足以及应用时机和方式不明确等挑战。其解决方案的关键在于:(1)构建一个理论框架,将超拉普拉斯编码与按时间片的编码联系起来,突出利用额外时间连通性的优势;(2)提出新的方法以降低计算开销,实现高达56倍的运行速度提升,并支持包含50,000个活跃节点的图;(3)通过广泛的实验研究确定哪些模型、任务和数据集最受益于这些编码。

链接: https://arxiv.org/abs/2506.01596
作者: Yaniv Galron,Fabrizio Frasca,Haggai Maron,Eran Treister,Moshe Eliasof
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ECML-PKDD 2025

点击查看摘要

Abstract:Temporal graph learning has applications in recommendation systems, traffic forecasting, and social network analysis. Although multiple architectures have been introduced, progress in positional encoding for temporal graphs remains limited. Extending static Laplacian eigenvector approaches to temporal graphs through the supra-Laplacian has shown promise, but also poses key challenges: high eigendecomposition costs, limited theoretical understanding, and ambiguity about when and how to apply these encodings. In this paper, we address these issues by (1) offering a theoretical framework that connects supra-Laplacian encodings to per-time-slice encodings, highlighting the benefits of leveraging additional temporal connectivity, (2) introducing novel methods to reduce the computational overhead, achieving up to 56x faster runtimes while scaling to graphs with 50,000 active nodes, and (3) conducting an extensive experimental study to identify which models, tasks, and datasets benefit most from these encodings. Our findings reveal that while positional encodings can significantly boost performance in certain scenarios, their effectiveness varies across different models.
zh

[AI-37] VirnyFlow: A Design Space for Responsible Model Development

【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型开发中对现实问题多目标特性的理解不足以及传统AutoML框架在定制化优化和实际约束适应性方面的局限性。其解决方案的关键在于提出VirnyFlow,这是一个面向负责任模型开发的设计空间,通过集成评估协议定义、多目标贝叶斯优化、成本感知多臂老虎机、查询优化和分布式并行计算,使数据科学家能够定义自定义优化标准,跨管道阶段进行全面实验,并在符合现实约束的前提下迭代优化模型。

链接: https://arxiv.org/abs/2506.01584
作者: Denys Herasymuk,Nazar Protsiv,Julia Stoyanovich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Developing machine learning (ML) models requires a deep understanding of real-world problems, which are inherently multi-objective. In this paper, we present VirnyFlow, the first design space for responsible model development, designed to assist data scientists in building ML pipelines that are tailored to the specific context of their problem. Unlike conventional AutoML frameworks, VirnyFlow enables users to define customized optimization criteria, perform comprehensive experimentation across pipeline stages, and iteratively refine models in alignment with real-world constraints. Our system integrates evaluation protocol definition, multi-objective Bayesian optimization, cost-aware multi-armed bandits, query optimization, and distributed parallelism into a unified architecture. We show that VirnyFlow significantly outperforms state-of-the-art AutoML systems in both optimization quality and scalability across five real-world benchmarks, offering a flexible, efficient, and responsible alternative to black-box automation in ML development.
zh

[AI-38] FlexiSAGA: A Flexible Systolic Array GEMM Accelerator for Sparse and Dense Processing

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)推理过程中计算复杂度高导致在资源受限的边缘设备上处理效率低的问题。其解决方案的关键在于利用DNN运算权重中的稀疏性,并通过设计一种架构可配置且数据流灵活的AI硬件加速器FlexiSAGA,支持多种稀疏与密集的数据流,从而高效处理通用矩阵乘法(GEMMs)。此外,论文还提出了一种针对FlexiSAGA架构定制的DNN剪枝方法,以实现密集和稀疏卷积及全连接运算的近最优处理,推动了DNN与硬件的协同设计流程。

链接: https://arxiv.org/abs/2506.01566
作者: Mika Markus Müller,Konstantin Lübeck,Alexander Louis-Ferdinand Jung,Jannik Steinmetz,Oliver Bringmann
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注: Accepted Version for: SAMOS XXV

点击查看摘要

Abstract:Artificial Intelligence (AI) algorithms, such as Deep Neural Networks (DNNs), have become an important tool for a wide range of applications, from computer vision to natural language processing. However, the computational complexity of DNN inference poses a significant challenge, particularly for processing on resource-constrained edge devices. One promising approach to address this challenge is the exploitation of sparsity in DNN operator weights. In this work, we present FlexiSAGA, an architecturally configurable and dataflow-flexible AI hardware accelerator for the sparse and dense processing of general matrix multiplications (GEMMs). FlexiSAGA supports seven different sparse and dense dataflows, enabling efficient processing of resource intensive DNN operators. Additionally, we propose a DNN pruning method specifically tailored towards the FlexiSAGA architecture, allowing for near-optimal processing of dense and sparse convolution and fully-connected operators, facilitating a DNN/HW co-design flow. Our results show a whole DNN sparse-over-dense inference speedup ranging from 1.41 up to 4.28, outperforming commercial and literature-reported accelerator platforms. Comments: Accepted Version for: SAMOS XXV Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG) Cite as: arXiv:2506.01566 [cs.PF] (or arXiv:2506.01566v1 [cs.PF] for this version) https://doi.org/10.48550/arXiv.2506.01566 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-39] LAMARL: LLM -Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation

【速读】:该论文试图解决多智能体强化学习(MARL)在复杂多机器人任务中面临的数据样本效率低以及需要迭代手动奖励调优的问题。其解决方案的关键在于提出一种基于大语言模型(LLM)的MARL方法(LAMARL),该方法通过将LLM与MARL相结合,实现了先验策略和奖励函数的自动化生成,从而显著提升了样本效率并减少了人工干预。

链接: https://arxiv.org/abs/2506.01538
作者: Guobin Zhu,Rui Zhou,Wenkang Ji,Shiyu Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Robotics and Automation Letters

点击查看摘要

Abstract:Although Multi-Agent Reinforcement Learning (MARL) is effective for complex multi-robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored. This paper introduces a novel LLM-Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design. LAMARL consists of two modules: the first module leverages LLMs to fully automate the generation of prior policy and reward functions. The second module is MARL, which uses the generated functions to guide robot policy training effectively. On a shape assembly benchmark, both simulation and real-world experiments demonstrate the unique advantages of LAMARL. Ablation studies show that the prior policy improves sample efficiency by an average of 185.9% and enhances task completion, while structured prompts based on Chain-of-Thought (CoT) and basic APIs improve LLM output success rates by 28.5%-67.5%. Videos and code are available at this https URL
zh

[AI-40] A Diffusion-Based Method for Learning the Multi-Outcome Distribution of Medical Treatments KDD2025

【速读】:该论文试图解决医疗治疗中多维结果(multi-dimensional treatment outcomes)的联合分布学习问题,传统机器学习方法大多专注于单结局预测,而实际医疗数据通常包含多个相互依赖的结局。解决方案的关键在于提出一种基于扩散的方法DIME(Diffusion-based Method for Estimating multi-outcome distributions),该方法能够学习多个医疗结局的联合干预分布,明确捕捉结局间的依赖结构,并处理混合类型的结果(如二分类、多分类和连续变量)。DIME通过因果掩码机制考虑因果推断的基本问题,并在训练中利用定制化的条件掩码分解联合分布,在推理中通过自回归生成实现对联合干预分布的学习,从而超越传统的点估计方法。

链接: https://arxiv.org/abs/2506.01533
作者: Yuchen Ma,Jonas Schweisthal,Hengrui Zhang,Stefan Feuerriegel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at KDD 2025

点击查看摘要

Abstract:In medicine, treatments often influence multiple, interdependent outcomes, such as primary endpoints, complications, adverse events, or other secondary endpoints. Hence, to make optimal treatment decisions, clinicians are interested in learning the distribution of multi-dimensional treatment outcomes. However, the vast majority of machine learning methods for predicting treatment effects focus on single-outcome settings, despite the fact that medical data often include multiple, interdependent outcomes. To address this limitation, we propose a novel diffusion-based method called DIME to learn the joint distribution of multiple outcomes of medical treatments. We addresses three challenges relevant in medical practice: (i)it is tailored to learn the joint interventional distribution of multiple medical outcomes, which enables reliable decision-making with uncertainty quantification rather than relying solely on point estimates; (ii)it explicitly captures the dependence structure between outcomes; (iii)it can handle outcomes of mixed type, including binary, categorical, and continuous variables. In DIME, we take into account the fundamental problem of causal inference through causal masking. For training, our method decomposes the joint distribution into a series of conditional distributions with a customized conditional masking to account for the dependence structure across outcomes. For inference, our method auto-regressively generates predictions. This allows our method to move beyond point estimates of causal quantities and thus learn the joint interventional distribution. To the best of our knowledge, DIME is the first neural method tailored to learn the joint, multi-outcome distribution of medical treatments. Across various experiments, we demonstrate that our method effectively learns the joint distribution and captures shared information among multiple outcomes.
zh

[AI-41] Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme

【速读】:该论文试图解决从离散时间点的样本快照中恢复粒子演化的潜在过程的问题,即学习种群动力学。其解决方案的关键在于引入 \textit{JKOnet},该方法将Jko框架与逆优化技术相结合,通过传统的端到端对抗训练过程实现种群动力学的学习,而无需依赖如输入凸神经网络等限制性架构选择。

链接: https://arxiv.org/abs/2506.01502
作者: Mikhail Persiianov,Jiawei Chen,Petr Mokrov,Alexander Tyurin,Evgeny Burnaev,Alexander Korotin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce \textttiJKOnet , an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional \textitend-to-end adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods.
zh

[AI-42] Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

【速读】:该论文试图解决传统自动舞台灯光控制(Automatic Stage Lighting Control, ASLC)方法在音乐分类和灯光模式映射上的局限性,这些方法通常仅能将音乐划分为有限类别并映射到预定义的灯光模式,导致结果缺乏合理性与多样性。解决方案的关键在于提出一种端到端的生成式方法——Skip-BART,该方法将ASLC建模为一个生成任务而非单纯的分类问题,通过修改BART模型,使其能够以音频音乐为输入,生成灯光色相和明度(强度)作为输出,并引入一种新颖的跳跃连接机制以增强音乐与灯光之间的关系。

链接: https://arxiv.org/abs/2506.01482
作者: Zijian Zhao,Dian Jin,Zijing Zhou,Xiaoyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Stage lighting plays an essential role in live music performances, influencing the engaging experience of both musicians and audiences. Given the high costs associated with hiring or training professional lighting engineers, Automatic Stage Lighting Control (ASLC) has gained increasing attention. However, most existing approaches only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this issue, this paper presents an end-to-end solution that directly learns from experienced lighting engineers – Skip-BART. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method modifies the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame this http URL validate our method through both quantitative analysis and an human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting this http URL, our method yields a p-value of 0.72 in a statistical comparison based on human evaluations with human lighting engineers, suggesting that the proposed approach closely matches human lighting engineering performance. To support further research, we have made our self-collected dataset, code, and trained model parameters available at this https URL .
zh

[AI-43] Agent ic AI and Multiagent ic: Are We Reinventing the Wheel?

【速读】:该论文试图解决当前在生成式AI领域中,“Agentic AI”和“Multiagentic AI”这两个术语被误用的问题,这种误用混淆了新兴概念与人工智能文献中已有的成熟概念——即智能代理(intelligent agents)和多代理系统(multi-agent systems)。论文的关键解决方案是倡导使用经过验证的术语和理论框架,强调应借鉴长期以来在自主代理和多代理系统领域的研究成果,包括代理架构、通信语言、协调与合作算法以及协议技术等,以确保在基于大语言模型(LLM)的AI代理发展中保持科学严谨性,避免重复劳动。

链接: https://arxiv.org/abs/2506.01463
作者: V.Botti
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The terms Agentic AI and Multiagentic AI have recently gained popularity in discussions on generative artificial intelligence, often used to describe autonomous software agents and systems composed of such agents. However, the use of these terms confuses these buzzwords with well-established concepts in AI literature: intelligent agents and multi-agent systems. This article offers a critical analysis of this conceptual misuse. We review the theoretical origins of “agentic” in the social sciences (Bandura, 1986) and philosophical notions of intentionality (Dennett, 1971), and then summarise foundational works on intelligent agents and multi-agent systems by Wooldridge, Jennings and others. We examine classic agent architectures, from simple reactive agents to Belief-Desire-Intention (BDI) models, and highlight key properties (autonomy, reactivity, proactivity, social capability) that define agency in AI. We then discuss recent developments in large language models (LLMs) and agent platforms based on LLMs, including the emergence of LLM-powered AI agents and open-source multi-agent orchestration frameworks. We argue that the term AI Agentic is often used as a buzzword for what are essentially AI agents, and AI Multiagentic for what are multi-agent systems. This confusion overlooks decades of research in the field of autonomous agents and multi-agent systems. The article advocates for scientific and technological rigour and the use of established terminology from the state of the art in AI, incorporating the wealth of existing knowledge, including standards for multi-agent system platforms, communication languages and coordination and cooperation algorithms, agreement technologies (automated negotiation, argumentation, virtual organisations, trust, reputation, etc.), into the new and promising wave of LLM-based AI agents, so as not to end up reinventing the wheel.
zh

[AI-44] ShaTS: A Shapley-based Explainability Method for Time Series Artificial Intelligence Models applied to Anomaly Detection in Industrial Internet of Things

【速读】:该论文旨在解决工业物联网环境中异常检测与解释技术在处理时间序列数据时存在的解释不精确或不可操作的问题。传统解释方法往往忽视数据的时序结构,导致解释效果不佳。解决方案的关键在于提出ShaTS(Shapley values for Time Series models),这是一种模型无关的可解释人工智能方法,通过引入先验特征分组策略来保留时间依赖性,从而生成连贯且可操作的解释,提升了时间序列模型的解释精度和资源效率。

链接: https://arxiv.org/abs/2506.01450
作者: Manuel Franco de la Peña(1),Ángel Luis Perales Gómez(1),Lorenzo Fernández Maimó(1) ((1) Departamento de Ingeniería y Tecnología de Computadores, University of Murcia, Spain, Murcia)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages;16 figures;Submitted to Elsevier (Information Fusion)

点击查看摘要

Abstract:Industrial Internet of Things environments increasingly rely on advanced Anomaly Detection and explanation techniques to rapidly detect and mitigate cyberincidents, thereby ensuring operational safety. The sequential nature of data collected from these environments has enabled improvements in Anomaly Detection using Machine Learning and Deep Learning models by processing time windows rather than treating the data as tabular. However, conventional explanation methods often neglect this temporal structure, leading to imprecise or less actionable explanations. This work presents ShaTS (Shapley values for Time Series models), which is a model-agnostic explainable Artificial Intelligence method designed to enhance the precision of Shapley value explanations for time series models. ShaTS addresses the shortcomings of traditional approaches by incorporating an a priori feature grouping strategy that preserves temporal dependencies and produces both coherent and actionable insights. Experiments conducted on the SWaT dataset demonstrate that ShaTS accurately identifies critical time instants, precisely pinpoints the sensors, actuators, and processes affected by anomalies, and outperforms SHAP in terms of both explainability and resource efficiency, fulfilling the real-time requirements of industrial environments.
zh

[AI-45] Agent ic Episodic Control

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在数据效率低和泛化能力差方面的局限性。其解决方案的关键在于提出一种名为代理情景控制(Agentic Episodic Control, AEC)的架构,该架构通过整合大语言模型(Large Language Models, LLMs)与RL,增强决策能力。AEC利用LLM将观察映射为语言基础的嵌入,并将其存储于情景记忆中以快速检索高价值经验;同时,采用世界图工作记忆模块捕捉结构化的环境动态,提升关系推理能力;此外,轻量级关键状态检测器动态协调情景记忆回溯与世界模型引导的探索。通过结合试错学习与LLM生成的语义先验,AEC显著提升了RL的数据效率和泛化能力。

链接: https://arxiv.org/abs/2506.01442
作者: Xidong Yang,Wenhao Li,Junjie Sheng,Chuyun Shen,Yun Hua,Xiangfeng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has driven breakthroughs in AI, from game-play to scientific discovery and AI alignment. However, its broader applicability remains limited by challenges such as low data efficiency and poor generalizability. Recent advances suggest that large language models, with their rich world knowledge and reasoning capabilities, could complement RL by enabling semantic state modeling and task-agnostic planning. In this work, we propose the Agentic Episodic Control (AEC), a novel architecture that integrates RL with LLMs to enhance decision-making. The AEC can leverage a large language model (LLM) to map the observations into language-grounded embeddings, which further can be stored in an episodic memory for rapid retrieval of high-value experiences. Simultaneously, a World-Graph working memory module is utilized to capture structured environmental dynamics in order to enhance relational reasoning. Furthermore, a lightweight critical state detector dynamically arbitrates between the episodic memory recall and the world-model-guided exploration. On the whole, by combining the trial-and-error learning scheme with LLM-derived semantic priors, the proposed AEC can improve both data efficiency and generalizability in reinforcement learning. In experiments on BabyAI-Text benchmark tasks, AEC demonstrates substantial improvements over existing baselines, especially on complex and generalization tasks like FindObj, where it outperforms the best baseline by up to 76%. The proposed AEC framework bridges the strengths of numeric reinforcement learning and symbolic reasoning, which provides a pathway toward more adaptable and sample-efficient agents.
zh

[AI-46] Distinguishing Autonomous AI Agents from Collaborative Agent ic Systems: A Comprehensive Framework for Understanding Modern Intelligent Architectures

【速读】:该论文试图解决如何区分独立AI代理(AI Agents)与协作式智能体生态系统(Agentic AI ecosystems)的问题,通过系统分析其操作原理、结构组成和部署方法建立明确的框架。解决方案的关键在于提出一种全面的架构比较,涵盖规划机制、记忆系统、协调协议和决策过程,并针对单智能体与多智能体应用场景进行对比分析,同时识别可靠性、协调复杂性和可扩展性等关键挑战,并提出基于增强推理框架、稳健记忆架构和改进协调机制的创新解决方案。

链接: https://arxiv.org/abs/2506.01438
作者: Prashik Buddhaghosh Bansod
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of large language models has catalyzed two distinct yet interconnected paradigms in artificial intelligence: standalone AI Agents and collaborative Agentic AI ecosystems. This comprehensive study establishes a definitive framework for distinguishing these architectures through systematic analysis of their operational principles, structural compositions, and deployment methodologies. We characterize AI Agents as specialized, tool-enhanced systems leveraging foundation models for targeted automation within constrained environments. Conversely, Agentic AI represents sophisticated multi-entity frameworks where distributed agents exhibit emergent collective intelligence through coordinated interaction protocols. Our investigation traces the evolutionary trajectory from traditional rule-based systems through generative AI foundations to contemporary agent architectures. We present detailed architectural comparisons examining planning mechanisms, memory systems, coordination protocols, and decision-making processes. The study categorizes application landscapes, contrasting single-agent implementations in customer service and content management with multi-agent deployments in research automation and complex decision support. We identify critical challenges including reliability issues, coordination complexities, and scalability constraints, while proposing innovative solutions through enhanced reasoning frameworks, robust memory architectures, and improved coordination mechanisms. This framework provides essential guidance for practitioners selecting appropriate agentic approaches and establishes foundational principles for next-generation intelligent system development.
zh

[AI-47] FinRobot: Generative Business Process AI Agents for Enterprise Resource Planning in Finance

【速读】:该论文旨在解决传统企业资源计划(Enterprise Resource Planning, ERP)系统在面对日益复杂和数据密集型业务操作时,因依赖静态规则工作流而导致的适应性差、可扩展性不足及智能化水平低的问题。其解决方案的关键在于提出了一种基于人工智能原生的、代理驱动的框架,即生成式业务流程人工智能代理(Generative Business Process AI Agents, GBPAs),通过将生成式AI与业务流程建模及多代理编排相结合,实现企业工作流的自主性、推理能力和动态优化,从而支持端到端的复杂任务自动化。

链接: https://arxiv.org/abs/2506.01423
作者: Hongyang Yang,Likun Lin,Yang She,Xinyu Liao,Jiaoyang Wang,Runjia Zhang,Yuquan Mo,Christina Dan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:Enterprise Resource Planning (ERP) systems serve as the digital backbone of modern financial institutions, yet they continue to rely on static, rule-based workflows that limit adaptability, scalability, and intelligence. As business operations grow more complex and data-rich, conventional ERP platforms struggle to integrate structured and unstructured data in real time and to accommodate dynamic, cross-functional workflows. In this paper, we present the first AI-native, agent-based framework for ERP systems, introducing a novel architecture of Generative Business Process AI Agents (GBPAs) that bring autonomy, reasoning, and dynamic optimization to enterprise workflows. The proposed system integrates generative AI with business process modeling and multi-agent orchestration, enabling end-to-end automation of complex tasks such as budget planning, financial reporting, and wire transfer processing. Unlike traditional workflow engines, GBPAs interpret user intent, synthesize workflows in real time, and coordinate specialized sub-agents for modular task execution. We validate the framework through case studies in bank wire transfers and employee reimbursements, two representative financial workflows with distinct complexity and data modalities. Results show that GBPAs achieve up to 40% reduction in processing time, 94% drop in error rate, and improved regulatory compliance by enabling parallelism, risk control insertion, and semantic reasoning. These findings highlight the potential of GBPAs to bridge the gap between generative AI capabilities and enterprise-grade automation, laying the groundwork for the next generation of intelligent ERP systems. Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE); General Finance (q-fin.GN) Cite as: arXiv:2506.01423 [cs.AI] (or arXiv:2506.01423v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2506.01423 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-48] System Calls for Malware Detection and Classification: Methodologies and Applications

【速读】:该论文试图解决日益复杂且难以检测的恶意软件(malware)所带来的安全威胁,其核心问题是如何有效识别和分类恶意行为。解决方案的关键在于利用系统调用(system calls)和API调用,这些调用是用户应用程序与操作系统及其内核之间的核心通信机制,能够提供软件行为的有价值信息,从而用于检测可疑或有害活动。通过结合静态分析、动态分析、沙箱技术以及机器学习、统计分析和异常检测等高级方法,研究人员可以分析系统调用模式,区分正常与恶意行为,并在不同系统(如Windows、Linux和Android)中应用这些技术。

链接: https://arxiv.org/abs/2506.01412
作者: Bishwajit Prasad Gond,Durga Prasad Mohapatra
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As malware continues to become more complex and harder to detect, Malware Analysis needs to continue to evolve to stay one step ahead. One promising key area approach focuses on using system calls and API Calls, the core communication between user applications and the operating system and their kernels. These calls provide valuable insight into how software or programs behaves, making them an useful tool for spotting suspicious or harmful activity of programs and software. This chapter takes a deep down look at how system calls are used in malware detection and classification, covering techniques like static and dynamic analysis, as well as sandboxing. By combining these methods with advanced techniques like machine learning, statistical analysis, and anomaly detection, researchers can analyze system call patterns to tell the difference between normal and malicious behavior. The chapter also explores how these techniques are applied across different systems, including Windows, Linux, and Android, while also looking at the ways sophisticated malware tries to evade detection.
zh

[AI-49] Compiler Optimization via LLM Reasoning for Efficient Model Serving

【速读】:该论文试图解决大规模模型服务中由于高昂成本导致的可访问性和创新速度受限问题,特别是现有编译器在处理神经网络工作负载时因转换空间庞大且高度依赖而难以实现有效优化的问题。其解决方案的关键在于提出一种名为REASONING COMPILER的新编译框架,该框架将优化过程建模为一个由大语言模型(Large Language Model, LLM)引导的顺序性、上下文感知的决策过程,并结合结构化蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来平衡探索与利用,从而显著提升样本效率。

链接: https://arxiv.org/abs/2506.01374
作者: Sujun Tang,Christopher Priebe,Rohan Mahapatra,Lianhui Qin,Hadi Esmaeilzadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We set out to investigate the research question of whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimization to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed REASONING COMPILER) that formulates optimization as a sequential, context-aware decision process, guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-aware transformations that reflect the current program state and accumulated performance feedback. Monte Carlo tree search (MCTS) incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.
zh

[AI-50] Incentivizing LLM s to Self-Verify Their Answers

【速读】:该论文试图解决在特定推理任务上微调后的大型语言模型(Large Language Models, LLMs)在测试时扩展性能有限的问题,其核心问题是由于特定微调生成器与通用奖励模型之间的分布差异导致的改进效果有限。解决方案的关键在于提出一种框架,通过在单一强化学习(Reinforcement Learning, RL)过程中统一答案生成与验证,激励LLMs自我验证其答案,从而有效评估自身解题的正确性,并在推理阶段通过自我验证进一步提升性能,而无需依赖外部验证器。

链接: https://arxiv.org/abs/2506.01369
作者: Fuxiang Zhang,Jiacheng Xu,Chaojie Wang,Ce Cui,Yang Liu,Bo An
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance during inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating its capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling. Our code is available at this https URL.
zh

[AI-51] Unraveling Spatio-Temporal Foundation Models via the Pipeline Lens: A Comprehensive Review

【速读】:该论文试图解决传统时空深度学习模型在面对不同应用场景时需要单独训练所导致的计算和存储成本过高的问题。其解决方案的关键在于提出一种统一的时空基础模型框架,该框架通过学习时空数据中的通用知识或迁移预训练语言模型的通用能力,实现对多种时空任务的有效处理。此外,论文从流程视角系统性地梳理了时空基础模型的构建过程,包括数据类型介绍、预处理与嵌入技术、数据属性分类、模型训练目标及适应技术,为研究人员提供了清晰且结构化的指导。

链接: https://arxiv.org/abs/2506.01364
作者: Yuchen Fang,Hao Miao,Yuxuan Liang,Liwei Deng,Yue Cui,Ximu Zeng,Yuyang Xia,Yan Zhao,Torben Bach Pedersen,Christian S. Jensen,Xiaofang Zhou,Kai Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 10 figures

点击查看摘要

Abstract:Spatio-temporal deep learning models aims to utilize useful patterns in such data to support tasks like prediction. However, previous deep learning models designed for specific tasks typically require separate training for each use case, leading to increased computational and storage costs. To address this issue, spatio-temporal foundation models have emerged, offering a unified framework capable of solving multiple spatio-temporal tasks. These foundation models achieve remarkable success by learning general knowledge with spatio-temporal data or transferring the general capabilities of pre-trained language models. While previous surveys have explored spatio-temporal data and methodologies separately, they have ignored a comprehensive examination of how foundation models are designed, selected, pre-trained, and adapted. As a result, the overall pipeline for spatio-temporal foundation models remains unclear. To bridge this gap, we innovatively provide an up-to-date review of previous spatio-temporal foundation models from the pipeline perspective. The pipeline begins with an introduction to different types of spatio-temporal data, followed by details of data preprocessing and embedding techniques. The pipeline then presents a novel data property taxonomy to divide existing methods according to data sources and dependencies, providing efficient and effective model design and selection for researchers. On this basis, we further illustrate the training objectives of primitive models, as well as the adaptation techniques of transferred models. Overall, our survey provides a clear and structured pipeline to understand the connection between core elements of spatio-temporal foundation models while guiding researchers to get started quickly. Additionally, we introduce emerging opportunities such as multi-objective training in the field of spatio-temporal foundation models.
zh

[AI-52] NoiseAR: AutoRegressing Initial Noise Prior for Diffusion Models

【速读】:该论文旨在解决扩散模型在初始噪声采样阶段缺乏结构和可控性的问题,传统方法通常采用静态、无结构的分布(如各向同性高斯分布)作为初始状态,难以实现对外部条件的有效控制。其解决方案的关键在于提出NoiseAR,这是一种基于自回归(Autoregressive)的初始噪声先验学习方法,通过将初始噪声先验参数的生成建模为对空间块或标记的自回归概率建模任务,从而生成动态且可控制的先验分布,使文本提示能够直接影响先验分布,实现对扩散初始化的细粒度控制。

链接: https://arxiv.org/abs/2506.01337
作者: Zeming Li,Xiangyue Liu,Xiangyu Zhang,Ping Tan,Heung-Yeung Shum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have emerged as powerful generative frameworks, creating data samples by progressively denoising an initial random state. Traditionally, this initial state is sampled from a simple, fixed distribution like isotropic Gaussian, inherently lacking structure and a direct mechanism for external control. While recent efforts have explored ways to introduce controllability into the diffusion process, particularly at the initialization stage, they often rely on deterministic or heuristic approaches. These methods can be suboptimal, lack expressiveness, and are difficult to scale or integrate into more sophisticated optimization frameworks. In this paper, we introduce NoiseAR, a novel method for AutoRegressive Initial Noise Prior for Diffusion Models. Instead of a static, unstructured source, NoiseAR learns to generate a dynamic and controllable prior distribution for the initial noise. We formulate the generation of the initial noise prior’s parameters as an autoregressive probabilistic modeling task over spatial patches or tokens. This approach enables NoiseAR to capture complex spatial dependencies and introduce learned structure into the initial state. Crucially, NoiseAR is designed to be conditional, allowing text prompts to directly influence the learned prior, thereby achieving fine-grained control over the diffusion initialization. Our experiments demonstrate that NoiseAR can generate initial noise priors that lead to improved sample quality and enhanced consistency with conditional inputs, offering a powerful, learned alternative to traditional random initialization. A key advantage of NoiseAR is its probabilistic formulation, which naturally supports seamless integration into probabilistic frameworks like Markov Decision Processes and Reinforcement Learning. Our code will be available at this https URL
zh

[AI-53] ETDI: Mitigating Tool Squatting and Rug Pull Attacks in Model Context Protocol (MCP) by using OAuth-Enhanced Tool Definitions and Policy-Based Access Control

【速读】:该论文试图解决标准Model Context Protocol (MCP)在与外部工具和数据源集成时存在的安全漏洞问题,特别是Tool Poisoning和Rug Pull攻击。解决方案的关键在于引入增强的工具定义接口(Enhanced Tool Definition Interface, ETDI),其核心要素包括加密身份验证、不可变版本化工具定义以及显式权限管理,并通常结合OAuth 2.0实现。此外,论文还提出了基于细粒度策略的访问控制扩展,通过专用策略引擎在运行时上下文中动态评估工具能力,超越传统静态OAuth作用域的限制,从而构建更安全、可信和可控的人工智能应用生态系统。

链接: https://arxiv.org/abs/2506.01333
作者: Manish Bhatt,Vineeth Sai Narajala,Idan Habler
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 11 Pages, 10 figures, Github links in introduction

点击查看摘要

Abstract:The Model Context Protocol (MCP) plays a crucial role in extending the capabilities of Large Language Models (LLMs) by enabling integration with external tools and data sources. However, the standard MCP specification presents significant security vulnerabilities, notably Tool Poisoning and Rug Pull attacks. This paper introduces the Enhanced Tool Definition Interface (ETDI), a security extension designed to fortify MCP. ETDI incorporates cryptographic identity verification, immutable versioned tool definitions, and explicit permission management, often leveraging OAuth 2.0. We further propose extending MCP with fine-grained, policy-based access control, where tool capabilities are dynamically evaluated against explicit policies using a dedicated policy engine, considering runtime context beyond static OAuth scopes. This layered approach aims to establish a more secure, trustworthy, and controllable ecosystem for AI applications interacting with LLMs and external tools.
zh

[AI-54] STSA: Federated Class-Incremental Learning via Spatial-Temporal Statistics Aggregation

【速读】:该论文旨在解决联邦类增量学习(Federated Class-Incremental Learning, FCIL)中因数据异质性导致的空间-时间客户端漂移问题,以及现有方法在计算和通信开销上的不足。其解决方案的关键在于提出一种新的统一框架——空间-时间统计聚合(Spatial-Temporal Statistics Aggregation, STSA),该框架能够同时在空间维度(跨客户端)和时间维度(跨阶段)聚合特征统计信息,从而不受数据异质性影响,并在每个阶段以闭式解更新分类器。此外,还引入了通信高效的变体STSA-E,具有理论保障且显著降低了通信开销。

链接: https://arxiv.org/abs/2506.01327
作者: Zenghao Guan,Guojun Zhu,Yucan Zhou,Wu Liu,Weiping Wang,Jiebo Luo,Xiaoyan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Class-Incremental Learning (FCIL) enables Class-Incremental Learning (CIL) from distributed data. Existing FCIL methods typically integrate old knowledge preservation into local client training. However, these methods cannot avoid spatial-temporal client drift caused by data heterogeneity and often incur significant computational and communication overhead, limiting practical deployment. To address these challenges simultaneously, we propose a novel approach, Spatial-Temporal Statistics Aggregation (STSA), which provides a unified framework to aggregate feature statistics both spatially (across clients) and temporally (across stages). The aggregated feature statistics are unaffected by data heterogeneity and can be used to update the classifier in closed form at each stage. Additionally, we introduce STSA-E, a communication-efficient variant with theoretical guarantees, achieving similar performance to STSA-E with much lower communication overhead. Extensive experiments on three widely used FCIL datasets, with varying degrees of data heterogeneity, show that our method outperforms state-of-the-art FCIL methods in terms of performance, flexibility, and both communication and computation efficiency.
zh

[AI-55] ORMind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research

【速读】:该论文试图解决在工业相关运筹学(Operations Research, OR)问题中,大型语言模型(Large Language Models, LLMs)应用所面临的两个关键部署挑战:一是自我纠正机制关注代码语法而非数学准确性,导致成本高昂的错误;二是复杂专家选择造成不可预测的工作流程,降低透明度并增加维护成本,使得其难以应用于对时间敏感的业务场景。解决方案的关键在于提出ORMind框架,该框架受认知启发,通过反事实推理增强优化能力,实现从需求到数学模型和可执行求解器代码的端到端工作流,从而提升解决OR问题的效率与准确性。

链接: https://arxiv.org/abs/2506.01326
作者: Zhiyuan Wang,Bokui Chen,Yinya Huang,Qingxing Cao,Ming He,Jianping Fan,Xiaodan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Annual Meetings of the Association for Computational Linguistics 2025

点击查看摘要

Abstract:Operations research (OR) is widely deployed to solve critical decision-making problems with complex objectives and constraints, impacting manufacturing, logistics, finance, and healthcare outcomes. While Large Language Models (LLMs) have shown promising results in various domains, their practical application in industry-relevant operations research (OR) problems presents significant challenges and opportunities. Preliminary industrial applications of LLMs for operations research face two critical deployment challenges: 1) Self-correction focuses on code syntax rather than mathematical accuracy, causing costly errors; 2) Complex expert selection creates unpredictable workflows that reduce transparency and increase maintenance costs, making them impractical for time-sensitive business applications. To address these business limitations, we introduce ORMind, a cognitive-inspired framework that enhances optimization through counterfactual reasoning. Our approach emulates human cognition, implementing an end-to-end workflow that systematically transforms requirements into mathematical models and executable solver code. It is currently being tested internally in Lenovo’s AI Assistant, with plans to enhance optimization capabilities for both business and consumer customers. Experiments demonstrate that ORMind outperforms existing methods, achieving a 9.5% improvement on the NL4Opt dataset and a 14.6% improvement on the ComplexOR dataset.
zh

[AI-56] Unlearnings Blind Spots: Over-Unlearning and Prototypical Relearning Attack NEURIPS2025

【速读】:该论文旨在解决机器遗忘(Machine Unlearning, MU)中的两个关键盲点问题:过量遗忘(over-unlearning)导致保留数据附近区域的性能退化,以及事后重学习攻击(relearning attacks)试图恢复被遗忘的知识。其解决方案的关键在于提出Spotter,一个即插即用的目标函数,通过在遗忘集附近区域引入掩码知识蒸馏惩罚以抑制过量遗忘指标OU@\epsilon,并结合类内离散损失以打散遗忘类的嵌入,从而有效抵御原型重学习攻击。

链接: https://arxiv.org/abs/2506.01318
作者: SeungBum Ha,Saerom Park,Sung Whan Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 2 tables, Under review at NeurIPS 2025

点击查看摘要

Abstract:Machine unlearning (MU) aims to expunge a designated forget set from a trained model without costly retraining, yet the existing techniques overlook two critical blind spots: “over-unlearning” that deteriorates retained data near the forget set, and post-hoc “relearning” attacks that aim to resurrect the forgotten knowledge. We first derive the over-unlearning metric OU@\epsilon, which represents the collateral damage to the nearby region of the forget set, where the over-unlearning mainly appears. Next, we expose an unforeseen relearning threat on MU, i.e., the Prototypical Relearning Attack, which exploits the per-class prototype of the forget class with just a few samples, and easily restores the pre-unlearning performance. To counter both blind spots, we introduce Spotter, a plug-and-play objective that combines (i) a masked knowledge-distillation penalty on the nearby region of forget set to suppress OU@\epsilon, and (ii) an intra-class dispersion loss that scatters forget-class embeddings, neutralizing prototypical relearning attacks. On CIFAR-10, as one of validations, Spotter reduces OU@\epsilonby below the 0.05X of the baseline, drives forget accuracy to 0%, preserves accuracy of the retain set within 1% of difference with the original, and denies the prototype-attack by keeping the forget set accuracy within 1%, without accessing retained data. It confirms that Spotter is a practical remedy of the unlearning’s blind spots.
zh

[AI-57] -SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning

【速读】:该论文试图解决指令微调数据选择中的两个关键问题:一是现有方法仅在样本层面评估质量,忽略了令牌级别的信息量;二是评分方法的鲁棒性不足,容易因表面的词汇特征而非真实质量选择样本。解决方案的关键在于提出一种新的数据选择框架T-SHIRT,该框架引入了一种新的评分方法,仅包含具有信息量的令牌进行质量评估,并且鼓励选择局部一致性低、邻居也表现出高质量的稳健样本。

链接: https://arxiv.org/abs/2506.01317
作者: Yanjun Fu,Faisal Hamman,Sanghamitra Dutta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high-quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promotes robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples using 40 minutes on a single GPU.
zh

[AI-58] Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在安全性和生成内容质量方面存在的问题,特别是针对文本驱动的越狱攻击(jailbreak attacks)所带来的安全隐患。其解决方案的关键在于提出一种统一的多模态通用越狱攻击框架,通过迭代的图像-文本交互和基于迁移的策略,生成具有普遍攻击效果的对抗性后缀和图像,从而揭示多模态交互可能成为关键的安全漏洞,并验证多模态通用越狱攻击能够引发更高质量的不良生成内容。

链接: https://arxiv.org/abs/2506.01307
作者: Youze Wang,Wenbo Hu,Yinpeng Dong,Jing Liu,Hanwang Zhang,Richang Hong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have evolved into Multimodal Large Language Models (MLLMs), significantly enhancing their capabilities by integrating visual information and other types, thus aligning more closely with the nature of human intelligence, which processes a variety of data forms beyond just text. Despite advancements, the undesirable generation of these models remains a critical concern, particularly due to vulnerabilities exposed by text-based jailbreak attacks, which have represented a significant threat by challenging existing safety protocols. Motivated by the unique security risks posed by the integration of new and old modalities for MLLMs, we propose a unified multimodal universal jailbreak attack framework that leverages iterative image-text interactions and transfer-based strategy to generate a universal adversarial suffix and image. Our work not only highlights the interaction of image-text modalities can be used as a critical vulnerability but also validates that multimodal universal jailbreak attacks can bring higher-quality undesirable generations across different MLLMs. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP, and reveal significant multimodal safety alignment issues, highlighting the inadequacy of current safety mechanisms against sophisticated multimodal attacks. This study underscores the urgent need for robust safety measures in MLLMs, advocating for a comprehensive review and enhancement of security protocols to mitigate potential risks associated with multimodal capabilities.
zh

[AI-59] Scalable In-Context Q-Learning

【速读】:该论文旨在解决在上下文强化学习(In-Context Reinforcement Learning, ICRL)中,现有方法在从次优轨迹中学习和实现精确的上下文推理时所面临的挑战。其解决方案的关键在于提出一种名为SICQL(Scalable In-Context Q-Learning)的框架,该框架结合动态规划与世界建模技术,以实现高效的奖励最大化和任务泛化,同时保持监督预训练的可扩展性和稳定性。SICQL通过基于提示的多头Transformer架构,同时预测最优策略和上下文价值函数,并利用预训练的通用世界模型构建紧凑提示,从而支持快速且精确的上下文推理。

链接: https://arxiv.org/abs/2506.01299
作者: Jinmei Liu,Fuhong Liu,Jianye Hao,Bo Wang,Huaxiong Li,Chunlin Chen,Zhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Due to involving more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In the paper, we propose \textbfScalable \textbfIn-\textbfContext \textbfQ-\textbfLearning (\textbfSICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper-expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at this https URL
zh

[AI-60] MobCLIP: Learning General-purpose Geospatial Representation at Scale

【速读】:该论文旨在解决地理空间位置表征学习在实现通用地理空间智能中的核心挑战,当前的嵌入方法缺乏通用性,限制了其在人类和自然领域多样化任务中的应用。解决方案的关键在于提出MobCLIP,这是首个全国范围的通用位置编码器,通过有效的多模态融合整合了前所未有的数据模态多样性,采用基于CLIP的架构,将1亿多个兴趣点(Points of Interest, POIs)、全国范围的遥感影像和结构化人口统计数据与十亿边的移动性图谱对齐,并通过受视觉Transformer启发的网格单元标记化方法建立统一的表示空间,从而连接移动模式与多模态特征。

链接: https://arxiv.org/abs/2506.01297
作者: Ya Wen,Jixuan Cai,Qiyao Ma,Linyan Li,Xinhua Chen,Chris Webster,Yulun Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence. Current embedding methods often lack versatility, limiting their utility across diverse tasks in both human and natural domains. We present MobCLIP, the first nationwide general-purpose location encoder, integrating an unprecedented diversity of data modalities through effective and scalable multimodal fusion. Adopting a novel CLIP-based architecture, our framework aligns 100M+ POIs, nationwide remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph. By tokenizing spatial locations into grid cells inspired by Vision Transformers, we establish a unified representation space bridging mobility patterns and multimodal features. To rigorously evaluate the general-purpose effectiveness of MobCLIP, we construct a benchmark dataset composed of 11 downstream prediction tasks across social, economic, and natural domains. Experiments show that MobCLIP, with four input modalities and a compact 128-dimensional representation space, achieves significantly superior general-purpose predictive performances than state-of-the-art models by an average of 35%. Thanks to the effective integration of human-centric modalities, the performance gain is particularly profound in human-centric tasks, such as energy consumption (+260%), offline retail consumption amount (+98%), and crime cases (+95%) predictions. Echoing LLM scaling laws, we further demonstrate the scaling behavior in geospatial representation learning. We open-source code and pretrained models at: this http URL.
zh

[AI-61] SRating: Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment

【速读】:该论文试图解决跨领域时间序列(Time Series, TS)数据质量评估的准确性与效率问题,现有方法在单一领域内表现良好,但无法有效适应不同领域间TS数据性质差异带来的挑战。解决方案的关键在于提出TSRating框架,其核心假设是大型语言模型(Large Language Models, LLMs)通过预训练获得了丰富的知识,能够理解并区分不同TS数据的质量差异。通过设计特定提示词引导LLMs进行质量比较,并构建专门的质量评分模型TSRater,结合元学习策略提升跨领域适应能力,同时采用signSGD优化训练效率,从而实现对多样化TS数据的高效且准确的质量评估。

链接: https://arxiv.org/abs/2506.01290
作者: Shunyu Wu,Dan Li,Haozheng Ye,Zhuomin Chen,Jiahui Zhou,Jian Lou,Zibin Zheng,See-Kiong Ng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality time series (TS) data are essential for ensuring TS model performance, rendering research on rating TS data quality indispensable. Existing methods have shown promising rating accuracy within individual domains, primarily by extending data quality rating techniques such as influence functions and Shapley values to account for temporal characteristics. However, they neglect the fact that real-world TS data can span vastly different domains and exhibit distinct properties, hampering the accurate and efficient rating of diverse TS data. In this paper, we propose TSRating, a novel and unified framework for rating the quality of time series data crawled from diverse domains. TSRating is built on the assumption that LLMs inherit ample knowledge, acquired during their extensive pretraining, enabling them to comprehend and discern quality differences in diverse TS data. We verify this assumption by devising a series of prompts to elicit quality comparisons from LLMs for pairs of TS samples. We then fit a dedicated rating model, termed TSRater, to convert the LLMs’ judgments into efficient quality predictions via TSRater’s inference on future TS samples. To ensure cross-domain adaptability, we develop a meta-learning scheme to train TSRater on quality comparisons collected from nine distinct domains. To improve training efficiency, we employ signSGD for inner-loop updates, thus circumventing the demanding computation of hypergradients. Extensive experimental results on eleven benchmark datasets across three time series tasks, each using both conventional TS models and TS foundation models, demonstrate that TSRating outperforms baselines in terms of estimation accuracy, efficiency, and domain adaptability.
zh

[AI-62] On the Hardness of Approximating Distributions with Probabilistic Circuits

【速读】:该论文试图解决概率建模中的表达能力与可 tractable 推理之间的权衡问题(tradeoff between expressivity and tractable inference)。其解决方案的关键在于通过允许一定的小近似误差来避免在精确表示分布时出现的指数级电路规模膨胀。研究首先证明了对于任何能够有效计算边缘概率的模型,用有界 f-散度近似任意分布是 NP\mathsf{NP}-难的,随后证明了可分解概率电路(decomposable PCs)与额外确定性概率电路(additionally deterministic PCs)在近似能力上存在指数级的规模差距。

链接: https://arxiv.org/abs/2506.01281
作者: John Leland,YooJung Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A fundamental challenge in probabilistic modeling is balancing expressivity and tractable inference. Probabilistic circuits (PCs) aim to directly address this tradeoff by imposing structural constraints that guarantee efficient inference of certain queries while maintaining expressivity. Since inference complexity on PCs depends on circuit size, understanding the size bounds across circuit families is key to characterizing the tradeoff between tractability and expressive efficiency. However, expressive efficiency is often studied through exact representations, where exactly encoding distributions while enforcing various structural properties often incurs exponential size blow-ups. Thus, we pose the following question: can we avoid such size blow-ups by allowing some small approximation error? We first show that approximating an arbitrary distribution with bounded f -divergence is \mathsfNP -hard for any model that can tractably compute marginals. We then prove an exponential size gap for approximation between the class of decomposable PCs and additionally deterministic PCs.
zh

[AI-63] GeoLocSFT: Efficient Visual Geolocation via Supervised Fine-Tuning of Multimodal Foundation Models

【速读】:该论文旨在解决单张图像的视觉地理定位(visual geolocation)问题,即准确确定一张图像的拍摄地理位置,这一任务因地球范围广阔及远距离地点之间的视觉相似性而极具挑战性。论文提出的解决方案关键在于通过针对监督微调(Supervised Fine-Tuning, SFT)对大型多模态基础模型(Gemma 3)进行高效优化,仅使用少量高质量的图像-地理坐标对(2700组)进行训练,便实现了卓越的地理定位性能。该方法在多个标准基准测试中表现优异,证明了高质量监督信号与高效SFT在大规模图像地理定位中的有效性。

链接: https://arxiv.org/abs/2506.01277
作者: Qiang Yi,Lianlei Shan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 14 figures

点击查看摘要

Abstract:Accurately determining the geographic location where a single image was taken, visual geolocation, remains a formidable challenge due to the planet’s vastness and the deceptive similarity among distant locations. We introduce GeoLocSFT, a framework that demonstrates how targeted supervised fine-tuning (SFT) of a large multimodal foundation model (Gemma 3) using a small, high-quality dataset can yield highly competitive geolocation performance. GeoLocSFT is trained with only 2700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset. Despite this limited data, our SFT-centric approach substantially improves over baseline models and achieves robust results on standard benchmarks such as Im2GPS-3k and YFCC-4k, as well as on our newly proposed and challenging MR40k benchmark, aimed specifically at sparsely populated regions. Further, we explore multi-candidate inference and aggregation strategies but find that the core gains are already realized at the SFT stage. Our findings highlight the power of high-quality supervision and efficient SFT for planet-scale image geolocation, especially when compared to prior methods that require massive databases or complex pipelines. To foster further research, we publicly release the MR40k benchmark dataset.
zh

[AI-64] Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio Video Image and 3D

【速读】:该论文试图解决多模态模型在面对自然语言提示时,能否进行对比推理以选择最相关模态的问题(contrastive cross-modal reasoning)。解决方案的关键在于构建了一个包含四种模态(图像、音频、视频和3D)的基准数据集Contra4,该数据集通过结合人工标注的描述和混合模型的往返一致性过滤器确保监督信号的质量,从而有效评估模型在多模态场景下的语义对齐能力。

链接: https://arxiv.org/abs/2506.01275
作者: Artemis Panagopoulou,Le Xue,Honglu Zhou,silvio savarese,Ran Xu,Caiming Xiong,Chris Callison-Burch,Mark Yatskar,Juan Carlos Niebles
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is foundational, especially in retrieval-augmented and decision-time contexts, where systems must evaluate multiple signals and identify which one conveys the relevant information. To evaluate this skill, we introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision, resulting in 174k training examples and a manually verified test set of 2.3k samples. While task-specific fine-tuning improves performance by 56% relative to baseline, state-of-the-art models still achieve only 56% accuracy overall and 42% in four-modality settings, underscoring a significant limitation in current multimodal models.
zh

[AI-65] RAISE: Reasoning Agent for Interactive SQL Exploration

【速读】:该论文旨在解决现有文本到SQL(text-to-SQL)系统依赖复杂多阶段流水线的问题,从而提升自然语言接口到数据库的性能与效率。其解决方案的关键在于提出了一种新颖的代理框架,将模式链接、查询生成和迭代优化统一在一个端到端的组件中,利用大语言模型(LLM)的内在推理能力,模拟人类在面对不熟悉数据库时的问答过程,包括假设构建、动态查询验证、结果推理及输出修正。该方法通过扩展测试时计算的深度,实现对数据库探索与反思的动态计算分配,尤其在模糊和未充分指定的场景中表现出色。

链接: https://arxiv.org/abs/2506.01273
作者: Fernando Granado,Roberto Lotufo,Jayr Pereira
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have propelled research in natural language interfaces to databases. However, most state-of-the-art text-to- SQL systems still depend on complex, multi-stage pipelines. This work proposes a novel agentic framework that unifies schema linking, query generation, and itera- tive refinement within a single, end-to-end component. By leveraging the intrinsic reasoning abilities of LLMs, our method emulates how humans answer questions when working with unfamiliar databases: understanding the data by formulating hypotheses, running dynamic queries to validate them, reasoning over the results, and revising outputs based on observed results. Crucially, our approach intro- duces a new strategy for scaling test-time computation in text-to-SQL: we scale the depth of interactive database exploration and reflection. This shift enables the model to allocate computation dynamically to better understand the data, especially useful in ambiguous and underspecified scenarios. Our experiments show that it improved the Execution Accuracy (EX) from 44.8% to 56.5% on the challenging BIRD dataset using DeepSeek-R1-Distill-Llama-70B. Fur- thermore, when equipped with steps to add more diversity to the answers, our agent achieves a Best-of-N accuracy of 81.8% with 8 rounds of candidate gener- ation, rivaling the 82.79% achieved by the top-ranked published solution, while reducing engineering complexity. These findings position our unified framework as a promising alternative for building natural language interfaces to databases.
zh

[AI-66] CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction

【速读】:该论文旨在解决传统对话系统在实时性、交互自然度及对话控制灵活性方面的不足。其解决方案的关键在于提出CleanS2S框架,该框架通过将自动语音识别、大语言模型和文本到语音合成整合为统一的流水线,并利用全双工WebSocket连接和非阻塞I/O实现低延迟的实时中断处理。此外,该框架引入了主动交互机制,结合记忆系统与主观行为判断模块,支持五种类人响应策略,从而突破了传统基于回合的对话范式,实现了系统主导的对话控制与上下文感知的响应选择。

链接: https://arxiv.org/abs/2506.01268
作者: Yudong Lu,Yazhe Niu,Shuai Hu,Haolin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:CleanS2S is a framework for human-like speech-to-speech interaction that advances conversational AI through single-file implementation and proactive dialogue capabilities. Our system integrates automatic speech recognition, large language models, and text-to-speech synthesis into a unified pipeline with real-time interruption handling, achieving low transition latency through full-duplex websocket connections and non-blocking I/O. Beyond conventional chatbot paradigms, we pioneer a proactive interaction mechanism, which combines memory systems with Subjective Action Judgement module, enabling five human-like response strategies: interruption, refusal, deflection, silence, and standard response. The memory module dynamically aggregates historical, and contextual data to inform interaction decisions. This approach breaks the rigid turn-based convention by allowing system-initiated dialog control and context-aware response selection. And we propose Action Judgement SFT that assesses input streams for responses strategies. The framework’s single-file implementation with atomic configurations offers researchers unprecedented transparency and extensibility for interaction agents. The code of CleanS2S is released at \this https URL.
zh

[AI-67] General search techniques without common knowledge for imperfect-information games and application to superhuman Fog of War chess

【速读】:该论文试图解决在不完全信息棋类游戏(Fog of War chess)中实现超人类水平的智能决策问题,这类游戏相较于传统国际象棋增加了信息获取、对手知识推断和信号传递等复杂因素。解决方案的关键在于提出Obscuro,这是首个在该领域达到超人类水平的AI,其核心创新在于改进了不完全信息博弈中的搜索技术,从而实现了强大且可扩展的推理能力。

链接: https://arxiv.org/abs/2506.01242
作者: Brian Hu Zhang,Tuomas Sandholm
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Since the advent of AI, games have served as progress benchmarks. Meanwhile, imperfect-information variants of chess have existed for over a century, present extreme challenges, and have been the focus of significant AI research. Beyond calculation needed in regular chess, they require reasoning about information gathering, the opponent’s knowledge, signaling, etc. The most popular variant, Fog of War (FoW) chess (aka. dark chess) is a recognized challenge problem in AI after superhuman performance was reached in no-limit Texas hold’em poker. We present Obscuro, the first superhuman AI for FoW chess. It introduces advances to search in imperfect-information games, enabling strong, scalable reasoning. Experiments against the prior state-of-the-art AI and human players – including the world’s best – show that Obscuro is significantly stronger. FoW chess is the largest (by amount of imperfect information) turn-based game in which superhuman performance has been achieved and the largest game in which imperfect-information search has been successfully applied.
zh

[AI-68] Retrieval-Augmented Generation of Ontologies from Relational Databases

【速读】:该论文试图解决从关系型数据库(Relational Database, RDB)生成丰富语义本体(Ontology)的问题,现有方法要么需要大量人工参与来构建本体,要么仅能生成基础本体。解决方案的关键在于提出RIGOR(Retrieval-augmented Iterative Generation of RDB Ontologies),这是一个基于大语言模型(LLM)的方法,通过融合数据库模式及其文档、领域本体库和一个不断扩展的核心本体,利用检索增强生成(RAG)技术,迭代生成带有来源标记的本体片段,并通过判别模型进行优化后合并至核心本体,从而在减少人工干预的同时生成高质量的OWL本体。

链接: https://arxiv.org/abs/2506.01232
作者: Mojtaba Nayyeri,Athish A Yogi,Nadeen Fathallah,Ratan Bahadur Thapa,Hans-Michael Tautenhahn,Anton Schnurpel,Steffen Staab
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Transforming relational databases into knowledge graphs with enriched ontologies enhances semantic interoperability and unlocks advanced graph-based learning and reasoning over data. However, previous approaches either demand significant manual effort to derive an ontology from a database schema or produce only a basic ontology. We present RIGOR, Retrieval-augmented Iterative Generation of RDB Ontologies, an LLM-driven approach that turns relational schemas into rich OWL ontologies with minimal human effort. RIGOR combines three sources via RAG, the database schema and its documentation, a repository of domain ontologies, and a growing core ontology, to prompt a generative LLM for producing successive, provenance-tagged delta ontology fragments. Each fragment is refined by a judge-LLM before being merged into the core ontology, and the process iterates table-by-table following foreign key constraints until coverage is complete. Applied to real-world databases, our approach outputs ontologies that score highly on standard quality dimensions such as accuracy, completeness, conciseness, adaptability, clarity, and consistency, while substantially reducing manual effort.
zh

[AI-69] owards Efficient Few-shot Graph Neural Architecture Search via Partitioning Gradient Contribution KDD2025

【速读】:该论文旨在解决超网络(supernet)中的权重耦合问题,该问题主要源于后续层中不同模块对前序层模块施加冲突的梯度方向。其解决方案的关键在于提出了一种基于梯度贡献(Gradient Contribution, GC)的方法,通过分解超网络反向传播过程中的向量-雅可比乘积,高效计算模块间梯度方向的余弦相似性,从而将存在冲突梯度方向的模块分配到不同的子超网络,而将梯度方向相似的模块进行聚合。此方法显著提升了超网络划分的质量与计算效率。

链接: https://arxiv.org/abs/2506.01231
作者: Wenhao Song,Xuan Wu,Bo Yang,You Zhou,Yubin Xiao,Yanchun Liang,Hongwei Ge,Heow Pueh Lee,Chunguo Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by SIGKDD 2025

点击查看摘要

Abstract:To address the weight coupling problem, certain studies introduced few-shot Neural Architecture Search (NAS) methods, which partition the supernet into multiple sub-supernets. However, these methods often suffer from computational inefficiency and tend to provide suboptimal partitioning schemes. To address this problem more effectively, we analyze the weight coupling problem from a novel perspective, which primarily stems from distinct modules in succeeding layers imposing conflicting gradient directions on the preceding layer modules. Based on this perspective, we propose the Gradient Contribution (GC) method that efficiently computes the cosine similarity of gradient directions among modules by decomposing the Vector-Jacobian Product during supernet backpropagation. Subsequently, the modules with conflicting gradient directions are allocated to distinct sub-supernets while similar ones are grouped together. To assess the advantages of GC and address the limitations of existing Graph Neural Architecture Search methods, which are limited to searching a single type of Graph Neural Networks (Message Passing Neural Networks (MPNNs) or Graph Transformers (GTs)), we propose the Unified Graph Neural Architecture Search (UGAS) framework, which explores optimal combinations of MPNNs and GTs. The experimental results demonstrate that GC achieves state-of-the-art (SOTA) performance in supernet partitioning quality and time efficiency. In addition, the architectures searched by UGAS+GC outperform both the manually designed GNNs and those obtained by existing NAS methods. Finally, ablation studies further demonstrate the effectiveness of all proposed methods.
zh

[AI-70] SPEAR: Security Posture Evaluation using AI Planner-Reasoning on Attack-Connectivity Hypergraphs

【速读】:该论文试图解决在网络安全加固过程中,如何将网络连接参数纳入攻击图建模、在信息不完整的情况下对攻击图进行推理、以可理解的方式向系统管理员提供建议,以及支持他们进行多种场景和攻击者动机的“如果-那么”分析的问题。解决方案的关键在于提出SPEAR(Security Posture Evaluation and Analysis with a human-in-the-loop)框架,该框架利用AI规划的因果形式对网络中的漏洞和配置进行建模,并将网络配置和漏洞描述自动转换为用规划领域定义语言(PDDL)表示的规划模型,从而生成多样化的安全加固策略,供领域专家理解和系统性探索。

链接: https://arxiv.org/abs/2506.01227
作者: Rakesh Podder,Turgay Caglar,Shadaab Kawnain Bashir,Sarath Sreedharan,Indrajit Ray,Indrakshi Ray
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-based frameworks are often used in network hardening to help a cyber defender understand how a network can be attacked and how the best defenses can be deployed. However, incorporating network connectivity parameters in the attack graph, reasoning about the attack graph when we do not have access to complete information, providing system administrator suggestions in an understandable format, and allowing them to do what-if analysis on various scenarios and attacker motives is still missing. We fill this gap by presenting SPEAR, a formal framework with tool support for security posture evaluation and analysis that keeps human-in-the-loop. SPEAR uses the causal formalism of AI planning to model vulnerabilities and configurations in a networked system. It automatically converts network configurations and vulnerability descriptions into planning models expressed in the Planning Domain Definition Language (PDDL). SPEAR identifies a set of diverse security hardening strategies that can be presented in a manner understandable to the domain expert. These allow the administrator to explore the network hardening solution space in a systematic fashion and help evaluate the impact and compare the different solutions.
zh

[AI-71] st Automation for Interactive Scenarios via Promptable Traffic Simulation CVPR2025

【速读】:该论文旨在解决自动驾驶汽车(Autonomous Vehicle, AV)规划器在公共道路上广泛部署前的评估问题,特别是针对人类行为不确定性的鲁棒性评估。其关键解决方案是提出一种自动化方法,通过参数化复杂的人类行为为低维目标位置,并将其输入可提示的交通模拟器ProSim,以生成现实且安全关键的人类行为,从而高效地构建全面的测试用例。该方法利用贝叶斯优化自动探索目标空间,识别出安全关键的行为,提升了评估的效率与有效性。

链接: https://arxiv.org/abs/2506.01199
作者: Augusto Mondelli,Yueshan Li,Alessandro Zanardi,Emilio Frazzoli
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted by CVPR 2025 Workshop Data-Driven Autonomous Driving Simulation (track 1)

点击查看摘要

Abstract:Autonomous vehicle (AV) planners must undergo rigorous evaluation before widespread deployment on public roads, particularly to assess their robustness against the uncertainty of human behaviors. While recent advancements in data-driven scenario generation enable the simulation of realistic human behaviors in interactive settings, leveraging these models to construct comprehensive tests for AV planners remains an open challenge. In this work, we introduce an automated method to efficiently generate realistic and safety-critical human behaviors for AV planner evaluation in interactive scenarios. We parameterize complex human behaviors using low-dimensional goal positions, which are then fed into a promptable traffic simulator, ProSim, to guide the behaviors of simulated agents. To automate test generation, we introduce a prompt generation module that explores the goal domain and efficiently identifies safety-critical behaviors using Bayesian optimization. We apply our method to the evaluation of an optimization-based planner and demonstrate its effectiveness and efficiency in automatically generating diverse and realistic driving behaviors across scenarios with varying initial conditions.
zh

[AI-72] Doubly Robust Alignment for Large Language Models

【速读】:该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中由于偏好模型(如 Bradley-Terry 模型)、参考策略或奖励函数的错误设定所导致的模型不一致性问题,这些问题会引发不良的微调效果。解决方案的关键在于提出一种双重稳健的偏好优化算法,该算法在偏好模型或参考策略任一正确设定的情况下仍能保持一致性,而无需两者同时正确。

链接: https://arxiv.org/abs/2506.01183
作者: Erhan Xu,Kai Ye,Hongyi Zhou,Luhan Zhu,Francesco Quinzan,Chengchun Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at this https URL
zh

[AI-73] Humanoid World Models: Open World Foundation Models for Humanoid Robotics

【速读】:该论文旨在解决人形机器人在以人类为中心的环境中执行复杂任务时,需要具备鲁棒的预测模型以推理其行为结果的问题。解决方案的关键在于引入一种轻量级开源的基于视频的人形世界模型(Humanoid World Models, HWM),该模型通过生成式模型(如Masked Transformers和FlowMatching)对动作条件下的未来自我中心观察进行预测,从而实现对环境动态的建模与预测。

链接: https://arxiv.org/abs/2506.01182
作者: Muhammad Qasim Ali,Aditya Sridhar,Shahbuland Matiana,Alex Wong,Mohammad Al-Sharman
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Humanoid robots have the potential to perform complex tasks in human centered environments but require robust predictive models to reason about the outcomes of their actions. We introduce Humanoid World Models (HWM) a family of lightweight open source video based models that forecast future egocentric observations conditioned on actions. We train two types of generative models Masked Transformers and FlowMatching on 100 hours of humanoid demonstrations. Additionally we explore architectural variants with different attention mechanisms and parameter sharing strategies. Our parameter sharing techniques reduce model size by 33 to 53 with minimal impact on performance or visual fidelity. HWM is designed to be trained and deployed in practical academic and small lab settings such as 1 to 2 GPUs.
zh

[AI-74] Bridging Quantum and Classical Computing in Drug Design: Architecture Principles for Improved Molecule Generation

【速读】:该论文旨在解决如何有效利用噪声中等规模量子(NISQ)设备在药物发现中构建混合量子-经典机器学习模型的问题,特别是针对生成对抗网络(GANs)的架构优化。解决方案的关键在于通过多目标贝叶斯优化系统地优化量子-经典桥梁架构,从而显著提升模型性能,实现更高的药物候选评分(DCS),同时减少参数数量。研究发现,采用多层(3-4层)浅层(4-8量子比特)量子电路的序列化结构是提升性能的核心策略。

链接: https://arxiv.org/abs/2506.01177
作者: Andrew Smith,Erhan Guven
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Hybrid quantum-classical machine learning offers a path to leverage noisy intermediate-scale quantum (NISQ) devices for drug discovery, but optimal model architectures remain unclear. We systematically optimize the quantum-classical bridge architecture for generative adversarial networks (GANs) in molecular discovery using multi-objective Bayesian optimization. Our optimized model (BO-QGAN) significantly improves performance, achieving a 2.27-fold higher Drug Candidate Score (DCS) than prior quantum-hybrid benchmarks and 2.21-fold higher than the classical baseline, using over 60% fewer parameters. Key findings favor layering multiple (3-4) shallow (4-8 qubit) quantum circuits sequentially, while classical architecture shows less sensitivity above a minimum capacity. This work provides the first empirically grounded architectural guidelines for hybrid models, enabling more effective integration of current quantum computers into pharmaceutical research pipelines.
zh

[AI-75] GraphPad: Inference-Time 3D Scene Graph Updates for Embodied Question Answering CVPR2025

【速读】:该论文旨在解决传统静态场景表示在任务规格变化时无法动态更新,从而导致关键物体、空间关系和细节丢失的问题。其解决方案的关键在于提出GraphPad,一种可修改的结构化记忆系统,通过API调用使智能体能够根据任务需求定制场景图、导航日志和任务笔记,从而构建一个动态的工作空间,确保场景表示的完整性、时效性和与任务目标的一致性。

链接: https://arxiv.org/abs/2506.01174
作者: Muhammad Qasim Ali,Saeejith Nair,Alexander Wong,Yuchen Cui,Yuhao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CVPR 2025 Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

点击查看摘要

Abstract:Structured scene representations are a core component of embodied agents, helping to consolidate raw sensory streams into readable, modular, and searchable formats. Due to their high computational overhead, many approaches build such representations in advance of the task. However, when the task specifications change, such static approaches become inadequate as they may miss key objects, spatial relations, and details. We introduce GraphPad, a modifiable structured memory that an agent can tailor to the needs of the task through API calls. It comprises a mutable scene graph representing the environment, a navigation log indexing frame-by-frame content, and a scratchpad for task-specific notes. Together, GraphPad serves as a dynamic workspace that remains complete, current, and aligned with the agent’s immediate understanding of the scene and its task. On the OpenEQA benchmark, GraphPad attains 55.3%, a +3.0% increase over an image-only baseline using the same vision-language model, while operating with five times fewer input frames. These results show that allowing online, language-driven refinement of 3-D memory yields more informative representations without extra training or data collection.
zh

[AI-76] VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration

【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)加速器在边缘人工智能(Edge-AI)应用中效率不足的问题,通过利用高程度的非结构化稀疏性来提升计算效率。其解决方案的关键在于提出一种名为VUSA的脉动阵列架构,该架构能够根据当前的稀疏性“虚拟扩展”,在不增加物理乘加(MAC)单元数量的情况下执行更大规模的矩阵乘法,从而在相同峰值性能下分别实现面积和功耗效率提升37%和68%。

链接: https://arxiv.org/abs/2506.01166
作者: Shereef Helal,Alberto Garcia-Ortiz,Lennart Bamberg
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint accepted for publication at MOCAST 2025. Submitted for possible publication in IEEE Xplore

点击查看摘要

Abstract:Leveraging high degrees of unstructured sparsity is a promising approach to enhance the efficiency of deep neural network DNN accelerators - particularly important for emerging Edge-AI applications. We introduce VUSA, a systolic-array architecture that virtually grows based on the present sparsity to perform larger matrix multiplications with the same number of physical multiply-accumulate MAC units. The proposed architecture achieves saving by 37% and 68% in area and power efficiency, respectively, at the same peak-performance, compared to a baseline systolic array architecture in a commercial 16-nm technology. Still, the proposed architecture supports acceleration for any DNN with any sparsity - even no sparsity at all. Thus, the proposed architecture is application-independent, making it viable for general-purpose AI acceleration.
zh

[AI-77] FORT: Forward-Only Regression Training of Normalizing Flows

【速读】:该论文旨在解决生成模型在连续空间中训练时,尽管具备可扩展性,但在生成高质量样本及其对应模型似然时需要昂贵的数值模拟问题,这限制了其在如分子系统平衡采样等科学应用中的使用。解决方案的关键在于重新审视经典归一化流(normalizing flows)作为具有精确似然的一步生成模型,并提出一种无需计算传统最大似然训练中昂贵变量变换公式的可扩展训练目标——前向仅回归训练(Forward-Only Regression Training, FORT)。FORT通过简单的ℓ₂回归目标将先验样本映射到特定选择的目标上,从而实现了高效且稳定的训练。

链接: https://arxiv.org/abs/2506.01158
作者: Danyal Rehman,Oscar Davis,Jiarui Lu,Jian Tang,Michael Bronstein,Yoshua Bengio,Alexander Tong,Avishek Joey Bose
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Preprint

点击查看摘要

Abstract:Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to neural dynamical systems that encompass modern large-scale diffusion and flow matching models. Despite the scalability of training, the generation of high-quality samples and their corresponding likelihood under the model requires expensive numerical simulation – inhibiting adoption in numerous scientific applications such as equilibrium sampling of molecular systems. In this paper, we revisit classical normalizing flows as one-step generative models with exact likelihoods and propose a novel, scalable training objective that does not require computing the expensive change of variable formula used in conventional maximum likelihood training. We propose Forward-Only Regression Training (FORT), a simple \ell_2 -regression objective that maps prior samples under our flow to specifically chosen targets. We demonstrate that FORT supports a wide class of targets, such as optimal transport targets and targets from pre-trained continuous-time normalizing flows (CNF). We further demonstrate that by using CNF targets, our one-step flows allow for larger-scale training that exceeds the performance and stability of maximum likelihood training, while unlocking a broader class of architectures that were previously challenging to train. Empirically, we elucidate that our trained flows can perform equilibrium conformation sampling in Cartesian coordinates of alanine dipeptide, alanine tripeptide, and alanine tetrapeptide.
zh

[AI-78] Neuro-Symbolic Generative Diffusion Models for Physically Grounded Robust and Safe Generation

【速读】:该论文试图解决扩散模型在安全关键或科学严谨应用中难以满足严格的物理、结构和操作约束的问题。解决方案的关键在于提出神经符号扩散(Neuro-Symbolic Diffusion, NSD)框架,该框架将扩散步骤与符号优化相结合,从而在用户定义的功能性和逻辑约束下生成可验证一致的样本。这一特性同时适用于标准和离散扩散模型,首次实现了连续(如图像和轨迹)和离散(如分子结构和自然语言)输出的约束合规生成。

链接: https://arxiv.org/abs/2506.01121
作者: Jacob K. Christopher,Michael Cardei,Jinhao Liang,Ferdinando Fioretto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at the 2nd International Conference on Neuro-symbolic Systems (NeuS 2025)

点击查看摘要

Abstract:Despite the remarkable generative capabilities of diffusion models, their integration into safety-critical or scientifically rigorous applications remains hindered by the need to ensure compliance with stringent physical, structural, and operational constraints. To address this challenge, this paper introduces Neuro-Symbolic Diffusion (NSD), a novel framework that interleaves diffusion steps with symbolic optimization, enabling the generation of certifiably consistent samples under user-defined functional and logic constraints. This key feature is provided for both standard and discrete diffusion models, enabling, for the first time, the generation of both continuous (e.g., images and trajectories) and discrete (e.g., molecular structures and natural language) outputs that comply with constraints. This ability is demonstrated on tasks spanning three key challenges: (1) Safety, in the context of non-toxic molecular generation and collision-free trajectory optimization; (2) Data scarcity, in domains such as drug discovery and materials engineering; and (3) Out-of-domain generalization, where enforcing symbolic constraints allows adaptation beyond the training distribution.
zh

[AI-79] ChemAU: Harness the Reasoning of LLM s in Chemical Research with Adaptive Uncertainty Estimation

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理化学相关问题时表现不佳的问题,尤其是由于化学问题涉及复杂的推理步骤、专业术语和符号系统,导致通用LLMs在推理过程中容易产生幻觉。论文提出的解决方案关键在于设计了一种新颖的框架ChemAU,其核心是自适应不确定性估计方法,该方法根据推理步骤在整体推理链中的位置分配不同的不确定性值,从而精准识别化学知识缺口,并通过专用领域模型补充化学专业知识,修正并更新原有的错误推理链。

链接: https://arxiv.org/abs/2506.01116
作者: Xinyi Liu,Lipeng Ma,Yixuan Li,Weidong Yang,Qingyuan Zhou,Jiayi Song,Shuhao Li,Ben Fei
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used across various scenarios due to their exceptional reasoning capabilities and natural language understanding. While LLMs demonstrate strong performance in tasks involving mathematics and coding, their effectiveness diminishes significantly when applied to chemistry-related problems. Chemistry problems typically involve long and complex reasoning steps, which contain specific terminology, including specialized symbol systems and complex nomenclature conventions. These characteristics often cause general LLMs to experience hallucinations during the reasoning process due to their lack of specific knowledge. However, existing methods are struggling to effectively leverage chemical expertise and formulas. Moreover, current uncertainty estimation methods, designed to mitigate potential reasoning errors, are unable to precisely identify specific steps or key knowledge. In this work, we propose a novel framework called ChemAU, which incorporates our adaptive uncertainty estimation method that applies different uncertainty values based on the position of reasoning steps within the whole reasoning chain. Leveraging this method, ChemAU identifies gaps in chemistry knowledge and precisely supplements chemical expertise with the specialized domain model, thereby correcting and updating the previously flawed reasoning chain. Our experiments with three popular LLMs across three chemistry datasets demonstrate that ChemAU significantly enhances both reasoning accuracy and uncertainty estimation.
zh

[AI-80] Reconsidering LLM Uncertainty Estimation Methods in the Wild ACL2025

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)不确定性估计(Uncertainty Estimation, UE)方法在实际部署中的挑战问题。现有研究多在孤立的短文本问答场景中评估UE方法,而实际应用中面临决策阈值选择敏感性、查询变换鲁棒性、长文本生成适用性以及多UE分数处理策略等关键问题。论文通过系统评估19种UE方法,揭示了其在分布偏移下的阈值敏感性、对对抗提示的脆弱性,并提出在测试阶段集成多个UE分数以提升性能的策略,这成为提升UE方法实用性的关键解决方案。

链接: https://arxiv.org/abs/2506.01114
作者: Yavuz Bakman,Duygu Nur Yaldiz,Sungmin Kang,Tuo Zhang,Baturalp Buyukates,Salman Avestimehr,Sai Praneeth Karimireddy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025

点击查看摘要

Abstract:Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: this https URL.
zh

[AI-81] FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

【速读】:该论文旨在解决当前自动化音频描述生成方法在细节丰富性和上下文准确性方面的不足,这些问题主要源于其对有限单模态或浅层多模态信息的依赖。解决方案的关键在于引入一种基于人类听觉感知机制的两阶段自动化流程,首先利用预训练模型提取多样化的上下文线索(如语音、音乐、通用声音及关联视频的视觉信息),随后通过大型语言模型(LLM)融合这些丰富的多模态输入,生成详细且具有上下文意识的音频描述。

链接: https://arxiv.org/abs/2506.01111
作者: Shunian Chen,Xinyuan Xie,Zheshu Chen,Liyan Zhao,Owen Lee,Zhan Su,Qilin Sun,Benyou Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in this https URL.
zh

[AI-82] Speeding Up Hyper-Heuristics With Markov-Chain Operator Selection and the Only-Worsening Acceptance Operator IJCAI2025

【速读】:该论文旨在解决启发式算法在逃离局部最优解时效率低下的问题,特别是针对经典函数类如Cliff_d和Jump_m的优化问题。其解决方案的关键在于对移动接受超启发式算法进行两项改进:一是通过简单的两状态马尔可夫链替代随机选择策略,从而显著降低Jump_m函数的运行时间;二是引入仅接受恶化移动的算子,这一反直觉的策略被证明能有效帮助算法逃离局部最优解,从而在SEQOPT_k基准类上实现优异的运行时间复杂度O(n^{k+1} log n)。

链接: https://arxiv.org/abs/2506.01107
作者: Abderrahim Bendahi,Benjamin Doerr,Adrien Fradin,Johannes F. Lutzeyer
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: Accepted at IJCAI 2025

点击查看摘要

Abstract:The move-acceptance hyper-heuristic was recently shown to be able to leave local optima with astonishing efficiency (Lissovoi et al., Artificial Intelligence (2023)). In this work, we propose two modifications to this algorithm that demonstrate impressive performances on a large class of benchmarks including the classic Cliff _d and Jump _m function classes. (i) Instead of randomly choosing between the only-improving and any-move acceptance operator, we take this choice via a simple two-state Markov chain. This modification alone reduces the runtime on Jump _m functions with gap parameter m from \Omega(n^2m-1) to O(n^m+1) . (ii) We then replace the all-moves acceptance operator with the operator that only accepts worsenings. Such a, counter-intuitive, operator has not been used before in the literature. However, our proofs show that our only-worsening operator can greatly help in leaving local optima, reducing, e.g., the runtime on Jump functions to O(n^3 \log n) independent of the gap size. In general, we prove a remarkably good runtime of O(n^k+1 \log n) for our Markov move-acceptance hyper-heuristic on all members of a new benchmark class SEQOPT _k , which contains a large number of functions having k successive local optima, and which contains the commonly studied Jump _m and Cliff _d functions for k=2 .
zh

[AI-83] SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

【速读】:该论文旨在解决在稀疏奖励环境下,传统基于策略的强化学习方法因难以采样到成功轨迹而导致的学习效率低下问题。其关键解决方案是提出SuperRL框架,该框架通过引入自适应切换机制(Adaptive Switch)检测稀疏奖励条件,并在必要时激活混合策略器(Hybrid Actor)。混合策略器在损失层面整合了策略梯度与监督学习目标,使模型能够利用高质量的离线推理信号,同时保持强化学习的探索能力。

链接: https://arxiv.org/abs/2506.01096
作者: Yihao Liu,Shuocheng Li,Lang Cao,Yuhang Xie,Mengyu Zhou,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. To address this limitation, we propose SuperRL, a unified training framework that adaptively incorporates offline supervision into reinforcement learning. SuperRL introduces an Adaptive Switch to detect sparse reward conditions and activates a Hybrid Actor when necessary. The Hybrid Actor integrates policy gradient and supervised learning objectives at the loss level, enabling the model to benefit from accurate offline reasoning signals while maintaining the exploratory capacity of reinforcement learning. Experiments on a range of reasoning benchmarks show that SuperRL consistently outperforms standard reinforcement learning by improving sample efficiency, generalization, and robustness under sparse rewards.
zh

[AI-84] Modular Speaker Architecture: A Framework for Sustaining Responsibility and Contextual Integrity in Multi-Agent AI Communication

【速读】:该论文试图解决多智能体系统中维持一致且角色感知的通信这一基础性挑战,当前框架通常缺乏明确的说话者责任机制,导致上下文漂移、对齐不稳定和可解释性下降。解决方案的关键在于提出模块化说话者架构(Modular Speaker Architecture, MSA),该架构将说话者行为分解为模块化组件,以实现角色追踪、责任连续性和情境一致性。MSA包含三个核心模块:说话者角色模块、责任链追踪器和情境完整性验证器,通过结构化度量(如语用一致性、责任流和情境稳定性)进行评估,证明其能够在不依赖情感信号或表层启发式方法的情况下可靠地维持交互结构。

链接: https://arxiv.org/abs/2506.01095
作者: Khe-Han Toh,Hong-Kuan Teo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sustaining coherent, role-aware communication across multi-agent systems remains a foundational challenge in AI. Current frameworks often lack explicit mechanisms for speaker responsibility, leading to context drift, alignment instability, and degraded interpretability over time. We propose the Modular Speaker Architecture (MSA), a framework that decomposes speaker behavior into modular components for role tracking, responsibility continuity, and contextual coherence. Grounded in high-context human-AI dialogues, MSA includes three core modules: a Speaker Role Module, a Responsibility Chain Tracker, and a Contextual Integrity Validator. We evaluate MSA through annotated case studies and introduce structural metrics-pragmatic consistency, responsibility flow, and context stability-quantified via manual and automatic scoring and bootstrapped statistical analysis. Our results show that MSA reliably maintains interaction structure without reliance on affective signals or surface-level heuristics. We further implement a prototype configuration language (G-Code) and modular API to support MSA deployment in dynamic multi-agent scenarios.
zh

[AI-85] Regulatory Graphs and GenAI for Real-Time Transaction Monitoring and Compliance Explanation in Banking

【速读】:该论文试图解决高风险金融环境中自动化金融合规的问题,特别是如何实现可解释且可审计的交易监控。解决方案的关键在于结合图智能(graph intelligence)与生成式模型,通过构建动态交易图、提取结构和上下文特征,并利用图神经网络分类可疑行为,同时采用检索增强生成模块生成符合监管条款的自然语言解释,从而提升系统的可解释性和合规性。

链接: https://arxiv.org/abs/2506.01093
作者: Kunal Khanvilkar,Kranthi Kommuru
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents a real-time transaction monitoring framework that integrates graph-based modeling, narrative field embedding, and generative explanation to support automated financial compliance. The system constructs dynamic transaction graphs, extracts structural and contextual features, and classifies suspicious behavior using a graph neural network. A retrieval-augmented generation module generates natural language explanations aligned with regulatory clauses for each flagged transaction. Experiments conducted on a simulated stream of financial data show that the proposed method achieves superior results, with 98.2% F1-score, 97.8% precision, and 97.0% recall. Expert evaluation further confirms the quality and interpretability of generated justifications. The findings demonstrate the potential of combining graph intelligence and generative models to support explainable, audit-ready compliance in high-risk financial environments.
zh

[AI-86] Choices and their Provenance: Explaining Stable Solutions of Abstract Argumentation Frameworks SIGMOD

【速读】:该论文旨在解决如何将抽象论证框架(AF)中基于稳定解(stable solution)的论证溯源扩展到与基于良好基础语义(well-founded semantics, WFS)的论证溯源相兼容的问题。其解决方案的关键在于识别稳定模型中的关键攻击集,从而揭示稳定解中所作的选择和假设,并将这些选择步骤与良好基础推导步骤相结合,提供对论证状态更深入的解释。该方法可视为一种诊断机制,通过找到最小的“修复”以调整AF图,使得修复后的图的良好基础解与原始AF图的稳定解一致。

链接: https://arxiv.org/abs/2506.01087
作者: Bertram Ludäscher,Yilin Xia,Shawn Bowers
机构: 未知
类目: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: International Workshop on the Theory and Practice of Provenance (TaPP) and ProvenanceWeek’25 @SIGMOD, June 27, 2025. Berlin, Germany

点击查看摘要

Abstract:The rule \mathrmDefeated(x) \leftarrow \mathrmAttacks(y,x),, \neg , \mathrmDefeated(y) , evaluated under the well-founded semantics (WFS), yields a unique 3-valued (skeptical) solution of an abstract argumentation framework (AF). An argument x is defeated ( \mathrmOUT ) if there exists an undefeated argument y that attacks it. For 2-valued (stable) solutions, this is the case iff y is accepted ( \mathrmIN ), i.e., if all of y 's attackers are defeated. Under WFS, arguments that are neither accepted nor defeated are undecided ( \mathrmUNDEC ). As shown in prior work, well-founded solutions (a.k.a. grounded labelings) “explain themselves”: The provenance of arguments is given by subgraphs (definable via regular path queries) rooted at the node of interest. This provenance is closely related to winning strategies of a two-player argumentation game. We present a novel approach for extending this provenance to stable AF solutions. Unlike grounded solutions, which can be constructed via a bottom-up alternating fixpoint procedure, stable models often involve non-deterministic choice as part of the search for models. Thus, the provenance of stable solutions is of a different nature, and reflects a more expressive generate test paradigm. Our approach identifies minimal sets of critical attacks, pinpointing choices and assumptions made by a stable model. These critical attack edges provide additional insights into the provenance of an argument’s status, combining well-founded derivation steps with choice steps. Our approach can be understood as a form of diagnosis that finds minimal “repairs” to an AF graph such that the well-founded solution of the repaired graph coincides with the desired stable model of the original AF graph. Comments: International Workshop on the Theory and Practice of Provenance (TaPP) and ProvenanceWeek’25 @SIGMOD, June 27, 2025. Berlin, Germany Subjects: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC) Cite as: arXiv:2506.01087 [cs.AI] (or arXiv:2506.01087v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2506.01087 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-87] he Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process NEURIPS2025

【速读】:该论文试图解决多智能体系统(Multi-Agent Systems, MAS)中人工智能对齐(AI Alignment)的问题,特别是在动态和社会环境依赖性背景下,如何确保智能体的行为与人类价值观和用户偏好保持一致。论文指出,随着MAS在现实应用中的普及,智能体之间的互动变得更加复杂,这种复杂性可能无意中导致部分或全部智能体偏离人类价值。解决方案的关键在于将人类、偏好和目标对齐视为相互依赖的概念,而非孤立问题,并强调需要构建模拟环境、基准测试和评估框架,以在动态多智能体情境下评估对齐效果,从而在系统复杂性失控之前进行有效控制。

链接: https://arxiv.org/abs/2506.01080
作者: Florian Carichon,Aditi Khandelwal,Marylou Fauchard,Golnoosh Farnadi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Preprint of NeurIPS 2025 Position Paper

点击查看摘要

Abstract:This position paper states that AI Alignment in Multi-Agent Systems (MAS) should be considered a dynamic and interaction-dependent process that heavily depends on the social environment where agents are deployed, either collaborative, cooperative, or competitive. While AI alignment with human values and preferences remains a core challenge, the growing prevalence of MAS in real-world applications introduces a new dynamic that reshapes how agents pursue goals and interact to accomplish various tasks. As agents engage with one another, they must coordinate to accomplish both individual and collective goals. However, this complex social organization may unintentionally misalign some or all of these agents with human values or user preferences. Drawing on social sciences, we analyze how social structure can deter or shatter group and individual values. Based on these analyses, we call on the AI community to treat human, preferential, and objective alignment as an interdependent concept, rather than isolated problems. Finally, we emphasize the urgent need for simulation environments, benchmarks, and evaluation frameworks that allow researchers to assess alignment in these interactive multi-agent contexts before such dynamics grow too complex to control.
zh

[AI-88] Unfolding Boxes with Local Constraints

【速读】:该论文试图解决寻找并枚举可以折叠成多个非同构盒子的多米诺骨牌(polyomino)的问题。现有方法如SAT求解、随机算法和决策图在大规模计算上存在局限,主要原因是全局约束(如图连通性或无环性)难以有效编码且难以被求解器处理。该研究提出了一种新的基于SAT的方法,其关键在于用具有更好传播特性的局部约束替代这些全局约束,从而显著提升了计算和枚举常见盒子展开结构的可扩展性。

链接: https://arxiv.org/abs/2506.01079
作者: Long Qian,Eric Wang,Bernardo Subercaseaux,Marijn J. H. Heule
机构: 未知
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI)
备注: Accepted at CADE30 (2025). 17 figures. Code at this https URL

点击查看摘要

Abstract:We consider the problem of finding and enumerating polyominos that can be folded into multiple non-isomorphic boxes. While several computational approaches have been proposed, including SAT, randomized algorithms, and decision diagrams, none has been able to perform at scale. We argue that existing SAT encodings are hindered by the presence of global constraints (e.g., graph connectivity or acyclicity), which are generally hard to encode effectively and hard for solvers to reason about. In this work, we propose a new SAT-based approach that replaces these global constraints with simple local constraints that have substantially better propagation properties. Our approach dramatically improves the scalability of both computing and enumerating common box unfoldings: (i) while previous approaches could only find common unfoldings of two boxes up to area 88, ours easily scales beyond 150, and (ii) while previous approaches were only able to enumerate common unfoldings up to area 30, ours scales up to 60. This allows us to rule out 46, 54, and 58 as the smallest areas allowing a common unfolding of three boxes, thereby refuting a conjecture of Xu et al. (2017).
zh

[AI-89] rilevel Memetic Algorithm for the Electric Vehicle Routing Problem

【速读】:该论文试图解决电动车辆路径问题(Electric Vehicle Routing Problem, EVRP),该问题在容量约束车辆路径问题的基础上引入了电池约束和充电站因素,带来了显著的优化挑战。论文提出的解决方案是三层次遗传算法(Trilevel Memetic Algorithm, TMA),其关键在于通过分层优化客户顺序、路线分配和充电站插入,并结合遗传算法与动态规划,以实现高效且高质量的求解。

链接: https://arxiv.org/abs/2506.01065
作者: Ivan Milinović,Leon Stjepan Uroić,Marko Đurasević
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Electric Vehicle Routing Problem (EVRP) extends the capacitated vehicle routing problem by incorporating battery constraints and charging stations, posing significant optimization challenges. This paper introduces a Trilevel Memetic Algorithm (TMA) that hierarchically optimizes customer sequences, route assignments, and charging station insertions. The method combines genetic algorithms with dynamic programming, ensuring efficient and high-quality solutions. Benchmark tests on WCCI2020 instances show competitive performance, matching best-known results for small-scale cases. While computational demands limit scalability, TMA demonstrates strong potential for sustainable logistics planning.
zh

[AI-90] XAI-Units: Benchmarking Explainability Methods with Unit Tests

【速读】:该论文试图解决不同特征归因(Feature Attribution, FA)方法在相同模型上给出不一致的重要性评分的问题,这使得在缺乏真实标签或对模型内部机制不了解的情况下,难以判断哪种FA方法在不同情境下生成更合适的解释。解决方案的关键在于引入开源的XAI-Units基准,该基准专门设计用于评估FA方法在多种模型行为(如特征交互、抵消和不连续输出)上的表现,通过提供具有已知内部机制的配对数据集和模型,建立对理想归因分数的明确预期,并结合内置评估指标,系统化地揭示FA方法在不同原子类型模型推理中的性能,从而为FA方法的客观可靠比较奠定基础。

链接: https://arxiv.org/abs/2506.01059
作者: Jun Rui Lee,Sadegh Emami,Michael David Hollins,Timothy C. H. Wong,Carlos Ignacio Villalobos Sánchez,Francesca Toni,Dekai Zhang,Adam Dejl
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at FAccT 2025

点击查看摘要

Abstract:Feature attribution (FA) methods are widely used in explainable AI (XAI) to help users understand how the inputs of a machine learning model contribute to its outputs. However, different FA models often provide disagreeing importance scores for the same model. In the absence of ground truth or in-depth knowledge about the inner workings of the model, it is often difficult to meaningfully determine which of the different FA methods produce more suitable explanations in different contexts. As a step towards addressing this issue, we introduce the open-source XAI-Units benchmark, specifically designed to evaluate FA methods against diverse types of model behaviours, such as feature interactions, cancellations, and discontinuous outputs. Our benchmark provides a set of paired datasets and models with known internal mechanisms, establishing clear expectations for desirable attribution scores. Accompanied by a suite of built-in evaluation metrics, XAI-Units streamlines systematic experimentation and reveals how FA methods perform against distinct, atomic kinds of model reasoning, similar to unit tests in software engineering. Crucially, by using procedurally generated models tied to synthetic datasets, we pave the way towards an objective and reliable comparison of FA methods.
zh

[AI-91] MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在调用外部工具时面临的上下文开销大和工具模式注入成本高的问题。其解决方案的关键在于提出MCP-Zero框架,该框架通过三个核心组件实现:主动工具请求(Proactive Tool Request),用于模型自主决定所需服务器和任务;分层向量路由(Hierarchical Vector Routing),通过语义相似性进行粗粒度到细粒度的工具检索;以及迭代主动调用(Iterative Proactive Invocation),支持多轮跨领域工具链构建并减少上下文负担。

链接: https://arxiv.org/abs/2506.01056
作者: Xiang Fei,Xiawu Zheng,Hao Feng
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Function-calling has enabled large language models (LLMs) to act as tool-using agents, but injecting thousands of tool schemas into the prompt is costly and error-prone. We introduce MCP-Zero, a proactive agent framework that lets the LLM itself decide when and which external tools to retrieve, thereby assembling a task-specific toolchain from scratch. The framework is built upon three components: (1) Proactive Tool Request, where the model emits a structured \left\operatornametool_assistant\right block that explicitly specifies the desired server and task; (2) Hierarchical Vector Routing, a coarse-to-fine retrieval algorithm that first selects candidate servers and then ranks tools within each server based on the semantic similarity; (3) Iterative Proactive Invocation, enabling multi-round, cross-domain toolchain construction with minimal context overhead, and allowing the model to iteratively revise its request when the returned tools are insufficient. To evaluate our approach we also compile MCP-tools, a retrieval dataset comprising 308 MCP servers and 2,797 tools extracted from the official Model-Context-Protocol repository and normalized into a unified JSON schema. Experiments show that MCP-Zero (i) effectively addresses the context overhead problem of existing methods and accurately selects the correct tool from a pool of nearly 3,000 candidates (248.1k tokens); (ii) reduces token consumption by 98% on the APIbank while maintaining high accuracy; and (iii) supports multi-turn tool invocation with consistent accuracy across rounds. The code and dataset will be released soon.
zh

[AI-92] aming LLM s by Scaling Learning Rates with Gradient Grouping ACL’2025

【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)训练中由于模型规模庞大和架构异质性所带来的挑战,特别是自适应优化器在参数级学习率估计上的效率和有效性不足问题,导致训练不稳定、收敛缓慢以及与参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术的兼容性差。解决方案的关键在于提出一种名为梯度分组缩放(Scaling with Gradient Grouping, SGG)的优化器封装方法,通过动态分组和组内特定缩放来改进自适应学习率估计,具体而言是将每一层的梯度统计信息聚类,并对每个聚类应用特定的缩放以校准参数的学习率,从而在保持逐参数适应性的同时施加群体级别的约束。

链接: https://arxiv.org/abs/2506.01049
作者: Siyuan Li,Juanxi Tian,Zedong Wang,Xin Jin,Zicheng Liu,Wentao Zhang,Dan Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint version of “Taming LLMs with Gradient Grouping” (ACL’2025). The code will be available at this https URL

点击查看摘要

Abstract:Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers, and offers consistent gains and faster convergence over baselines, with various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.
zh

[AI-93] IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory ACL2025

【速读】:该论文试图解决在众多大型语言模型(Large Language Models, LLMs)中选择最优模型以响应用户查询时面临的性能与成本之间的权衡问题。解决方案的关键在于提出一种名为IRT-Router的多LLM路由框架,该框架受项目反应理论(Item Response Theory, IRT)启发,显式建模LLM能力与用户查询属性之间的关系,从而实现对响应性能的准确预测,并提供可解释的洞察,如LLM能力与查询难度。此外,通过基于语义相似性的在线查询预热技术,进一步提升了IRT-Router的在线泛化能力。

链接: https://arxiv.org/abs/2506.01048
作者: Wei Song,Zhenya Huang,Cheng Cheng,Weibo Gao,Bihan Xu,GuanHao Zhao,Fei Wang,Runze Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2025 Main

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language tasks. However, selecting the optimal LLM to respond to a user query often necessitates a delicate balance between performance and cost. While powerful models deliver better results, they come at a high cost, whereas smaller models are more cost-effective but less capable. To address this trade-off, we propose IRT-Router, a multi-LLM routing framework that efficiently routes user queries to the most suitable LLM. Inspired by Item Response Theory (IRT), a psychological measurement methodology, IRT-Router explicitly models the relationship between LLM capabilities and user query attributes. This not only enables accurate prediction of response performance but also provides interpretable insights, such as LLM abilities and query difficulty. Additionally, we design an online query warm-up technique based on semantic similarity, further enhancing the online generalization capability of IRT-Router. Extensive experiments on 20 LLMs and 12 datasets demonstrate that IRT-Router outperforms most baseline methods in terms of effectiveness and interpretability. Its superior performance in cold-start scenarios further confirms the reliability and practicality of IRT-Router in real-world applications. Code is available at this https URL.
zh

[AI-94] A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement INTERSPEECH2025

【速读】:该论文旨在解决单通道语音增强中如何有效利用目标时间-频率(TF)bin及其周围TF bin的信息以提升语音质量的问题。其解决方案的关键在于提出一种层次化深度滤波网络(HDF-Net),该网络通过集成子带处理和深度滤波机制,分别在输入端捕获周围频率bin的信息,并在输出端对目标及周围TF bin进行滤波。此外,通过将深度滤波解耦为时域和频域成分,并引入两阶段框架,降低了滤波系数预测的复杂度,同时提出了TAConv模块以增强卷积特征提取能力,从而在减少资源消耗的同时提升了模型性能。

链接: https://arxiv.org/abs/2506.01023
作者: Shenghui Lu,Hukai Huang,Jinanglong Yao,Kaidi Wang,Qingyang Hong,Lin Li
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figure, accepted by Interspeech 2025

点击查看摘要

Abstract:This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve the model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.
zh

[AI-95] Higher-Order Responsibility

【速读】:该论文试图解决在群体决策情境中由于传统个体责任定义(基于Frankfurt的可选性原则)导致的责任缺失问题,即“责任鸿沟”(responsibility gap)。论文提出的一种解决方案是“高阶责任”(higher-order responsibility),其关键在于确定高阶责任至程度d是否足以填补责任鸿沟,研究的核心技术结果表明该问题属于Π₂d+1-完全问题。

链接: https://arxiv.org/abs/2506.01003
作者: Junli Jiang,Pavel Naumov
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:In ethics, individual responsibility is often defined through Frankfurt’s principle of alternative possibilities. This definition is not adequate in a group decision-making setting because it often results in the lack of a responsible party or "responsibility gap’‘. One of the existing approaches to address this problem is to consider group responsibility. Another, recently proposed, approach is "higher-order’’ responsibility. The paper considers the problem of deciding if higher-order responsibility up to degree d is enough to close the responsibility gap. The main technical result is that this problem is \Pi_2d+1 -complete.
zh

[AI-96] Boosting Bot Detection via Heterophily-Aware Representation Learning and Prototype-Guided Cluster Discovery KDD2025

【速读】:该论文旨在解决社交网络中基于图的虚假账号(bot)检测方法在实际应用中面临的标签依赖性高和跨不同社区泛化能力差的问题。其解决方案的关键在于提出BotHP框架,该框架通过异质性感知的表征学习和原型引导的聚类发现机制,实现对图结构中同质性和异质性的同时建模,并捕捉虚假账号群体的潜在全局一致性,从而提升检测性能、降低对标签的依赖并增强模型的泛化能力。

链接: https://arxiv.org/abs/2506.00989
作者: Buyun He,Xiaorui Jiang,Qi Wu,Hao Liu,Yingguang Yang,Yong Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: KDD 2025

点击查看摘要

Abstract:Detecting social media bots is essential for maintaining the security and trustworthiness of social networks. While contemporary graph-based detection methods demonstrate promising results, their practical application is limited by label reliance and poor generalization capability across diverse communities. Generative Graph Self-Supervised Learning (GSL) presents a promising paradigm to overcome these limitations, yet existing approaches predominantly follow the homophily assumption and fail to capture the global patterns in the graph, which potentially diminishes their effectiveness when facing the challenges of interaction camouflage and distributed deployment in bot detection scenarios. To this end, we propose BotHP, a generative GSL framework tailored to boost graph-based bot detectors through heterophily-aware representation learning and prototype-guided cluster discovery. Specifically, BotHP leverages a dual-encoder architecture, consisting of a graph-aware encoder to capture node commonality and a graph-agnostic encoder to preserve node uniqueness. This enables the simultaneous modeling of both homophily and heterophily, effectively countering the interaction camouflage issue. Additionally, BotHP incorporates a prototype-guided cluster discovery pretext task to model the latent global consistency of bot clusters and identify spatially dispersed yet semantically aligned bot collectives. Extensive experiments on two real-world bot detection benchmarks demonstrate that BotHP consistently boosts graph-based bot detectors, improving detection performance, alleviating label reliance, and enhancing generalization capability.
zh

[AI-97] Data Heterogeneity Modeling for Trustworthy Machine Learning KDD’25

【速读】:该论文试图解决机器学习系统中数据异质性(data heterogeneity)对模型性能的影响问题,传统算法因仅优化平均性能而忽视数据集内的内在多样性,导致决策不可靠、跨领域泛化能力不足、结果不公平以及科学推断错误等问题。解决方案的关键在于采用一种面向异质性的机器学习范式,该范式在机器学习全流程中系统性地整合数据异质性考量,包括数据收集、模型训练、评估与部署,从而提升模型的鲁棒性、公平性和可靠性。

链接: https://arxiv.org/abs/2506.00969
作者: Jiashuo Liu,Peng Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Survey paper for tutorial “Data Heterogeneity Modeling for Trustworthy Machine Learning” in KDD’25

点击查看摘要

Abstract:Data heterogeneity plays a pivotal role in determining the performance of machine learning (ML) systems. Traditional algorithms, which are typically designed to optimize average performance, often overlook the intrinsic diversity within datasets. This oversight can lead to a myriad of issues, including unreliable decision-making, inadequate generalization across different domains, unfair outcomes, and false scientific inferences. Hence, a nuanced approach to modeling data heterogeneity is essential for the development of dependable, data-driven systems. In this survey paper, we present a thorough exploration of heterogeneity-aware machine learning, a paradigm that systematically integrates considerations of data heterogeneity throughout the entire ML pipeline – from data collection and model training to model evaluation and deployment. By applying this approach to a variety of critical fields, including healthcare, agriculture, finance, and recommendation systems, we demonstrate the substantial benefits and potential of heterogeneity-aware ML. These applications underscore how a deeper understanding of data diversity can enhance model robustness, fairness, and reliability and help model diagnosis and improvements. Moreover, we delve into future directions and provide research opportunities for the whole data mining community, aiming to promote the development of heterogeneity-aware ML.
zh

[AI-98] PolyBERT: Fine-Tuned Poly Encoder BERT-Based Model for Word Sense Disambiguation

【速读】:该论文旨在解决传统词义消歧(Word Sense Disambiguation, WSD)方法在语义表示不平衡和训练冗余两个方面的问题。具体而言,现有方法在特征提取过程中未能平衡token-level(局部)与sequence-level(全局)语义的表示,导致语义表达不足;同时,在训练阶段引入了所有可能的词义,增加了计算成本。其解决方案的关键在于提出一种基于多编码器(poly-encoder)的BERT模型,结合批次对比学习(Batch Contrastive Learning, BCL),通过多头注意力机制融合局部与全局语义以增强语义表示,并利用同一批次中其他目标词的正确词义作为负样本,从而减少冗余训练输入和计算开销。

链接: https://arxiv.org/abs/2506.00968
作者: Linhan Xia,Mingzhan Yang,Guohui Yuan,Shengnan Tao,Yujing Qiu,Guo Yu,Kai Lei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mainstream Word Sense Disambiguation (WSD) approaches have employed BERT to extract semantics from both context and definitions of senses to determine the most suitable sense of a target word, achieving notable performance. However, there are two limitations in these approaches. First, previous studies failed to balance the representation of token-level (local) and sequence-level (global) semantics during feature extraction, leading to insufficient semantic representation and a performance bottleneck. Second, these approaches incorporated all possible senses of each target word during the training phase, leading to unnecessary computational costs. To overcome these limitations, this paper introduces a poly-encoder BERT-based model with batch contrastive learning for WSD, named PolyBERT. Compared with previous WSD methods, PolyBERT has two improvements: (1) A poly-encoder with a multi-head attention mechanism is utilized to fuse token-level (local) and sequence-level (global) semantics, rather than focusing on just one. This approach enriches semantic representation by balancing local and global semantics. (2) To avoid redundant training inputs, Batch Contrastive Learning (BCL) is introduced. BCL utilizes the correct senses of other target words in the same batch as negative samples for the current target word, which reduces training inputs and computational cost. The experimental results demonstrate that PolyBERT outperforms baseline WSD methods such as Huang’s GlossBERT and Blevins’s BEM by 2% in F1-score. In addition, PolyBERT with BCL reduces GPU hours by 37.6% compared with PolyBERT without BCL.
zh

[AI-99] Unlocking Personalized Knowledge in Federated Large Language Model: The Power of Mixture of Experts

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在处理基于Mixture of Experts (MoE)架构的大语言模型(LLMs)时所面临的挑战,即现有FL方法主要针对密集模型设计,无法有效利用MoE模型的稀疏性,导致通信开销和计算成本过高,从而限制了个性化知识共享的潜力。其解决方案的关键在于提出FLEx框架,通过剪枝全局MoE模型以保留每个客户端的一个专家,并采用自适应门控机制将这些个性化专家重新整合到预训练的MoE层中,同时保持原始架构不变,从而实现高效个性化。

链接: https://arxiv.org/abs/2506.00965
作者: Fan Liu,Bikang Pan,Zhongyi Wang,Xi Yao,Xiaoying Tang,Jingya Wang,Ye Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Mixture of Experts (MoE) architecture has emerged as a prominent strategy for scaling large language models (LLMs), effectively leveraging sparse activation and facilitating task-specific personalization. However, current federated learning (FL) approaches are primarily designed for dense models, making them unable to directly exploit the sparsity inherent in MoE architectures. Treating MoE models as dense networks in federated scenarios results in excessive communication overhead and computational costs, undermining the potential for personalized knowledge sharing. To address these challenges, we propose FLEx (Federated LLMs with Personalized Experts), a novel federated learning framework explicitly tailored for MoE-based LLMs. FLEx efficiently personalizes by pruning the global MoE model to keep only one expert per client, and employs an adaptive gating mechanism to reintegrate these personalized experts into the pre-trained MoE layers, ensuring the original backbone architecture remains unchanged. These personalized experts are trained with local data and stored locally on each client, while the shared modules are aggregated globally. Extensive evaluations on diverse instruction-based datasets under non-IID conditions consistently demonstrate that FLEx outperforms existing federated baselines. Our code is available at this https URL.
zh

[AI-100] Legal Compliance Evaluation of Smart Contracts Generated By Large Language Models

【速读】:该论文试图解决智能合约在法律合规性方面的挑战,即如何确保由自然语言法律合同生成的智能合约符合法律要求。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)直接从自然语言法律合同生成合法合规的智能合约,并提出了一套新的度量标准来量化法律合规性,该标准基于将法律合同和智能合约建模为过程并比较其行为。

链接: https://arxiv.org/abs/2506.00943
作者: Chanuka Wijayakoon,Hai Dong,H.M.N. Dilum Bandara,Zahir Tari,Anurag Soin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for publication at IEEE International Conference on Blockchain and Cryptocurrency (ICBC) 2025

点击查看摘要

Abstract:Smart contracts can implement and automate parts of legal contracts, but ensuring their legal compliance remains challenging. Existing approaches such as formal specification, verification, and model-based development require expertise in both legal and software development domains, as well as extensive manual effort. Given the recent advances of Large Language Models (LLMs) in code generation, we investigate their ability to generate legally compliant smart contracts directly from natural language legal contracts, addressing these challenges. We propose a novel suite of metrics to quantify legal compliance based on modeling both legal and smart contracts as processes and comparing their behaviors. We select four LLMs, generate 20 smart contracts based on five legal contracts, and analyze their legal compliance. We find that while all LLMs generate syntactically correct code, there is significant variance in their legal compliance with larger models generally showing higher levels of compliance. We also evaluate the proposed metrics against properties of software metrics, showing they provide fine-grained distinctions, enable nuanced comparisons, and are applicable across domains for code from any source, LLM or developer. Our results suggest that LLMs can assist in generating starter code for legally compliant smart contracts with strict reviews, and the proposed metrics provide a foundation for automated and self-improving development workflows.
zh

[AI-101] Uncertainty-Aware Metabolic Stability Prediction with Dual-View Contrastive Learning ECML-PKDD2025

【速读】:该论文旨在解决分子代谢稳定性(molecular metabolic stability, MS)预测中的两个关键问题:一是由于基于原子中心的消息传递机制导致的分子建模不完整,忽略了键级拓扑特征;二是预测框架缺乏可靠的不确定性量化。其解决方案的关键在于提出一种名为TrustworthyMS的新型对比学习框架,通过分子图拓扑重映射机制同步原子-键相互作用,利用边诱导的特征传播捕捉局部电子效应和全局构象约束;通过对比拓扑-键对齐增强表示鲁棒性;并通过Beta-Binomial不确定性量化实现在认知不确定性下的同时预测与置信度校准。

链接: https://arxiv.org/abs/2506.00936
作者: Peijin Guo,Minghui Li,Hewen Pan,Bowen Chen,Yang Wu,Zikang Guo,Leo Yu Zhang,Shengshan Hu,Shengqing Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: This manuscript has been accepted for publication at ECML-PKDD 2025. The final version will be published in the conference proceedings

点击查看摘要

Abstract:Accurate prediction of molecular metabolic stability (MS) is critical for drug research and development but remains challenging due to the complex interplay of molecular interactions. Despite recent advances in graph neural networks (GNNs) for MS prediction, current approaches face two critical limitations: (1) incomplete molecular modeling due to atom-centric message-passing mechanisms that disregard bond-level topological features, and (2) prediction frameworks that lack reliable uncertainty quantification. To address these challenges, we propose TrustworthyMS, a novel contrastive learning framework designed for uncertainty-aware metabolic stability prediction. First, a molecular graph topology remapping mechanism synchronizes atom-bond interactions through edge-induced feature propagation, capturing both localized electronic effects and global conformational constraints. Second, contrastive topology-bond alignment enforces consistency between molecular topology views and bond patterns via feature alignment, enhancing representation robustness. Third, uncertainty modeling through Beta-Binomial uncertainty quantification enables simultaneous prediction and confidence calibration under epistemic uncertainty. Through extensive experiments, our results demonstrate that TrustworthyMS outperforms current state-of-the-art methods in terms of predictive performance.
zh

[AI-102] General-purpose audio representation learning for real-world sound scenes

【速读】:该论文试图解决当前音频基础模型在真实世界场景中表现受限的问题,特别是由于这些模型通常在干燥、非空间化、单源音频片段上进行训练和测试,导致其生成的音频嵌入缺乏空间感知能力。解决方案的关键在于提出一种新颖的自监督训练方法,即通用型现实音频模型(GRAMs)的训练方法,该方法能够实现自然噪声环境下的鲁棒空间音频表征学习,并适用于任何基于掩码的深度学习模型。通过在两种先进的模型(分别采用Transformer和Mamba架构)上验证,结果表明该方法有效缩小了干音频与自然声音场景在关键任务(如听觉场景分析)上的性能差距,并在声源定位任务中表现出色。

链接: https://arxiv.org/abs/2506.00934
作者: Goksenin Yuksel,Marcel van Gerven,Kiki van der Heijden
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:While audio foundation models perform well on myriad of tasks from sound classification to speech analysis, these models are trained and tested on dry, non-spatial, single-source audio clips. This limits their success in real-world situations and results in spatially unaware audio embeddings. To address these limitations, we propose a novel self-supervised training approach for General-Purpose, Real-world Audio Models (GRAMs). The GRAM training approach enables robust spatial audio representation learning for naturalistic, noisy sound scenes and can be applied to any masking-based deep learning model. We demonstrate the success of our approach by training two state-of-the-art models, one with a transformer and one with a mamba backbone. We assess the quality of the extracted audio representations from GRAMs using the original version of the HEAR benchmark, a newly synthesized, naturalistic version of the HEAR benchmark, and novel sound localization tasks based on HEAR benchmark datasets. The results show that our approach minimizes the performance gap between dry, non-spatial, single-source sound scenes and naturalistic sound scenes for crucial tasks such as auditory scene analysis, outperforming existing state-of-the-art audio foundation models at a fraction of the training steps. Moreover, GRAMs show state-of-the-art performance on sound localization tasks, exceeding even supervised sound localization models. In sum, the proposed approach represents a significant advancement towards robust audio foundation models for real-world applications with state-of-the-art performance on naturalistic sound scenes as well as spatial audio representation learning.
zh

[AI-103] In-the-wild Audio Spatialization with Flexible Text-guided Localization ACL2025

【速读】:该论文旨在解决在复杂多对象用户交互环境中,现有音频空间化方法缺乏灵活和交互控制的问题。其解决方案的关键在于提出一种基于文本引导的音频空间化(Text-guided Audio Spatialization, TAS)框架,该框架通过灵活的文本提示来指导双耳音频的生成,并结合3D空间位置和相对位置提示以及翻转通道音频进行增强,从而实现更准确和具有空间语义一致性的音频生成。

链接: https://arxiv.org/abs/2506.00927
作者: Tianrui Pan,Jie Liu,Zewen Huang,Jie Tang,Gangshan Wu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by ACL 2025 main

点击查看摘要

Abstract:To enhance immersive experiences, binaural audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives. Due to the limited availability of premium and large-scale stereo data, we construct the SpatialTAS dataset, which encompasses 376,000 simulated binaural audio samples to facilitate the training of our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, augmented by flipped-channel audio. It outperforms existing methods on both simulated and real-recorded datasets, demonstrating superior generalization and accuracy. Besides, we develop an assessment model based on Llama-3.1-8B, which evaluates the spatial semantic coherence between our generated binaural audio and text prompts through a spatial reasoning task. Results demonstrate that text prompts provide flexible and interactive control to generate binaural audio with excellent quality and semantic consistency in spatial locations. Dataset is available at \hrefthis https URL
zh

[AI-104] Bridging Subjective and Objective QoE: Operator-Level Aggregation Using LLM -Based Comment Analysis and Network MOS Comparison

【速读】:该论文旨在解决网络运营商侧用户体验质量(QoE)评估的问题,通过整合客观网络建模与从直播平台提取的主观用户感知来实现更全面的评估。解决方案的关键在于构建一个双层框架,其中客观层利用机器学习模型基于网络参数(如丢包率、延迟、抖动和吞吐量)预测用户感知的视频质量;主观层则通过语义过滤和评分流程处理用户评论,使用大语言模型为筛选后的评论分配标量MOS分数,从而实现可扩展且可解释的用户感知分析。

链接: https://arxiv.org/abs/2506.00924
作者: Parsa Hassani Shariat Panahi,Amir Hossein Jalilvand,M. Hasan Najafi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 19 ppages, 13 figures

点击查看摘要

Abstract:This paper introduces a dual-layer framework for network operator-side quality of experience (QoE) assessment that integrates both objective network modeling and subjective user perception extracted from live-streaming platforms. On the objective side, we develop a machine learning model trained on mean opinion scores (MOS) computed via the ITU-T P.1203 reference implementation, allowing accurate prediction of user-perceived video quality using only network parameters such as packet loss, delay, jitter, and throughput without reliance on video content or client-side instrumentation. On the subjective side, we present a semantic filtering and scoring pipeline that processes user comments from live streams to extract performance-related feedback. A large language model is used to assign scalar MOS scores to filtered comments in a deterministic and reproducible manner. To support scalable and interpretable analysis, we con- struct a labeled dataset of 47,894 live-stream comments, of which about 34,000 are identified as QoE-relevant through multi-layer semantic filtering. Each comment is enriched with simulated Internet Service Provider attribution and temporally aligned using synthetic timestamps in 5-min intervals. The resulting dataset enables operator-level aggregation and time-series analysis of user-perceived quality. A delta MOS metric is proposed to measure each Internet service provider’s deviation from platform-wide sentiment, allowing detection of localized degradations even in the absence of direct network telemetry. A controlled outage simulation confirms the framework’s effectiveness in identifying service disruptions through comment-based trends alone. The system provides each operator with its own subjective MOS and the global platform average per interval, enabling real-time interpretation of performance deviations and comparison with objective network-based QoE estimates.
zh

[AI-105] Principled Input-Output-Conditioned Post-Hoc Uncertainty Estimation for Regression Networks

【速读】:该论文试图解决在安全敏感应用中不确定性量化(uncertainty quantification)的重要性与现有神经网络模型通常缺乏该能力之间的矛盾,尤其是在不影响预测性能的情况下。其解决方案的关键在于提出一个理论基础坚实的后处理不确定性估计框架,通过拟合辅助模型来同时利用原始输入和冻结模型输出,从而实现对回归任务中预测不确定性的精确估计。该方法基于最大似然估计和序列参数拟合原则,形式化了一个精确的后处理优化目标,无需在推理阶段进行采样或近似即可恢复高斯参数的标准最大似然估计。

链接: https://arxiv.org/abs/2506.00918
作者: Lennart Bramlage,Cristóbal Curio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Uncertainty quantification is critical in safety-sensitive applications but is often omitted from off-the-shelf neural networks due to adverse effects on predictive performance. Retrofitting uncertainty estimates post-hoc typically requires access to model parameters or gradients, limiting feasibility in practice. We propose a theoretically grounded framework for post-hoc uncertainty estimation in regression tasks by fitting an auxiliary model to both original inputs and frozen model outputs. Drawing from principles of maximum likelihood estimation and sequential parameter fitting, we formalize an exact post-hoc optimization objective that recovers the canonical MLE of Gaussian parameters, without requiring sampling or approximation at inference. While prior work has used model outputs to estimate uncertainty, we explicitly characterize the conditions under which this is valid and demonstrate the extent to which structured outputs can support quasi-epistemic inference. We find that using diverse auxiliary data, such as augmented subsets of the original training data, significantly enhances OOD detection and metric performance. Our hypothesis that frozen model outputs contain generalizable latent information about model error and predictive uncertainty is tested and confirmed. Finally, we ensure that our method maintains proper estimation of input-dependent uncertainty without relying exclusively on base model forecasts. These findings are demonstrated in toy problems and adapted to both UCI and depth regression benchmarks. Code: this https URL.
zh

[AI-106] Conformal Arbitrag e: Risk-Controlled Balancing of Competing Objectives in Language Models

【速读】:该论文试图解决现代语言模型部署中面临的多目标权衡问题,例如有用性与无害性、成本与准确性、奖励与安全性之间的冲突。其解决方案的关键是引入Conformal Arbitrage框架,该框架通过学习一个数据驱动的阈值,在优化主要目标的主模型(Primary model)和更保守的守护者(Guardian)之间进行调解,守护者可以是另一个模型或人类领域专家,其目标与安全约束对齐。该阈值通过共形风险控制进行校准,从而在有限样本和分布无关的情况下保证不利事件(如事实错误或安全违规)的长期频率不超过用户指定的配额。

链接: https://arxiv.org/abs/2506.00911
作者: William Overman,Mohsen Bayati
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern language model deployments must often balance competing objectives, for example, helpfulness versus harmlessness, cost versus accuracy, and reward versus safety. We introduce Conformal Arbitrage, a post hoc framework that learns a data driven threshold to mediate between a Primary model optimized for a primary objective and a more conservative Guardian which could be another model or a human domain expert aligned with a guardrail objective. The threshold is calibrated with conformal risk control, yielding finite sample, distribution free guarantees that the long run frequency of undesirable events, such as factual errors or safety violations, does not exceed a user specified quota. Because Conformal Arbitrage operates wholly at the API level, without requiring access to model logits or updating model weights, it complements weight based alignment techniques and integrates seamlessly with existing cost aware cascades. Empirically, Conformal Arbitrage traces an efficient frontier, allowing users to define an acceptable performance level for one objective while maximizing utility in another. We observe that our method outperforms, in terms of accuracy, cost matched random routing between models. These properties make Conformal Arbitrage a practical, theoretically grounded tool for trustworthy and economical deployment of large language models across a broad range of potentially competing objectives.
zh

[AI-107] PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models

【速读】:该论文试图解决知识蒸馏(Knowledge Distillation, KD)在主动学习(Active Learning, AL)场景下的应用难题,即如何在数据稀缺的情况下有效利用教师模型的知识来训练紧凑的学生模型。传统KD依赖于充足的标注数据,而AL则在标注成本受限的环境下运作,且常缺乏任务特定的教师模型。论文提出的解决方案关键在于引入ActiveKD框架,该框架结合了AL与KD,并利用大视觉语言模型(Vision-Language Models, VLMs)的零样本和少样本能力。其中核心创新是利用VLMs在概率空间中的结构化预测偏差,将其视为教师模型的归纳偏置,从而指导学生模型的学习。为此,论文提出了Probabilistic CoreSet(PCoreSet)选择策略,通过最大化概率空间的覆盖度来选取类别多样化的未标记样本,以提升知识迁移效率。

链接: https://arxiv.org/abs/2506.00910
作者: Seongjae Kang,Dong Bok Lee,Hyungjoon Jang,Dongseop Kim,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 35 pages, 30 figures

点击查看摘要

Abstract:Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by leveraging the knowledge of teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL operates in data-scarce scenarios where task-specific teacher models are often unavailable. In this paper, we introduce ActiveKD, a framework that integrates AL with KD by leveraging the zero- and few-shot capabilities of large vision-language models (VLMs). A key aspect of ActiveKD is the structured prediction bias of VLMs–i.e., their predictions form clusters in the probability space. We regard this structure as an inductive bias of the teacher model, capturing generalizable output patterns beneficial to student learning. To exploit this bias, we propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space. PCoreSet strategically selects categorically diverse unlabeled samples, facilitating more efficient transfer of teacher knowledge under limited annotation budgets. Evaluations on 11 datasets show that PCoreSet consistently outperforms existing selection methods within the ActiveKD framework, advancing research at the intersection of AL and KD.
zh

[AI-108] State-Covering Trajectory Stitching for Diffusion Planners

【速读】:该论文试图解决基于扩散的生成模型在强化学习中长期规划任务中的性能受限问题,尤其是由于训练数据质量和多样性不足导致的泛化能力差和对超出训练分布的任务或更长规划时域的适应性不足。解决方案的关键在于提出一种无需奖励的轨迹增强方法——状态覆盖轨迹拼接(State-Covering Trajectory Stitching, SCoTS),其通过逐步拼接短轨迹片段,系统地生成多样且扩展的轨迹,从而有效覆盖并扩展潜在空间,提升扩散规划器的性能与泛化能力。

链接: https://arxiv.org/abs/2506.00895
作者: Kyowoon Lee,Jaesik Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of training data. This often restricts their generalization to tasks outside their training distribution or longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization capabilities of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, augmented trajectories generated by SCoTS significantly improve the performance of widely used offline goal-conditioned RL algorithms across diverse environments.
zh

[AI-109] oward a Theory of Agents as Tool-Use Decision-Makers

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)作为自主代理(agent)时,其认识论基础(epistemic foundations)的未解问题,包括代理的定义、决策方式以及行为目标的设定。论文的核心解决方案是提出一种统一理论,将内部推理与外部行动视为等效的认识论工具,从而实现代理的内省与交互的系统性协调。关键在于将代理的工具使用决策边界与其知识边界对齐,以减少不必要的工具使用并最大化认识效率,从而将代理设计从单纯的行动执行者转变为以知识驱动的智能系统。

链接: https://arxiv.org/abs/2506.00886
作者: Hongru Wang,Cheng Qian,Manling Li,Jiahao Qiu,Boyang Xue,Mengdi Wang,Heng Ji,Kam-Fai Wong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) evolve into increasingly autonomous agents, fundamental questions about their epistemic foundations remain unresolved: What defines an agent? How should it make decisions? And what objectives should guide its behavior? In this position paper, we argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, what they need to know, and how to acquire that knowledge efficiently. We propose a unified theory that treats internal reasoning and external actions as equivalent epistemic tools, enabling agents to systematically coordinate introspection and interaction. Building on this framework, we advocate for aligning an agent’s tool use decision-making boundary with its knowledge boundary, thereby minimizing unnecessary tool use and maximizing epistemic efficiency. This perspective shifts the design of agents from mere action executors to knowledge-driven intelligence systems, offering a principled path toward building foundation agents capable of adaptive, efficient, and goal-directed behavior.
zh

[AI-110] CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

【速读】:该论文旨在解决多说话人对话生成中存在的话语一致性不足、重叠语音建模困难以及高效连贯对话合成的问题。其解决方案的关键在于提出CoVoMix2,这是一个完全非自回归的框架,通过基于流匹配的生成模型直接从多流转录文本预测梅尔频谱图,从而摆脱对中间标记表示的依赖,并引入转录级说话人解耦、句级对齐和提示级随机掩码策略,以更好地捕捉真实对话动态,实现高质量、高一致性和高效推理的多说话人对话生成。

链接: https://arxiv.org/abs/2506.00885
作者: Leying Zhang,Yao Qian,Xiaofei Wang,Manthan Thakker,Dongmei Wang,Jianwei Yu,Haibin Wu,Yuxuan Hu,Jinyu Li,Yanmin Qian,Sheng Zhao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.
zh

[AI-111] ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models

【速读】:该论文试图解决分子关系学习(Molecular Relational Learning, MRL)与大型语言模型(Large Language Models, LLMs)集成过程中面临的模型空间扩展带来的基准测试难题,特别是缺乏支持灵活分子输入格式和动态架构切换的统一框架。解决方案的关键在于提出ModuLM框架,该框架通过提供丰富的模块化组件(包括多种二维分子图编码器、三维分子构象编码器、交互层以及主流LLM主干),实现高度灵活的模型组装机制,从而支持超过50,000种不同的模型配置,有效减少冗余编码并确保公平的模型比较。

链接: https://arxiv.org/abs/2506.00880
作者: Zhuo Chen,Yizhen Zheng,Huan Yee Koh,Hongxin Xiang,Linjiang Chen,Wenjie Du,Yang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Molecular Relational Learning (MRL) aims to understand interactions between molecular pairs, playing a critical role in advancing biochemical research. With the recent development of large language models (LLMs), a growing number of studies have explored the integration of MRL with LLMs and achieved promising results. However, the increasing availability of diverse LLMs and molecular structure encoders has significantly expanded the model space, presenting major challenges for benchmarking. Currently, there is no LLM framework that supports both flexible molecular input formats and dynamic architectural switching. To address these challenges, reduce redundant coding, and ensure fair model comparison, we propose ModuLM, a framework designed to support flexible LLM-based model construction and diverse molecular representations. ModuLM provides a rich suite of modular components, including 8 types of 2D molecular graph encoders, 11 types of 3D molecular conformation encoders, 7 types of interaction layers, and 7 mainstream LLM backbones. Owing to its highly flexible model assembly mechanism, ModuLM enables the dynamic construction of over 50,000 distinct model configurations. In addition, we provide comprehensive results to demonstrate the effectiveness of ModuLM in supporting LLM-based MRL tasks.
zh

[AI-112] Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning ICML2025

【速读】:该论文旨在解决基于扩散的生成模型在处理长时域、稀疏奖励任务时,由于采样过程中的不准确引导导致的轨迹不可行性问题,从而提高模型在安全关键应用中的可靠性。解决方案的关键在于提出一种无需训练的局部流形近似与投影方法(Local Manifold Approximation and Projection, LoMAP),该方法通过将引导样本投影到从离线数据集中近似得到的低秩子空间中,防止不可行轨迹的生成。

链接: https://arxiv.org/abs/2506.00867
作者: Kyowoon Lee,Jaesik Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025

点击查看摘要

Abstract:Recent advances in diffusion-based generative modeling have demonstrated significant promise in tackling long-horizon, sparse-reward tasks by leveraging offline datasets. While these approaches have achieved promising results, their reliability remains inconsistent due to the inherent stochastic risk of producing infeasible trajectories, limiting their applicability in safety-critical applications. We identify that the primary cause of these failures is inaccurate guidance during the sampling procedure, and demonstrate the existence of manifold deviation by deriving a lower bound on the guidance gap. To address this challenge, we propose Local Manifold Approximation and Projection (LoMAP), a training-free method that projects the guided sample onto a low-rank subspace approximated from offline datasets, preventing infeasible trajectory generation. We validate our approach on standard offline reinforcement learning benchmarks that involve challenging long-horizon planning. Furthermore, we show that, as a standalone module, LoMAP can be incorporated into the hierarchical diffusion planner, providing further performance enhancements.
zh

[AI-113] GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints INTERSPEECH2025

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中的两个关键问题:有效提取模态特异性特征以及在模态异质性导致的分布差异下捕捉跨模态相似性。其解决方案的关键在于提出一种门控交互注意力机制,以自适应地提取模态特异性特征并增强情感信息,同时引入一种模态不变生成器,通过对齐跨模态相似性来学习模态不变表示并约束领域偏移。

链接: https://arxiv.org/abs/2506.00865
作者: Jiajun He,Jinyi Mi,Tomoki Toda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by INTERSPEECH 2025

点击查看摘要

Abstract:Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction. Attention-based fusion methods dominate MER research, achieving strong classification performance. However, two key challenges remain: effectively extracting modality-specific features and capturing cross-modal similarities despite distribution differences caused by modality heterogeneity. To address these, we propose a gated interactive attention mechanism to adaptively extract modality-specific features while enhancing emotional information through pairwise interactions. Additionally, we introduce a modality-invariant generator to learn modality-invariant representations and constrain domain shifts by aligning cross-modal similarities. Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving WA 80.7% and UA 81.3%.
zh

[AI-114] MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book

【速读】:该论文旨在解决通用医疗人工智能(General Medical Artificial Intelligence, GMAI)在临床应用中面临的性能评估与技术指导不足的问题,尤其是针对医疗资源短缺和成本上升等长期存在的挑战。其解决方案的关键在于构建一个系统且全面的多模态基准测试框架——MedBookVQA,该框架基于开放获取的医学教材,通过标准化的自动化流程提取医学图像并将其与对应的医学叙述进行语境对齐,进而生成涵盖多种临床任务的5000个相关问题,并采用多层次标注体系对查询进行分类,从而实现跨医学子领域的细致分析。

链接: https://arxiv.org/abs/2506.00855
作者: Sau Lai Yip,Sunan He,Yuxiang Nie,Shu Pui Chan,Yilin Ye,Sum Ying Lam,Hao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: For data and code, see: this https URL and this https URL

点击查看摘要

Abstract:The accelerating development of general medical artificial intelligence (GMAI), powered by multimodal large language models (MLLMs), offers transformative potential for addressing persistent healthcare challenges, including workforce deficits and escalating costs. The parallel development of systematic evaluation benchmarks emerges as a critical imperative to enable performance assessment and provide technological guidance. Meanwhile, as an invaluable knowledge source, the potential of medical textbooks for benchmark development remains underexploited. Here, we present MedBookVQA, a systematic and comprehensive multimodal benchmark derived from open-access medical textbooks. To curate this benchmark, we propose a standardized pipeline for automated extraction of medical figures while contextually aligning them with corresponding medical narratives. Based on this curated data, we generate 5,000 clinically relevant questions spanning modality recognition, disease classification, anatomical identification, symptom diagnosis, and surgical procedures. A multi-tier annotation system categorizes queries through hierarchical taxonomies encompassing medical imaging modalities (42 categories), body anatomies (125 structures), and clinical specialties (31 departments), enabling nuanced analysis across medical subdomains. We evaluate a wide array of MLLMs, including proprietary, open-sourced, medical, and reasoning models, revealing significant performance disparities across task types and model categories. Our findings highlight critical capability gaps in current GMAI systems while establishing textbook-derived multimodal benchmarks as essential evaluation tools. MedBookVQA establishes textbook-derived benchmarking as a critical paradigm for advancing clinical AI, exposing limitations in GMAI systems while providing anatomically structured performance metrics across specialties.
zh

[AI-115] Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis ICLR2025

【速读】:该论文试图解决扩散模型(Diffusion Models, DMs)和变分自编码器(Variational Autoencoders, VAEs)在泛化性能方面的理论分析不足问题,尤其是缺乏对共享编码器-生成器结构的全面考虑。其解决方案的关键在于利用最新的信息论工具,提出一个统一的理论框架,将编码器和生成器视为随机映射,从而为两者的泛化性能提供保证。该框架不仅能够对VAEs进行更精细的分析,还揭示了DMs在泛化性上的显式权衡,并基于训练数据提供了可计算的泛化界,以优化模型参数选择和提升性能。

链接: https://arxiv.org/abs/2506.00849
作者: Qi Chen,Jierui Zhu,Florian Shkurti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2025 Accepted

点击查看摘要

Abstract:Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, especially lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that provides guarantees for the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator’s generalization, which was previously overlooked; (2) illustrating an explicit trade-off in generalization terms for DMs that depends on the diffusion time T ; and (3) providing computable bounds for DMs based solely on the training data, allowing the selection of the optimal T and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.
zh

[AI-116] Speech Unlearning INTERSPEECH2025

【速读】:该论文试图解决在语音任务中实现机器遗忘(machine unlearning)的问题,即在不进行完整重新训练的情况下,高效且有效地消除特定数据对已训练语音模型的影响。其解决方案的关键在于定义了两种基础的语音遗忘任务:样本遗忘(sample unlearning),用于移除单个数据点(如一段语音记录);类别遗忘(class unlearning),用于移除整个类别(如某位说话人的所有数据),同时保持剩余数据的性能。研究指出,语音数据的高维性、序列性和说话人依赖性使得语音领域的遗忘问题比图像或文本数据更具挑战性。

链接: https://arxiv.org/abs/2506.00848
作者: Jiali Cheng,Hadi Amiri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Interspeech 2025

点击查看摘要

Abstract:We introduce machine unlearning for speech tasks, a novel and underexplored research problem that aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. This has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation. While machine unlearning has been studied in computer vision and natural language processing, its application to speech is largely unexplored due to the high-dimensional, sequential, and speaker-dependent nature of speech data. We define two fundamental speech unlearning tasks: sample unlearning, which removes individual data points (e.g., a voice recording), and class unlearning, which removes an entire category (e.g., all data from a speaker), while preserving performance on the remaining data. Experiments on keyword spotting and speaker identification demonstrate that unlearning speech data is significantly more challenging than unlearning image or text data. We conclude with key future directions in this area, including structured training, robust evaluation, feature-level unlearning, broader applications, scalable methods, and adversarial robustness.
zh

[AI-117] Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models INTERSPEECH2025

【速读】:该论文试图解决文本到语音(Text-to-Speech, TTS)中韵律控制和发音错误纠正的问题,现有方法通常依赖于专用模块或额外训练,限制了其在推理阶段的后处理能力;同时,传统发音错误纠正依赖于音素字素映射字典,在低资源环境下实用性较差。论文提出的解决方案是反事实激活编辑(Counterfactual Activation Editing),其关键在于通过操纵预训练TTS模型的内部表示实现对韵律和发音的后处理控制,从而在不重新训练模型的情况下提升合成质量。

链接: https://arxiv.org/abs/2506.00832
作者: Kyowoon Lee,Artyom Stitsyuk,Gunu Jho,Inchul Hwang,Jaesik Choi
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
zh

[AI-118] A Large Language Model-Supported Threat Modeling Framework for Transportation Cyber-Physical Systems

【速读】:该论文试图解决现代交通系统中由于自动化和连接性增加所带来的网络安全威胁建模问题,现有框架在范围、资源消耗和对安全专家依赖方面存在局限。解决方案的关键在于提出一种基于大语言模型(LLM)的威胁建模框架TraCR-TMF,通过三种不同的LLM方法(检索增强生成、上下文学习和监督微调)减少对安全专家的依赖,同时利用MITRE ATTCK矩阵识别威胁、攻击技术及应对措施,并通过定制化LLM分析漏洞以映射攻击路径,从而提升威胁建模的效率与适应性。

链接: https://arxiv.org/abs/2506.00831
作者: M Sabbir Salek,Mashrur Chowdhury,Muhaimin Bin Munir,Yuchen Cai,Mohammad Imtiaz Hasan,Jean-Michel Tine,Latifur Khan,Mizanur Rahman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern transportation systems rely on cyber-physical systems (CPS), where cyber systems interact seamlessly with physical systems like transportation-related sensors and actuators to enhance safety, mobility, and energy efficiency. However, growing automation and connectivity increase exposure to cyber vulnerabilities. Existing threat modeling frameworks for transportation CPS are often limited in scope, resource-intensive, and dependent on significant cybersecurity expertise. To address these gaps, we present TraCR-TMF (Transportation Cybersecurity and Resiliency Threat Modeling Framework), a large language model (LLM)-based framework that minimizes expert intervention. TraCR-TMF identifies threats, potential attack techniques, and corresponding countermeasures by leveraging the MITRE ATTCK matrix through three LLM-based approaches: (i) a retrieval-augmented generation (RAG) method requiring no expert input, (ii) an in-context learning approach requiring low expert input, and (iii) a supervised fine-tuning method requiring moderate expert input. TraCR-TMF also maps attack paths to critical assets by analyzing vulnerabilities using a customized LLM. The framework was evaluated in two scenarios. First, it identified relevant attack techniques across transportation CPS applications, with 90% precision as validated by experts. Second, using a fine-tuned LLM, it successfully predicted multiple exploitations including lateral movement, data exfiltration, and ransomware-related encryption that occurred during a major real-world cyberattack incident. These results demonstrate TraCR-TMF’s effectiveness in CPS threat modeling, its reduced reliance on cybersecurity expertise, and its adaptability across CPS domains.
zh

[AI-119] SafeGenes: Evaluating the Adversarial Robustness of Genomic Foundation Models

【速读】:该论文试图解决基因组基础模型(Genomic Foundation Models, GFMs)在对抗性攻击下的鲁棒性问题,特别是其在高风险基因组应用(如变异效应预测)中的安全性不足。解决方案的关键在于提出SafeGenes框架,通过结合快速梯度符号法(Fast Gradient Sign Method, FGSM)和软提示攻击,评估GFMs对精心设计的近似相同对抗基因及嵌入空间扰动的敏感性,从而全面揭示模型的对抗脆弱性。

链接: https://arxiv.org/abs/2506.00821
作者: Huixin Zhan,Jason H. Moore
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Genomic Foundation Models (GFMs), such as Evolutionary Scale Modeling (ESM), have demonstrated significant success in variant effect prediction. However, their adversarial robustness remains largely unexplored. To address this gap, we propose SafeGenes: a framework for Secure analysis of genomic foundation models, leveraging adversarial attacks to evaluate robustness against both engineered near-identical adversarial Genes and embedding-space manipulations. In this study, we assess the adversarial vulnerabilities of GFMs using two approaches: the Fast Gradient Sign Method (FGSM) and a soft prompt attack. FGSM introduces minimal perturbations to input sequences, while the soft prompt attack optimizes continuous embeddings to manipulate model predictions without modifying the input tokens. By combining these techniques, SafeGenes provides a comprehensive assessment of GFM susceptibility to adversarial manipulation. Targeted soft prompt attacks led to substantial performance degradation, even in large models such as ESM1b and ESM1v. These findings expose critical vulnerabilities in current foundation models, opening new research directions toward improving their security and robustness in high-stakes genomic applications such as variant effect prediction.
zh

[AI-120] DriveMind: A Dual-VLM based Reinforcement Learning Framework for Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶系统在可解释性、安全性及动态场景适应性方面的不足。其核心问题在于现有系统缺乏形式化安全保证、对动态驾驶环境的适应能力有限以及依赖静态提示和固定目标。解决方案的关键在于提出DriveMind,一个统一的语义奖励框架,通过对比视觉-语言模型(VLM)编码器实现逐步语义锚定,结合新颖性触发的VLM编码器-解码器进行动态提示生成,并集成层次化安全模块与紧凑的预测世界模型,从而提升系统的适应性、安全性和泛化能力。

链接: https://arxiv.org/abs/2506.00819
作者: Dawood Wasif,Terrence J Moore,Chandan K Reddy,Jin-Hee Cho
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision-Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty-triggered VLM encoder-decoder, fine-tuned via chain-of-thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 +/- 2.3 km/h average speed, 0.98 +/- 0.03 route completion, and near-zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift, demonstrating robust cross-domain alignment and potential for real-world deployment.
zh

[AI-121] Unlearning Inversion Attacks for Graph Neural Networks

【速读】:该论文试图解决图神经网络(Graph Neural Network, GNN)在进行图去学习(Graph Unlearning)后,仍可能存在隐私泄露的问题,即攻击者能否通过黑盒访问和部分图知识重构被删除的边。解决方案的关键在于提出TrendAttack方法,其核心包括:一是利用置信度陷阱(confidence pitfall)现象,即与被去学习边相邻的节点在模型置信度上会出现显著下降;二是设计一种自适应预测机制,对被去学习边和其他成员边应用不同的相似性阈值,从而有效识别被删除边的位置。

链接: https://arxiv.org/abs/2506.00808
作者: Jiahao Zhang,Yilong Wang,Zhiwei Zhang,Xiaorui Liu,Suhang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Graph unlearning methods aim to efficiently remove the impact of sensitive data from trained GNNs without full retraining, assuming that deleted information cannot be recovered. In this work, we challenge this assumption by introducing the graph unlearning inversion attack: given only black-box access to an unlearned GNN and partial graph knowledge, can an adversary reconstruct the removed edges? We identify two key challenges: varying probability-similarity thresholds for unlearned versus retained edges, and the difficulty of locating unlearned edge endpoints, and address them with TrendAttack. First, we derive and exploit the confidence pitfall, a theoretical and empirical pattern showing that nodes adjacent to unlearned edges exhibit a large drop in model confidence. Second, we design an adaptive prediction mechanism that applies different similarity thresholds to unlearned and other membership edges. Our framework flexibly integrates existing membership inference techniques and extends them with trend features. Experiments on four real-world datasets demonstrate that TrendAttack significantly outperforms state-of-the-art GNN membership inference baselines, exposing a critical privacy vulnerability in current graph unlearning methods.
zh

[AI-122] Enhancing LLM Reasoning for Time Series Classification by Tailored Thinking and Fused Decision

【速读】:该论文旨在解决将大型语言模型(Large Language Models, LLMs)的推理能力有效应用于时间序列分类(Time Series Classification, TSC)任务中的挑战,尤其是在现有方法在时间序列领域表现有限的情况下。其解决方案的关键在于提出一种名为ReasonTSC的框架,该框架通过多轮推理和融合决策策略,引导LLM深入理解时间序列数据的本质特征,并结合插件分类器的预测结果与置信度分数,进行结构化的推理过程,从而提升分类性能并纠正插件模型的错误预测。

链接: https://arxiv.org/abs/2506.00807
作者: Jiahui Zhou,Dan Li,Lin Li,Zhuomin Chen,Shunyu Wu,Haozheng Ye,Jian Lou,Costas J. Spanos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reasoning capabilities of large language models (LLMs) have significantly advanced their performance by enabling in-depth understanding of diverse tasks. With growing interest in applying LLMs to the time series domain, this has proven nontrivial, as evidenced by the limited efficacy of straightforwardly adapting text-domain reasoning techniques. Although recent work has shown promise in several time series tasks, further leveraging advancements in LLM reasoning remains under-explored for time series classification (TSC) tasks, despite their prevalence and significance in many real-world applications. In this paper, we propose ReasonTSC, a novel framework designed to effectively leverage LLM reasoning for time series classification through both a multi-turn reasoning and a fused decision-making strategy tailored to TSC. Rather than straightforwardly applying existing reasoning techniques or relying solely on LLMs’ built-in reasoning capabilities, ReasonTSC first steers the model to think over the essential characteristics of time series data. Next, it integrates predictions and confidence scores from plug-in classifiers, e.g., domain-specific time series models, as in-context examples. Finally, ReasonTSC guides the LLM through a structured reasoning process: it evaluates the initial assessment, backtracks to consider alternative hypotheses, and compares their merits before arriving at a final classification. Extensive experiments and systematic ablation studies demonstrate that ReasonTSC consistently outperforms both existing time series reasoning baselines and plug-in models, and is even capable of identifying and correcting plug-in models’ false predictions.
zh

[AI-123] Action Dependency Graphs for Globally Optimal Coordinated Reinforcement Learning

【速读】:该论文旨在解决多智能体强化学习(MARL)中由于采用自回归形式的动作依赖策略而导致的计算复杂度高、可扩展性差的问题。其解决方案的关键在于引入“动作依赖图(ADG)”来建模智能体间的动作依赖关系,并证明在协调图结构下,满足特定条件的稀疏ADG可以实现全局最优性。基于此理论基础,作者提出了一种保证全局最优性的表格策略迭代算法,并将其集成到多个最先进算法中进行实验验证,结果表明该方法在复杂环境中的鲁棒性和适用性。

链接: https://arxiv.org/abs/2506.00797
作者: Jianglin Ding,Jingcheng Tang,Gangshan Jing
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Action-dependent individual policies, which incorporate both environmental states and the actions of other agents in decision-making, have emerged as a promising paradigm for achieving global optimality in multi-agent reinforcement learning (MARL). However, the existing literature often adopts auto-regressive action-dependent policies, where each agent’s policy depends on the actions of all preceding agents. This formulation incurs substantial computational complexity as the number of agents increases, thereby limiting scalability. In this work, we consider a more generalized class of action-dependent policies, which do not necessarily follow the auto-regressive form. We propose to use the `action dependency graph (ADG)’ to model the inter-agent action dependencies. Within the context of MARL problems structured by coordination graphs, we prove that an action-dependent policy with a sparse ADG can achieve global optimality, provided the ADG satisfies specific conditions specified by the coordination graph. Building on this theoretical foundation, we develop a tabular policy iteration algorithm with guaranteed global optimality. Furthermore, we integrate our framework into several SOTA algorithms and conduct experiments in complex environments. The empirical results affirm the robustness and applicability of our approach in more general scenarios, underscoring its potential for broader MARL challenges.
zh

[AI-124] Predicting Empirical AI Research Outcomes with Language Models

【速读】:该论文试图解决在人工智能研究中,许多有前景的创新想法难以通过实验验证,而这一过程需要大量的人力和计算资源的问题。其核心挑战在于如何高效预测一个研究想法的成功概率,以加速实证性AI研究,而这一技能通常需要研究人员具备丰富的经验才能掌握。解决方案的关键在于构建了一个首个针对该任务的基准测试,并开发了一个结合微调后的GPT-4.1与论文检索代理的系统,该系统在自然语言处理领域表现优于人类专家,且在未发布的新颖想法上也展现出较高的准确性,表明其在提升想法生成模型中的潜在应用价值。

链接: https://arxiv.org/abs/2506.00794
作者: Jiaxin Wen,Chenglei Si,Yueh-han Chen,He He,Shi Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. Predicting an idea’s chance of success is thus crucial for accelerating empirical AI research, a skill that even expert researchers can only acquire through substantial experience. We build the first benchmark for this task and compare LMs with human experts. Concretely, given two research ideas (e.g., two jailbreaking methods), we aim to predict which will perform better on a set of benchmarks. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model’s cut-off date for testing, and 6,000 pairs for training. We then develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human experts by a large margin (64.4% v.s. 48.9%). On the full test set, our system achieves 77% accuracy, while off-the-shelf frontier LMs like o3 perform no better than random guessing, even with the same retrieval augmentation. We verify that our system does not exploit superficial features like idea complexity through extensive human-written and LM-designed robustness tests. Finally, we evaluate our system on unpublished novel ideas, including ideas generated by an AI ideation agent. Our system achieves 63.6% accuracy, demonstrating its potential as a reward model for improving idea generation models. Altogether, our results outline a promising new direction for LMs to accelerate empirical AI research.
zh

[AI-125] Behavioral Augmentation of UML Class Diagrams: An Empirical Study of Large Language Models for Method Generation

【速读】:该论文试图解决从自然语言用例中自动增强UML类图的行为方法的问题,其核心挑战在于如何将非结构化的文本信息转化为符合UML规范的结构化行为模型。解决方案的关键在于利用九个大型语言模型(LLMs)对21个类、17个关系的无方法UML图进行扩展,并通过21个结构化的废物管理用例生成行为方法。研究评估了多个指标,包括方法数量、签名丰富性、注释完整性、结构保真度、语法正确性以及命名一致性,验证了LLMs在生成结构化方法和保持命名一致性方面的有效性,同时指出了在注释和签名一致性方面仍需改进。

链接: https://arxiv.org/abs/2506.00788
作者: Djaber Rouabhia,Ismail Hadjadj
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automating the enrichment of UML class diagrams with behavioral methods from natural language use cases is a significant challenge. This study evaluates nine large language models (LLMs) in augmenting a methodless UML diagram (21 classes, 17 relationships) using 21 structured waste-management use cases. A total of 90 diagrams (3,373 methods) were assessed across six metrics: method quantity, signature richness (visibility, names, parameters, return types), annotation completeness (linking to use cases/actions), structural fidelity, syntactic correctness (PlantUML compilation), and naming convergence (across models). All LLMs produced valid PlantUML diagrams adhering to UML conventions. Some models excelled in method coverage and annotation accuracy, while others showed richer parameterization but weaker traceability. These results demonstrate that LLMs can generate well-structured methods with consistent naming, advancing automated behavioral modeling. However, inconsistencies in annotations and signatures highlight the need for improved prompt engineering and model selection. The rapid generation of these methods supports Agile practices by enabling faster design iterations. Despite their capabilities, human oversight is essential to ensure accuracy, appropriateness, and semantic alignment. This positions LLMs as collaborative partners in software design. All experimental artifacts (\texttt.puml, \texttt.png, \texttt.csv) are publicly available for reproducibility.
zh

[AI-126] Jailbreak-R1: Exploring the Jailbreak Capabilities of LLM s via Reinforcement Learning

【速读】:该论文试图解决自动化红队(automated red teaming)中攻击提示(attack prompts)的有效性与多样性难以平衡的问题。现有方法在生成攻击提示时往往无法兼顾效果与多样性,从而限制了其在检测大型语言模型(LLMs)安全漏洞中的应用。解决方案的关键在于提出一种基于强化学习的自动化红队训练框架,通过三个阶段的训练——冷启动、预热探索和增强越狱——利用多样性与一致性作为奖励信号,并引入渐进式越狱奖励以逐步提升红队模型的攻击能力,从而实现攻击提示的有效性与多样性的良好平衡。

链接: https://arxiv.org/abs/2506.00782
作者: Weiyang Guo,Zesheng Shi,Zhuo Li,Yequan Wang,Xuebo Liu,Wenya Wang,Fangming Liu,Min Zhang,Jing Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures

点击查看摘要

Abstract:As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose \ourapproach, a novel automated red teaming training framework that utilizes reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red team model is supervised and fine-tuned on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained in jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: Progressive jailbreak rewards are introduced to gradually enhance the jailbreak performance of the red-team model. Extensive experiments on a variety of LLMs show that \ourapproach effectively balances the diversity and effectiveness of jailbreak prompts compared to existing methods. Our work significantly improves the efficiency of red team exploration and provides a new perspective on automated red teaming.
zh

[AI-127] CoP: Agent ic Red-teaming for Large Language Models using Composition of Principles

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)面临的安全对齐问题,特别是通过“越狱攻击”(jailbreak attacks)绕过模型的安全机制,导致生成有害或高风险响应的问题。解决方案的关键在于提出一种基于原则组合(Composition-of-Principles, CoP)框架的智能代理工作流,该框架通过自动化编排人类提供的“红队原则”来生成有效的越狱提示,从而实现红队测试过程的自动化与规模化。CoP框架提供了一个统一且可扩展的结构,以整合和协调人工定义的红队策略,进而自动发现新的红队方法。

链接: https://arxiv.org/abs/2506.00781
作者: Chen Xiong,Pin-Yu Chen,Tsung-Yi Ho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
zh

[AI-128] Do not Abstain! Identify and Solve the Uncertainty

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在面对不确定场景时表现出的过度自信问题,特别是现有解决方案多依赖回避性回答(如“我不知道”),而未能有效识别和处理不确定性以生成更满意的响应。其解决方案的关键在于引入ConfuseBench基准,用于系统评估LLMs在文档稀缺性、能力限制和查询模糊性三种不确定性类型下的表现,并通过生成上下文感知的疑问来突出原始查询中的混淆点,结合基于答案唯一性的不确定性来源判断以及基于策略的训练方法InteractDPO,以提升模型对不确定性的识别与处理能力。

链接: https://arxiv.org/abs/2506.00780
作者: Jingyu Liu,Jingquan Peng,xiaopeng Wu,Xubin Li,Tiezheng Ge,Bo Zheng,Yong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the widespread application of Large Language Models (LLMs) across various domains, they frequently exhibit overconfidence when encountering uncertain scenarios, yet existing solutions primarily rely on evasive responses (e.g., “I don’t know”) overlooks the opportunity of identifying and addressing the uncertainty to generate more satisfactory responses. To systematically investigate and improve LLMs’ ability of recognizing and addressing the source of uncertainty, we introduce \textbfConfuseBench, a benchmark mainly focus on three types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments with ConfuseBench reveal that current LLMs struggle to accurately identify the root cause of uncertainty and solve it. They prefer to attribute uncertainty to query ambiguity while overlooking capability limitations, especially for those weaker models. To tackle this challenge, we first generate context-aware inquiries that highlight the confusing aspect of the original query. Then we judge the source of uncertainty based on the uniqueness of the inquiry’s answer. Further we use an on-policy training method, InteractDPO to generate better inquiries. Experimental results demonstrate the efficacy of our approach.
zh

[AI-129] Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space

【速读】:该论文旨在解决药物分子设计中如何在不依赖标签数据的情况下,通过调整分子的三维结构来优化其关键特性(如形状、药效团或化学性质)的问题。其解决方案的关键在于提出了一种基于共享三维分子潜在空间的零样本分子操作方法,核心是构建了一个名为MolFLAE的变分自编码器(Variational AutoEncoder, VAE),该模型能够学习到与原子数量无关的固定维度、SE(3)-等变潜在空间,并通过贝叶斯流网络(Bayesian Flow Network, BFN)实现分子结构的重建,从而支持无需额外训练即可进行原子数编辑、结构重构及协同潜在空间插值等操作。

链接: https://arxiv.org/abs/2506.00771
作者: Zitao Chen,Yinjun Jia,Zitong Tian,Wei-Ying Ma,Yanyan Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Medicinal chemists often optimize drugs considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, SE(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an SE(3)-equivariant neural network into fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder’s latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.
zh

[AI-130] Beyond Attention: Learning Spatio-Temporal Dynamics with Emergent Interpretable Topologies

【速读】:该论文旨在解决动态图结构中时空预测任务的建模问题,特别是在传统图注意力网络(Graph Attention Networks, GATs)因依赖预定义邻接结构和动态注意力分数而引入归纳偏置与计算开销,从而影响可解释性的问题。其解决方案的关键在于提出InterGAT框架,该框架用一个完全可学习的对称节点交互矩阵替代了掩码注意力机制,从而在不依赖固定图拓扑的情况下捕捉潜在的空间关系,并结合GRU-based时间解码器形成InterGAT-GRU模型,实现了更高的预测精度与训练效率,同时揭示了可解释的结构特征。

链接: https://arxiv.org/abs/2506.00770
作者: Sai Vamsi Alisetti,Vikas Kalagi,Sanjukta Krishnagopal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 13 pages, 10 figures, workshop

点击查看摘要

Abstract:Spatio-temporal forecasting is critical in applications such as traffic prediction, energy demand modeling, and weather monitoring. While Graph Attention Networks (GATs) are popular for modeling spatial dependencies, they rely on predefined adjacency structures and dynamic attention scores, introducing inductive biases and computational overhead that can obscure interpretability. We propose InterGAT, a simplified alternative to GAT that replaces masked attention with a fully learnable, symmetric node interaction matrix, capturing latent spatial relationships without relying on fixed graph topologies. Our framework, InterGAT-GRU, which incorporates a GRU-based temporal decoder, outperforms the baseline GAT-GRU in forecasting accuracy, achieving at least a 21% improvement on the SZ-Taxi dataset and a 6% improvement on the Los-Loop dataset across all forecasting horizons (15 to 60 minutes). Additionally, we observed reduction in training time by 60-70% compared to GAT-GRU baseline. Crucially, the learned interaction matrix reveals interpretable structure: it recovers sparse, topology-aware attention patterns that align with community structure. Spectral and clustering analyses show that the model captures both localized and global dynamics, offering insights into the functional topology driving predictions. This highlights how structure learning can simultaneously support prediction, computational efficiency, and topological interpretabil-ity in dynamic graph-based domains. Comments: 13 pages, 10 figures, workshop Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI) Cite as: arXiv:2506.00770 [cs.LG] (or arXiv:2506.00770v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.00770 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-131] HouseTS: A Large-Scale Multimodal Spatiotemporal U.S. Housing Dataset

【速读】:该论文旨在解决长期预测中缺乏可重复基准的问题,特别是在房地产价格预测领域,现有数据在时空深度和上下文丰富性方面存在不足。其解决方案的关键在于引入HouseTS,一个大规模、多模态的数据集,覆盖了2012年3月至2023年12月期间美国30个主要都市区6,000个邮政编码的月度房价数据,并整合了兴趣点(POI)、社会经济指标和详细的房地产指标,以支持更准确和可复现的预测研究。

链接: https://arxiv.org/abs/2506.00765
作者: Shengkun Wang,Yanshen Sun,Fanglan Chen,Linhan Wang,Naren Ramakrishnan,Chang-Tien Lu,Yinlin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate house-price forecasting is essential for investors, planners, and researchers. However, reproducible benchmarks with sufficient spatiotemporal depth and contextual richness for long horizon prediction remain scarce. To address this, we introduce HouseTS a large scale, multimodal dataset covering monthly house prices from March 2012 to December 2023 across 6,000 ZIP codes in 30 major U.S. metropolitan areas. The dataset includes over 890K records, enriched with points of Interest (POI), socioeconomic indicators, and detailed real estate metrics. To establish standardized performance baselines, we evaluate 14 models, spanning classical statistical approaches, deep neural networks (DNNs), and pretrained time-series foundation models. We further demonstrate the value of HouseTS in a multimodal case study, where a vision language model extracts structured textual descriptions of geographic change from time stamped satellite imagery. This enables interpretable, grounded insights into urban evolution. HouseTS is hosted on Kaggle, while all preprocessing pipelines, benchmark code, and documentation are openly maintained on GitHub to ensure full reproducibility and easy adoption.
zh

[AI-132] “Who experiences large model decay and why?” A Hierarchical Framework for Diagnosing Heterogeneous Performance Drift

【速读】:该论文试图解决机器学习(Machine Learning, ML)模型在新场景中部署时性能退化的问题,特别是关注性能退化在不同子群体中的非均匀性。现有方法要么只能解释平均性能变化的来源,要么只能识别受影响的子群体而无法提供其退化原因的深入洞察。论文提出的解决方案是引入一种名为Subgroup-scanning Hierarchical Inference Framework for performance drift (SHIFT)的框架,其关键在于通过分层推理机制,首先确定是否存在因协变量或结果分布变化而导致不可接受性能下降的子群体,随后进一步分析这些子群体性能退化的原因,从而为针对性地缓解性能退化提供依据。

链接: https://arxiv.org/abs/2506.00756
作者: Harvineet Singh,Fan Xia,Alexej Gossmann,Andrew Chuang,Julian C. Hong,Jean Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 13 pages, 9 figures, 8 tables, 18 pages appendix. To be published in Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025

点击查看摘要

Abstract:Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks “Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?” (Where?) and, if so, dives deeper to ask “Can we explain this using more detailed variable(subset)-specific shifts?” (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.
zh

[AI-133] Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在行为上与其声明性偏好(stated preferences)与揭示性偏好(revealed preferences)之间可能存在的偏差问题,这一问题影响了模型的可解释性、可信度、推理透明度和伦理部署。解决方案的关键在于提出一种量化这种偏好偏差的方法,通过构建精心设计的提示数据集,生成一系列强制二元选择,并利用KL散度等指标比较模型在一般原则提示下的声明性偏好与在情境化提示下的揭示性偏好,从而揭示模型在不同语境下可能激活不同指导原则的现象。

链接: https://arxiv.org/abs/2506.00751
作者: Zhuojun Gu,Quan Wang,Shuchu Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) highlight the need to align their behaviors with human values. A critical, yet understudied, issue is the potential divergence between an LLM’s stated preferences (its reported alignment with general principles) and its revealed preferences (inferred from decisions in contextualized scenarios). Such deviations raise fundamental concerns for the interpretability, trustworthiness, reasoning transparency, and ethical deployment of LLMs, particularly in high-stakes applications. This work formally defines and proposes a method to measure this preference deviation. We investigate how LLMs may activate different guiding principles in specific contexts, leading to choices that diverge from previously stated general principles. Our approach involves crafting a rich dataset of well-designed prompts as a series of forced binary choices and presenting them to LLMs. We compare LLM responses to general principle prompts stated preference with LLM responses to contextualized prompts revealed preference, using metrics like KL divergence to quantify the deviation. We repeat the analysis across different categories of preferences and on four mainstream LLMs and find that a minor change in prompt format can often pivot the preferred choice regardless of the preference categories and LLMs in the test. This prevalent phenomenon highlights the lack of understanding and control of the LLM decision-making competence. Our study will be crucial for integrating LLMs into services, especially those that interact directly with humans, where morality, fairness, and social responsibilities are crucial dimensions. Furthermore, identifying or being aware of such deviation will be critically important as LLMs are increasingly envisioned for autonomous agentic tasks where continuous human evaluation of all LLMs’ intermediary decision-making steps is impossible.
zh

[AI-134] CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

【速读】:该论文旨在解决当前代码推理基准在真实软件工程(SE)场景下的适用性不足问题,特别是现有基准多依赖合成数据或教育性编码问题,且主要关注粗粒度推理任务,无法有效评估大语言模型(LLM)在实际SE任务中的表现。其解决方案的关键在于提出CodeSense,这是首个针对真实世界代码软件工程的细粒度代码推理基准,通过从真实仓库中收集Python、C和Java项目,执行测试并获取执行轨迹,构建用于细粒度语义推理任务的基准数据集,同时开发了执行追踪框架和工具集,为未来基准构建和模型微调提供基础。

链接: https://arxiv.org/abs/2506.00750
作者: Monoshi Kumar Roy,Simin Chen,Benjamin Steenhoek,Jinjun Peng,Gail Kaiser,Baishakhi Ray,Wei Le
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding and reasoning about code semantics is essential for enhancing code LLMs’ abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap for the models to handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limit models’ capabilities of code reasoning. Besides dataset, benchmark and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post training. Our code and data are located at this https URL.
zh

[AI-135] MoPINNEnKF: Iterative Model Inference using generic-PINN-based ensemble Kalman filter

【速读】:该论文旨在解决物理信息神经网络(PINNs)在现实场景中面对噪声观测数据和缺失物理机制时,特别是在反问题中表现不佳的问题。其解决方案的关键在于提出一种迭代的多目标PINN集成卡尔曼滤波器(MoPINNEnKF)框架,该框架结合了集成卡尔曼滤波器(EnKF)与非支配排序遗传算法III(NSGA-III),通过生成位于最优帕累托前沿的PINN集合成员,并在解空间中考虑模型不确定性,从而提升PINNs在正问题和反问题中的鲁棒性与准确性。

链接: https://arxiv.org/abs/2506.00731
作者: Binghang Lu,Changhong Mou,Guang Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving forward and inverse problems involving partial differential equations (PDEs) by incorporating physical laws into the training process. However, the performance of PINNs is often hindered in real-world scenarios involving noisy observational data and missing physics, particularly in inverse problems. In this work, we propose an iterative multi-objective PINN ensemble Kalman filter (MoPINNEnKF) framework that improves the robustness and accuracy of PINNs in both forward and inverse problems by using the \textitensemble Kalman filter and the \textitnon-dominated sorting genetic algorithm III (NSGA-III). Specifically, NSGA-III is used as a multi-objective optimizer that can generate various ensemble members of PINNs along the optimal Pareto front, while accounting the model uncertainty in the solution space. These ensemble members are then utilized within the EnKF to assimilate noisy observational data. The EnKF’s analysis is subsequently used to refine the data loss component for retraining the PINNs, thereby iteratively updating their parameters. The iterative procedure generates improved solutions to the PDEs. The proposed method is tested on two benchmark problems: the one-dimensional viscous Burgers equation and the time-fractional mixed diffusion-wave equation (TFMDWE). The numerical results show it outperforms standard PINNs in handling noisy data and missing physics.
zh

[AI-136] Pitfalls in Evaluating Language Model Forecasters

【速读】:该论文试图解决当前对生成式 AI (Generative AI) 在预测任务中表现评估的可靠性问题,特别是大型语言模型 (Large Language Models, LLMs) 的预测能力评估中存在的挑战。论文指出,现有评估方法面临两大核心问题:一是由于时间泄露(temporal leakage)等多种因素导致评估结果难以信任,二是评估性能难以推广到实际预测场景。解决方案的关键在于建立更严格的评估方法,以确保能够准确、可靠地衡量LLMs的预测能力。

链接: https://arxiv.org/abs/2506.00723
作者: Daniel Paleka,Shashwat Goel,Jonas Geiping,Florian Tramèr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 20 pages, 8 figures

点击查看摘要

Abstract:Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
zh

[AI-137] An LLM Agent for Functional Bug Detection in Network Protocols

【速读】:该论文旨在解决网络协议实现中功能错误(functional bugs)的检测问题,这类错误表现为实现与RFC文档中规定的行为不一致,可能导致路由故障、认证绕过和服务中断等严重后果。传统静态分析工具难以完成跨规范文档和源代码的深度语义分析,因此无法有效识别此类错误。论文提出的解决方案是RFCScan,其关键在于利用大语言模型(Large Language Models, LLMs)构建一个自主代理,通过两个核心组件——索引代理和检测代理——实现对协议实现与RFC规范之间的一致性检查。索引代理通过分层总结协议代码语义生成语义索引,而检测代理则通过按需检索迭代收集相关数据结构和函数,从而高效识别潜在的不一致性。

链接: https://arxiv.org/abs/2506.00714
作者: Mingwei Zheng,Chengpeng Wang,Xuwei Liu,Jinyao Guo,Shiwei Feng,Xiangyu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Functional correctness is critical for ensuring the reliability and security of network protocol implementations. Functional bugs, instances where implementations diverge from behaviors specified in RFC documents, can lead to severe consequences, including faulty routing, authentication bypasses, and service disruptions. Detecting these bugs requires deep semantic analysis across specification documents and source code, a task beyond the capabilities of traditional static analysis tools. This paper introduces RFCScan, an autonomous agent that leverages large language models (LLMs) to detect functional bugs by checking conformance between network protocol implementations and their RFC specifications. Inspired by the human auditing procedure, RFCScan comprises two key components: an indexing agent and a detection agent. The former hierarchically summarizes protocol code semantics, generating semantic indexes that enable the detection agent to narrow down the scanning scope. The latter employs demand-driven retrieval to iteratively collect additional relevant data structures and functions, eventually identifying potential inconsistencies with the RFC specifications effectively. We evaluate RFCScan across six real-world network protocol implementations. RFCScan identifies 47 functional bugs with 81.9% precision, of which 20 bugs have been confirmed or fixed by developers.
zh

[AI-138] Bayesian Inference of Training Dataset Membership

【速读】:该论文试图解决机器学习模型训练数据隐私泄露的问题,具体表现为确定某个数据集是否属于模型的训练数据池,这一问题通常通过成员推理攻击(Membership Inference Attacks, MIAs)来揭示。传统MIAs方法通常需要访问模型内部结构或依赖计算密集型的影子模型,而本文提出了一种高效、可解释且基于贝叶斯推断的解决方案。其关键在于通过分析训练后模型的后验指标,如预测误差、置信度(熵)、扰动幅度和数据集统计信息,计算成员身份的后验概率,从而无需大量模型训练即可实现有效的成员推理。

链接: https://arxiv.org/abs/2506.00701
作者: Yongchao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Determining whether a dataset was part of a machine learning model’s training data pool can reveal privacy vulnerabilities, a challenge often addressed through membership inference attacks (MIAs). Traditional MIAs typically require access to model internals or rely on computationally intensive shadow models. This paper proposes an efficient, interpretable and principled Bayesian inference method for membership inference. By analyzing post-hoc metrics such as prediction error, confidence (entropy), perturbation magnitude, and dataset statistics from a trained ML model, our approach computes posterior probabilities of membership without requiring extensive model training. Experimental results on synthetic datasets demonstrate the method’s effectiveness in distinguishing member from non-member datasets. Beyond membership inference, this method can also detect distribution shifts, offering a practical and interpretable alternative to existing approaches.
zh

[AI-139] Optimizing Sensory Neurons: Nonlinear Attention Mechanisms for Accelerated Convergence in Permutation-Invariant Neural Networks for Reinforcement Learning

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)代理训练过程中计算资源消耗大和训练时间长的问题。其解决方案的关键在于提出一种改进的注意力机制,通过使用映射函数对键向量(Key Vectors, K)进行非线性变换,生成新的键向量(K’),从而增强注意力机制的表征能力,提升模型对复杂特征交互的编码能力,并在不牺牲性能的前提下加速收敛。

链接: https://arxiv.org/abs/2506.00691
作者: Junaid Muzaffar,Ahsan Adeel,Khubaib Ahmed,Ingo Frommholz,Zeeshan Pervez,Ahsan ul Haq
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training reinforcement learning (RL) agents often requires significant computational resources and extended training times. To address this, we build upon the foundation laid by Google Brain’s Sensory Neuron, which introduced a novel neural architecture for reinforcement learning tasks that maintained permutation in-variance in the sensory neuron system. While the baseline model demonstrated significant performance improvements over traditional approaches, we identified opportunities to enhance the efficiency of the learning process further. We propose a modified attention mechanism incorporating a non-linear transformation of the key vectors (K) using a mapping function, resulting in a new set of key vectors (K’). This non-linear mapping enhances the representational capacity of the attention mechanism, allowing the model to encode more complex feature interactions and accelerating convergence without compromising performance. Our enhanced model demonstrates significant improvements in learning efficiency, showcasing the potential for non-linear attention mechanisms in advancing reinforcement learning algorithms.
zh

[AI-140] SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning

【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)参数高效微调方法与安全防护措施在评估过程中因数据集、指标和威胁设定不一致而导致的难以公平比较安全性能、实用性和鲁棒性的问题。其解决方案的关键在于提出SafeTuneBed,一个统一微调与防御评估的基准和工具包,它通过整合多样化的微调数据集、支持先进的防御机制,并提供标准化的安全与实用性评估指标,实现端到端的可复现性,从而推动安全LLM微调领域的严谨且可比的研究。

链接: https://arxiv.org/abs/2506.00676
作者: Saad Hossain,Samanvay Vajpayee,Sirisha Rambhatla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, the number of approaches and their recent increase have resulted in diverse evaluations-varied datasets, metrics, and inconsistent threat settings-making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of multiple fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks, and allows for the generation of harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python-first, dataclass-driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine-tuning regime, defense method, and metric suite, while ensuring end-to-end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine-tuning. Code is available at: this https URL
zh

[AI-141] hinking Out of the Box: Hybrid SAT Solving by Unconstrained Continuous Optimization

【速读】:该论文试图解决在混合SAT(hybrid SAT)求解中,传统基于合取范式(CNF)的求解器效率受限的问题,因为许多实际应用需要处理非CNF约束,如XOR、基数和非全等约束。解决方案的关键在于提出一种无约束的连续优化公式,通过惩罚项(penalty terms)来表示混合约束,从而利用强大的无约束优化器(如Adam)提升求解效率。

链接: https://arxiv.org/abs/2506.00674
作者: Zhiwei Zhang,Samy Wu Fung,Anastasios Kyrillidis,Stanley Osher,Moshe Y. Vardi
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The Boolean satisfiability (SAT) problem lies at the core of many applications in combinatorial optimization, software verification, cryptography, and machine learning. While state-of-the-art solvers have demonstrated high efficiency in handling conjunctive normal form (CNF) formulas, numerous applications require non-CNF (hybrid) constraints, such as XOR, cardinality, and Not-All-Equal constraints. Recent work leverages polynomial representations to represent such hybrid constraints, but it relies on box constraints that can limit the use of powerful unconstrained optimizers. In this paper, we propose unconstrained continuous optimization formulations for hybrid SAT solving by penalty terms. We provide theoretical insights into when these penalty terms are necessary and demonstrate empirically that unconstrained optimizers (e.g., Adam) can enhance SAT solving on hybrid benchmarks. Our results highlight the potential of combining continuous optimization and machine-learning-based methods for effective hybrid SAT solving.
zh

[AI-142] OntoRAG : Enhancing Question-Answering through Automated Ontology Derivation from Unstructured Knowledge Bases

【速读】:该论文试图解决传统本体构建过程中依赖领域专家手动操作所带来的耗时、易错及难以适应大规模动态知识领域的问题。解决方案的关键在于提出OntoRAG,这是一个自动化流程,通过整合网络爬取、PDF解析、混合分块、信息抽取、知识图谱构建和本体生成等先进技术,将非结构化数据转化为可查询的本体,从而提升全局理解能力,并在全面性和多样性方面优于传统的检索增强生成(RAG)和GraphRAG方法。

链接: https://arxiv.org/abs/2506.00664
作者: Yash Tiwari,Owais Ahmad Lone,Mayukha Pal
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ontologies are pivotal for structuring knowledge bases to enhance question answering (QA) systems powered by Large Language Models (LLMs). However, traditional ontology creation relies on manual efforts by domain experts, a process that is time intensive, error prone, and impractical for large, dynamic knowledge domains. This paper introduces OntoRAG, an automated pipeline designed to derive ontologies from unstructured knowledge bases, with a focus on electrical relay documents. OntoRAG integrates advanced techniques, including web scraping, PDF parsing, hybrid chunking, information extraction, knowledge graph construction, and ontology creation, to transform unstructured data into a queryable ontology. By leveraging LLMs and graph based methods, OntoRAG enhances global sensemaking capabilities, outperforming conventional Retrieval Augmented Generation (RAG) and GraphRAG approaches in comprehensiveness and diversity. Experimental results demonstrate OntoRAGs effectiveness, achieving a comprehensiveness win rate of 85% against vector RAG and 75% against GraphRAGs best configuration. This work addresses the critical challenge of automating ontology creation, advancing the vision of the semantic web.
zh

[AI-143] Differential Privacy for Deep Learning in Medicine

【速读】:该论文旨在解决在医疗深度学习中如何平衡隐私保护与模型性能及公平性的问题。其关键解决方案是通过差分隐私(Differential Privacy, DP)技术,特别是DP-SGD方法,以及在集中式和联邦设置下的其他机制,来实现对敏感患者数据的保护。研究强调了隐私保障、模型准确性和子群体公平性之间的权衡,并指出在严格隐私约束下,尤其是在数据稀缺或复杂模态中,模型性能可能显著下降,且隐私导致的性能差距对特定人口子群体影响更大。

链接: https://arxiv.org/abs/2506.00660
作者: Marziyeh Mohammadi,Mohsen Vejdanihemmat,Mahshad Lotfinia,Mirabela Rusu,Daniel Truhn,Andreas Maier,Soroosh Tayebi Arasteh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Differential privacy (DP) is a key technique for protecting sensitive patient data in medical deep learning (DL). As clinical models grow more data-dependent, balancing privacy with utility and fairness has become a critical challenge. This scoping review synthesizes recent developments in applying DP to medical DL, with a particular focus on DP-SGD and alternative mechanisms across centralized and federated settings. Using a structured search strategy, we identified 74 studies published up to March 2025. Our analysis spans diverse data modalities, training setups, and downstream tasks, and highlights the tradeoffs between privacy guarantees, model accuracy, and subgroup fairness. We find that while DP-especially at strong privacy budgets-can preserve performance in well-structured imaging tasks, severe degradation often occurs under strict privacy, particularly in underrepresented or complex modalities. Furthermore, privacy-induced performance gaps disproportionately affect demographic subgroups, with fairness impacts varying by data type and task. A small subset of studies explicitly addresses these tradeoffs through subgroup analysis or fairness metrics, but most omit them entirely. Beyond DP-SGD, emerging approaches leverage alternative mechanisms, generative models, and hybrid federated designs, though reporting remains inconsistent. We conclude by outlining key gaps in fairness auditing, standardization, and evaluation protocols, offering guidance for future work toward equitable and clinically robust privacy-preserving DL systems in medicine.
zh

[AI-144] Permutation-Invariant Transformer Neural Architectures for Set-Based Indoor Localization Using Learned RSSI Embeddings

【速读】:该论文旨在解决室内定位问题,特别是在使用Wi-Fi接入点的接收信号强度指示(RSSI)扫描数据时,如何有效处理无序、变长且稀疏的输入。解决方案的关键在于提出一种排列不变的神经架构,利用Set Transformer对(BSSID, RSSI)对进行建模,通过学习嵌入并结合注意力机制来捕捉接入点之间的关系,从而实现对空间结构的精确恢复。

链接: https://arxiv.org/abs/2506.00656
作者: Aris J. Aristorenas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 1 figure

点击查看摘要

Abstract:We propose a permutation-invariant neural architecture for indoor localization using RSSI scans from Wi-Fi access points. Each scan is modeled as an unordered set of (BSSID, RSSI) pairs, where BSSIDs are mapped to learned embeddings and concatenated with signal strength. These are processed by a Set Transformer, enabling the model to handle variable-length, sparse inputs while learning attention- based representations over access point relationships. We evaluate the model on a dataset collected across a campus environment consisting of six buildings. Results show that the model accurately recovers fine-grained spatial structure and maintains performance across physically distinct domains. In our experiments, a simple LSTM consistently outperformed all other models, achieving the lowest mean localization error across three tasks (E1 - E3), with average errors as low as 2.23 m. The Set Transformer performed competitively, ranking second in every experiment and outperforming the MLP, RNN, and basic attention models, particularly in scenarios involving multiple buildings (E2) and multiple floors (E3). Performance degraded most in E2, where signal conditions varied substantially across buildings, highlighting the importance of architectural robustness to domain diversity. This work demonstrates that set-based neural models are a natural fit for signal-based localization, offering a principled approach to handling sparse, unordered inputs in real-world positioning tasks.
zh

[AI-145] Agent Auditor: Human-Level Safety and Security Evaluation for LLM Agents

【速读】:该论文试图解决基于大语言模型(Large Language Model, LLM)的智能体在安全性和安全性评估方面的可靠性问题,因为现有的基于规则或LLM的评估工具往往无法准确识别智能体逐步行动中的潜在危险、忽视细微语义、未能察觉小问题的累积效应以及因安全或安全规则不明确而产生混淆。解决方案的关键在于提出\sys,这是一个无需训练、具备记忆增强推理框架的通用评估系统,它通过让LLM自适应地提取结构化语义特征并生成相应的思维链推理轨迹来构建经验记忆,并利用多阶段、上下文感知的检索增强生成过程动态检索相关推理经验,从而指导LLM评估者对新案例的评估。

链接: https://arxiv.org/abs/2506.00641
作者: Hanjun Luo,Shenyu Dai,Chiming Ni,Xinfeng Li,Guibin Zhang,Kun Wang,Tongliang Liu,Hanan Salam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents’ step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce \sys, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. \sys constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator’s assessment of new cases. Moreover, we developed \data, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. \data comprises \textbf2293 meticulously annotated interaction records, covering \textbf15 risk types across \textbf29 application scenarios. A key feature of \data is its nuanced approach to ambiguous risk situations, employing Strict'' and Lenient’’ judgment standards. Experiments demonstrate that \sys not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly openly accessible.
zh

[AI-146] Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting

【速读】:该论文旨在解决时空预测中的挑战,如信号异常、噪声和分布偏移等问题。现有方法主要通过修改网络架构或训练流程来提高鲁棒性,但这些方法计算成本高且资源消耗大。本文提出了一种新的测试阶段计算范式——基于校准的学习(ST-TTC),其关键在于通过学习校准捕捉非平稳性引起的周期性结构偏差,并在测试阶段实时校正预测结果以提高准确性。具体而言,引入了具有相位-幅度调制的频域校准器以缓解周期性偏移,并提出了带有流式内存队列的快速更新机制以实现高效的测试阶段计算。

链接: https://arxiv.org/abs/2506.00635
作者: Wei Chen,Yuxuan Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
备注: 28 pages, 9 figures, 8 tables

点击查看摘要

Abstract:Spatio-temporal forecasting is crucial in many domains, such as transportation, meteorology, and energy. However, real-world scenarios frequently present challenges such as signal anomalies, noise, and distributional shifts. Existing solutions primarily enhance robustness by modifying network architectures or training procedures. Nevertheless, these approaches are computationally intensive and resource-demanding, especially for large-scale applications. In this paper, we explore a novel test-time computing paradigm, namely learning with calibration, ST-TTC, for spatio-temporal forecasting. Through learning with calibration, we aim to capture periodic structural biases arising from non-stationarity during the testing phase and perform real-time bias correction on predictions to improve accuracy. Specifically, we first introduce a spectral-domain calibrator with phase-amplitude modulation to mitigate periodic shift and then propose a flash updating mechanism with a streaming memory queue for efficient test-time computation. ST-TTC effectively bypasses complex training-stage techniques, offering an efficient and generalizable paradigm. Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of our proposed method.
zh

[AI-147] he Disparate Effects of Partial Information in Bayesian Strategic Learning

【速读】:该论文试图解决在战略学习环境中,部分评分规则信息如何影响公平性的问题。在战略学习中,学习者部署一个评分规则,而代理通过修改其特征(以一定成本为代价)来战略性地提升其结果,但代理无法直接观察评分规则,而是接收到该规则的噪声信号。论文的核心在于分析不同群体在特征修改成本差异下的结果不平等现象,并探讨这种不平等如何随着学习者规则的透明度变化。解决方案的关键在于区分两种代理模型:(i)天真代理,他们直接接受噪声信号;(ii)贝叶斯代理,他们根据信号更新先验信念。研究揭示了在不同透明度水平下,结果不平等的动态变化,特别是指出在贝叶斯代理情况下,不平等保持有界,而在有限透明度下,低成本群体可能被不成比例地损害。

链接: https://arxiv.org/abs/2506.00627
作者: Srikanth Avasarala,Serena Wang,Juba Ziani
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study how partial information about scoring rules affects fairness in strategic learning settings. In strategic learning, a learner deploys a scoring rule, and agents respond strategically by modifying their features – at some cost – to improve their outcomes. However, in our work, agents do not observe the scoring rule directly; instead, they receive a noisy signal of said rule. We consider two different agent models: (i) naive agents, who take the noisy signal at face value, and (ii) Bayesian agents, who update a prior belief based on the signal. Our goal is to understand how disparities in outcomes arise between groups that differ in their costs of feature modification, and how these disparities vary with the level of transparency of the learner’s rule. For naive agents, we show that utility disparities can grow unboundedly with noise, and that the group with lower costs can, perhaps counter-intuitively, be disproportionately harmed under limited transparency. In contrast, for Bayesian agents, disparities remain bounded. We provide a full characterization of disparities across groups as a function of the level of transparency and show that they can vary non-monotonically with noise; in particular, disparities are often minimized at intermediate levels of transparency. Finally, we extend our analysis to settings where groups differ not only in cost, but also in prior beliefs, and study how this asymmetry influences fairness. Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.00627 [cs.GT] (or arXiv:2506.00627v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2506.00627 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-148] RiOSWorld: Benchmarking the Risk of Multimodal Compter-Use Agents

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)作为自主计算机使用代理在现实世界场景中的安全风险评估问题,具体而言,是探究为对话场景设计的安全风险原则是否能有效迁移至实际计算机操作场景。解决方案的关键在于提出一个名为\textbf{RiOSWorld}的基准测试平台,该平台包含492个涉及多种计算机应用的高风险任务,并从风险来源(用户引发的风险和环境风险)以及风险目标意图与完成度两个角度对安全风险进行评估,从而全面揭示当前计算机使用代理在真实环境中面临的安全挑战。

链接: https://arxiv.org/abs/2506.00618
作者: Jingyi Yang,Shuai Shao,Dongrui Liu,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages, 6 figures, Project Page: this https URL

点击查看摘要

Abstract:With the rapid development of multimodal large language models (MLLMs), they are increasingly deployed as autonomous computer-use agents capable of accomplishing complex computer tasks. However, a pressing issue arises: Can the safety risk principles designed and aligned for general MLLMs in dialogue scenarios be effectively transferred to real-world computer-use scenarios? Existing research on evaluating the safety risks of MLLM-based computer-use agents suffers from several limitations: it either lacks realistic interactive environments, or narrowly focuses on one or a few specific risk types. These limitations ignore the complexity, variability, and diversity of real-world environments, thereby restricting comprehensive risk evaluation for computer-use agents. To this end, we introduce \textbfRiOSWorld, a benchmark designed to evaluate the potential risks of MLLM-based agents during real-world computer manipulations. Our benchmark includes 492 risky tasks spanning various computer applications, involving web, social media, multimedia, os, email, and office software. We categorize these risks into two major classes based on their risk source: (i) User-originated risks and (ii) Environmental risks. For the evaluation, we evaluate safety risks from two perspectives: (i) Risk goal intention and (ii) Risk goal completion. Extensive experiments with multimodal agents on \textbfRiOSWorld demonstrate that current computer-use agents confront significant safety risks in real-world scenarios. Our findings highlight the necessity and urgency of safety alignment for computer-use agents in real-world computer manipulation, providing valuable insights for developing trustworthy computer-use agents. Our benchmark is publicly available at this https URL.
zh

[AI-149] A Topological Semantics of Dialogue: Nerve Structures and Logical Extraction

【速读】:该论文试图解决有限对话的语义建模问题,通过拓扑学方法为对话中的每个话语分配一个开集,并构建对应的神经复形以提取基本的组合不变量。解决方案的关键在于利用拓扑神经结构(nerve structures)来捕捉话语之间的联合可满足性关系,从而实现对话中逻辑后果的有效枚举和不一致检测,其核心思想基于经典对偶理论与拓扑语义(如Stone对偶、Priestley对偶、Tarski语义等)以及现代拓扑数据分析和对话语义的进展。

链接: https://arxiv.org/abs/2506.00615
作者: Andreu Ballus Santacana
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT); Logic (math.LO)
备注: 17 pages

点击查看摘要

Abstract:We introduce a concise, topologically-motivated semantics for finite dialogues by mapping each utterance to an open set in a fixed semantic space, building the corresponding nerve complex of joint satisfiability, and extracting fundamental combinatorial invariants: 1. The negative nerve, which enumerates all finite collections of utterances whose opens have empty intersection, providing a straightforward criterion for merging separate transcripts without contradiction. 2. The global interpretation subspace, the unique minimal open in which all asserted utterances hold simultaneously, enabling effective enumeration of all logical consequences of the entire dialogue. 3. A practical demonstration in the Wolfram Language, with algorithms for constructing nerves, detecting inconsistencies, and computing the global interpretation, thereby illustrating computational feasibility. Our framework is grounded in classical duality and topological semantics (Stone duality, Priestley duality, Tarski’s semantics, coherence-space methods, Scott domains, topos semantics, and homotopy type theory) while drawing on recent advances in topological data analysis and dialogue-based semantics. Comments: 17 pages Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT); Logic (math.LO) MSC classes: 03B05, 55U10, 68T27 ACMclasses: F.4.1; I.2.3; I.2.4 Cite as: arXiv:2506.00615 [cs.LO] (or arXiv:2506.00615v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2506.00615 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Andreu Ballús [view email] [v1] Sat, 31 May 2025 15:58:05 UTC (26 KB) Full-text links: Access Paper: View a PDF of the paper titled A Topological Semantics of Dialogue: Nerve Structures and Logical Extraction, by Andreu Ballus SantacanaView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LO prev | next new | recent | 2025-06 Change to browse by: cs cs.AI math math.AT math.LO References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[AI-150] Predictability-Aware Compression and Decompression Framework for Multichannel Time Series Data

【速读】:该论文旨在解决多通道时间序列预测中在边缘和云环境下的效率问题,特别是通过通道压缩来降低运行时间和通信成本,同时保持预测精度。其解决方案的关键在于提出了一种可预测性感知的压缩-解压缩框架,核心思想是利用具有正交性的循环周期性关键矩阵,在压缩过程中捕捉时间序列的可预测性,并在解压缩过程中通过放松过于简化的数据假设来减轻重建误差。

链接: https://arxiv.org/abs/2506.00614
作者: Ziqi Liu,Pei Zeng,Yi Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages,3 figures

点击查看摘要

Abstract:Real-world multichannel time series prediction faces growing demands for efficiency across edge and cloud environments, making channel compression a timely and essential problem. Motivated by success of Multiple-Input Multiple-Output (MIMO) methods, we propose a predictability-aware compression-decompression framework to reduce runtime, lower communication cost, and maintain prediction accuracy across diverse predictors. The core idea involves using a circular periodicity key matrix with orthogonality to capture underlying time series predictability during compression and to mitigate reconstruction errors during decompression by relaxing oversimplified data assumptions. Theoretical and empirical analyses show that the proposed framework is both time-efficient and scalable under a large number of channels. Extensive experiments on six datasets across various predictors demonstrate that the proposed method achieves superior overall performance by jointly considering prediction accuracy and runtime, while maintaining strong compatibility with diverse predictors.
zh

[AI-151] Evaluating Robot Policies in a World Model

【速读】:该论文试图解决机器人控制策略在现实世界中评估困难的问题,因为实际测试成本高昂,而手工设计的仿真环境往往无法准确反映真实场景,导致仿真评估与实际结果之间相关性差。其解决方案的关键是提出基于世界模型的策略评估方法(World-model-based Policy Evaluation, WPE),通过训练一个动作条件视频生成模型作为真实环境的代理,并引入一种称为“分块自回归扩散变换器”的推理方案以实现高效且误差累积可控的交互步骤模拟。此外,利用视觉-语言模型(Vision-Language Model, VLM)作为奖励函数进行策略评估,从而在虚拟环境中对机器人策略进行有效评估。

链接: https://arxiv.org/abs/2506.00613
作者: Julian Quevedo,Percy Liang,Sherry Yang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:Robotics has broad applications from automating house chores to taking care of patients. However, evaluating robot control policies is challenging, as real-world testing is expensive, while handcrafted simulations often fail to accurately reflect real-world conditions, resulting in poor correlation between simulated evaluation and real-world outcomes. In this work, we investigate World-model-based Policy Evaluation (WPE). We first train an action-conditioned video generation model as a proxy to real-world environments. To enable efficient rollouts of hundreds of interactive steps while mitigating error accumulation in the world model, we propose an inference scheme which we call Blockwise-Autoregressive Diffusion Transformer with adjustable context and decoding horizon lengths. To ensure that the world model indeed follows action input, we propose metrics based on the agreement between the ground truth video and generated video conditioned on the same sequence of actions to evaluate the world model. We then use the world model for policy evaluation by performing Monte Carlo rollouts in the world model while employing a vision-language model (VLM) as a reward function. Interestingly, we found that WPE tends to underestimate the policy values for in-distribution actions and overestimate policy values for out-of-distribution actions. Nevertheless, WPE preserves the relative rankings of different policies. In emulating real robot executions, WPE achieves high fidelity in mimicing robot arm movements as in real videos, while emulating highly realistic object interaction remains challenging. Despite this limitation, we show that a world model can serve as a starting point for evaluating robot policies before real-world deployment.
zh

[AI-152] Graph Evidential Learning for Anomaly Detection KDD25

【速读】:该论文旨在解决图异常检测中因缺乏可靠异常标注数据集所带来的挑战,其解决方案的关键在于提出一种基于证据学习的图证据学习(Graph Evidential Learning, GEL)框架。GEL通过引入证据分布对节点特征和图拓扑进行建模,量化了图不确定性与重构不确定性,并将这两种不确定性整合到异常评分机制中,从而提升了模型在噪声和结构扰动下的鲁棒性与检测性能。

链接: https://arxiv.org/abs/2506.00594
作者: Chunyu Wei,Wenji Hu,Xingjia Hao,Yunhai Wang,Yueguo Chen,Bing Bai,Fei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by KDD25

点击查看摘要

Abstract:Graph anomaly detection faces significant challenges due to the scarcity of reliable anomaly-labeled datasets, driving the development of unsupervised methods. Graph autoencoders (GAEs) have emerged as a dominant approach by reconstructing graph structures and node features while deriving anomaly scores from reconstruction errors. However, relying solely on reconstruction error for anomaly detection has limitations, as it increases the sensitivity to noise and overfitting. To address these issues, we propose Graph Evidential Learning (GEL), a probabilistic framework that redefines the reconstruction process through evidential learning. By modeling node features and graph topology using evidential distributions, GEL quantifies two types of uncertainty: graph uncertainty and reconstruction uncertainty, incorporating them into the anomaly scoring mechanism. Extensive experiments demonstrate that GEL achieves state-of-the-art performance while maintaining high robustness against noise and structural perturbations.
zh

[AI-153] Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn ICML2025

【速读】:该论文旨在解决深度持续强化学习(deep continual RL)中因网络输出对批次外数据的变异性(churn)增加而导致的可塑性丧失问题。其解决方案的关键在于通过减少churn来防止神经切线核(NTK)矩阵秩的崩溃,并自适应地调整常规RL梯度的步长,从而提升模型在连续学习环境中的性能。论文提出的Continual Churn Approximated Reduction (C-CHAIN)方法在多个基准测试中表现出优于基线的效果。

链接: https://arxiv.org/abs/2506.00592
作者: Hongyao Tang,Johan Obando-Ceron,Pablo Samuel Castro,Aaron Courville,Glen Berseth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL from the lens of churn: network output variability for out-of-batch data induced by mini-batch training. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.
zh

[AI-154] mporal Chunking Enhances Recognition of Implicit Sequential Patterns

【速读】:该论文试图解决传统神经网络(如循环神经网络,Recurrent Neural Networks, RNNs)在处理多时间尺度的时序模式时存在的局限性,特别是在资源受限环境下学习效率低的问题。其解决方案的关键在于提出一种受神经启发的方法,将时间序列压缩为带有上下文标签(context-tagged chunks)的块,其中每个标签代表序列中重复出现的结构单元或“社区”,这些标签在离线睡眠阶段生成,作为过去经验的紧凑参考,使学习者能够整合超出即时输入范围的信息。

链接: https://arxiv.org/abs/2506.00588
作者: Jayanta Dey,Nicholas Soures,Miranda Gonzales,Itamar Lerner,Christopher Kanan,Dhireesha Kudithipudi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this pilot study, we propose a neuro-inspired approach that compresses temporal sequences into context-tagged chunks, where each tag represents a recurring structural unit or``community’’ in the sequence. These tags are generated during an offline sleep phase and serve as compact references to past experience, allowing the learner to incorporate information beyond its immediate input range. We evaluate this idea in a controlled synthetic environment designed to reveal the limitations of traditional neural network based sequence learners, such as recurrent neural networks (RNNs), when facing temporal patterns on multiple timescales. We evaluate this idea in a controlled synthetic environment designed to reveal the limitations of traditional neural network based sequence learners, such as recurrent neural networks (RNNs), when facing temporal patterns on multiple timescales. Our results, while preliminary, suggest that temporal chunking can significantly enhance learning efficiency under resource constrained settings. A small-scale human pilot study using a Serial Reaction Time Task further motivates the idea of structural abstraction. Although limited to synthetic tasks, this work serves as an early proof-of-concept, with initial evidence that learned context tags can transfer across related task, offering potential for future applications in transfer learning.
zh

[AI-155] Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLM s ACL2025

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在置信度估计上的校准问题,即模型在不同难度任务中表现出的过度自信问题。研究发现,尽管模型的答题准确性保持不变,但当被提示以不同角色(如专家或普通用户)回答问题时,模型会表现出刻板印象式的置信度偏差。解决方案的关键在于提出一种无需答案的置信度估计(Answer-Free Confidence Estimation, AFCE),该方法通过两个阶段的提示策略,首先仅获取对问题的置信度评分,随后单独请求答案,从而显著降低过度自信并提升对任务难度的人类相似敏感性。

链接: https://arxiv.org/abs/2506.00582
作者: Chenjun Xu,Bingbing Wen,Bin Han,Robert Wolfe,Lucy Lu Wang,Bill Howe
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Findings, 20 pages

点击查看摘要

Abstract:Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: less sensitive to task difficulty, and when prompted to answer based on different personas – e.g., expert vs layman, or different race, gender, and ages – the models will respond with stereotypically biased confidence estimations even though their underlying answer accuracy remains the same. Based on these observations, we propose Answer-Free Confidence Estimation (AFCE) to improve confidence calibration and LLM interpretability in these settings. AFCE is a self-assessment method that employs two stages of prompting, first eliciting only confidence scores on questions, then asking separately for the answer. Experiments on the MMLU and GPQA datasets spanning subjects and difficulty show that this separation of tasks significantly reduces overconfidence and delivers more human-like sensitivity to task difficulty.
zh

[AI-156] ORAN-GUIDE: RAG -Driven Prompt Learning for LLM -Augmented Reinforcement Learning in O-RAN Network Slicing

【速读】:该论文旨在解决开放无线接入网络(O-RAN)中动态、异构服务需求下深度强化学习(DRL)在处理原始、非结构化输入时的局限性,从而影响策略泛化和决策效率的问题。解决方案的关键在于提出一种基于双大语言模型(LLM)的框架ORAN-GUIDE,通过引入领域特定的语言模型ORANSight生成语义丰富、上下文感知的状态表示,并结合可学习的标记与冻结的GPT编码器,输出高层语义表示供DRL代理使用,从而提升多智能体强化学习(MARL)的样本效率、策略收敛性和性能泛化能力。

链接: https://arxiv.org/abs/2506.00576
作者: Fatemeh Lotfi,Hossein Rajoli,Fatemeh Afghah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Advanced wireless networks must support highly dynamic and heterogeneous service demands. Open Radio Access Network (O-RAN) architecture enables this flexibility by adopting modular, disaggregated components, such as the RAN Intelligent Controller (RIC), Centralized Unit (CU), and Distributed Unit (DU), that can support intelligent control via machine learning (ML). While deep reinforcement learning (DRL) is a powerful tool for managing dynamic resource allocation and slicing, it often struggles to process raw, unstructured input like RF features, QoS metrics, and traffic trends. These limitations hinder policy generalization and decision efficiency in partially observable and evolving environments. To address this, we propose \textitORAN-GUIDE, a dual-LLM framework that enhances multi-agent RL (MARL) with task-relevant, semantically enriched state representations. The architecture employs a domain-specific language model, ORANSight, pretrained on O-RAN control and configuration data, to generate structured, context-aware prompts. These prompts are fused with learnable tokens and passed to a frozen GPT-based encoder that outputs high-level semantic representations for DRL agents. This design adopts a retrieval-augmented generation (RAG) style pipeline tailored for technical decision-making in wireless systems. Experimental results show that ORAN-GUIDE improves sample efficiency, policy convergence, and performance generalization over standard MARL and single-LLM baselines.
zh

[AI-157] Prompt-Tuned LLM -Augmented DRL for Dynamic O-RAN Network Slicing

【速读】:该论文旨在解决现代无线网络中动态环境下的资源分配问题,特别是在O-RAN(Open Radio Access Network)切片场景下,传统深度强化学习(Deep Reinforcement Learning, DRL)因反馈信息分散且不断变化而难以实现最优决策。其解决方案的关键在于引入基于上下文的适配方法,将可学习的提示(learnable prompts)集成到大语言模型(Large Language Models, LLMs)增强的DRL框架中,通过任务特定的提示动态调整状态表示,从而提升强化学习代理的性能与适应性。

链接: https://arxiv.org/abs/2506.00574
作者: Fatemeh Lotfi,Hossein Rajoli,Fatemeh Afghah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern wireless networks must adapt to dynamic conditions while efficiently managing diverse service demands. Traditional deep reinforcement learning (DRL) struggles in these environments, as scattered and evolving feedback makes optimal decision-making challenging. Large Language Models (LLMs) offer a solution by structuring unorganized network feedback into meaningful latent representations, helping RL agents recognize patterns more effectively. For example, in O-RAN slicing, concepts like SNR, power levels and throughput are semantically related, and LLMs can naturally cluster them, providing a more interpretable state representation. To leverage this capability, we introduce a contextualization-based adaptation method that integrates learnable prompts into an LLM-augmented DRL framework. Instead of relying on full model fine-tuning, we refine state representations through task-specific prompts that dynamically adjust to network conditions. Utilizing ORANSight, an LLM trained on O-RAN knowledge, we develop Prompt-Augmented Multi agent RL (PA-MRL) framework. Learnable prompts optimize both semantic clustering and RL objectives, allowing RL agents to achieve higher rewards in fewer iterations and adapt more efficiently. By incorporating prompt-augmented learning, our approach enables faster, more scalable, and adaptive resource allocation in O-RAN slicing. Experimental results show that it accelerates convergence and outperforms other baselines.
zh

[AI-158] A “Wenlu” Brain System for Multimodal Cognition and Embodied Decision-Making: A Secure New Architecture for Deep Integration of Foundation Models and Domain Knowledge

【速读】:该论文试图解决在复杂现实应用场景中,如何有效整合基础模型的语言理解能力与领域特定知识库的问题(domain-specific knowledge bases)。解决方案的关键在于提出了一种多模态认知与具身决策脑系统“Wenlu”,其核心是通过类脑记忆标记与回放机制,实现私有知识与公共模型的安全融合、多模态数据的统一处理以及从认知到硬件级代码自动生成的闭环决策。

链接: https://arxiv.org/abs/2506.00570
作者: Liang Geng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid penetration of artificial intelligence across industries and scenarios, a key challenge in building the next-generation intelligent core lies in effectively integrating the language understanding capabilities of foundation models with domain-specific knowledge bases in complex real-world applications. This paper proposes a multimodal cognition and embodied decision-making brain system, Wenlu", designed to enable secure fusion of private knowledge and public models, unified processing of multimodal data such as images and speech, and closed-loop decision-making from cognition to automatic generation of hardware-level code. The system introduces a brain-inspired memory tagging and replay mechanism, seamlessly integrating user-private data, industry-specific knowledge, and general-purpose language models. It provides precise and efficient multimodal services for enterprise decision support, medical analysis, autonomous driving, robotic control, and more. Compared with existing solutions, Wenlu" demonstrates significant advantages in multimodal processing, privacy security, end-to-end hardware control code generation, self-learning, and sustainable updates, thus laying a solid foundation for constructing the next-generation intelligent core.
zh

[AI-159] Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

【速读】:该论文试图解决在深度强化学习中如何有效进行度量学习(metric learning)以提升状态抽象的问题,特别是如何准确估计行为度量(behavioral metrics),如双模拟度量(bisimulation metrics),从而增强对任务无关噪声的鲁棒性。解决方案的关键在于通过将观察空间中的度量近似嵌入到表示空间中,实现对状态的高效抽象,同时通过统一的概念框架——即具有不同设计选择的等距嵌入(isometric embeddings),系统评估了多种方法的效果,并引入了去噪因子等新评估指标以更全面地分析度量学习的作用。

链接: https://arxiv.org/abs/2506.00563
作者: Ziyan Luo,Tianwei Ni,Pierre-Luc Bacon,Doina Precup,Xujie Si
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep reinforcement learning (RL), we evaluate five recent approaches, unified conceptually as isometric embeddings with varying design choices. We benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 370 task configurations with diverse noise settings. Beyond final returns, we introduce the evaluation of a denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose and evaluate an isolated metric estimation setting, in which the encoder is influenced solely by the metric loss. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.
zh

[AI-160] Imputation of Missing Data in Smooth Pursuit Eye Movements Using a Self-Attention-based Deep Learning Approach

【速读】:该论文旨在解决时间序列中缺失数据的问题,特别是在生物医学序列(如平滑追踪眼动)中,由于眨眼和跟踪丢失导致的数据缺口,这给分析和提取有意义的生物标志物带来了挑战。解决方案的关键在于提出一种基于自注意力机制的生成式 AI (Generative AI) 填补框架,利用深度学习和自注意力机制进行数据填补,并通过定制的自编码器进一步优化填补结果,以更好地表征平滑追踪眼动序列。

链接: https://arxiv.org/abs/2506.00545
作者: Mehdi Bejani,Guillermo Perez-de-Arenaza-Pozo,Julián D. Arias-Londoño,Juan I. Godino-LLorente
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Missing data is a relevant issue in time series, especially in biomedical sequences such as those corresponding to smooth pursuit eye movements, which often contain gaps due to eye blinks and track losses, complicating the analysis and extraction of meaningful biomarkers. In this paper, a novel imputation framework is proposed using Self-Attention-based Imputation networks for time series, which leverages the power of deep learning and self-attention mechanisms to impute missing data. We further refine the imputed data using a custom made autoencoder, tailored to represent smooth pursuit eye movement sequences. The proposed approach was implemented using 5,504 sequences from 172 Parkinsonian patients and healthy controls. Results show a significant improvement in the accuracy of reconstructed eye movement sequences with respect to other state of the art techniques, substantially reducing the values for common time domain error metrics such as the mean absolute error, mean relative error, and root mean square error, while also preserving the signal’s frequency domain characteristics. Moreover, it demonstrates robustness when large intervals of data are missing. This method offers an alternative solution for robustly handling missing data in time series, enhancing the reliability of smooth pursuit analysis for the screening and monitoring of neurodegenerative disorders.
zh

[AI-161] he Security Threat of Compressed Projectors in Large Vision-Language Models

【速读】:该论文试图解决大型视觉语言模型(Large Visual Language Models, LVLMs)中视觉语言投影器(Visual Language Projector, VLP)的安全性问题。研究发现,主流的VLP可分为压缩型和非压缩型,二者在性能和计算效率上各有优势,但其安全特性尚未被充分探讨。论文的关键解决方案在于揭示压缩型VLP存在显著的安全漏洞,使得攻击者即使缺乏结构信息也能成功入侵LVLMs,而非压缩型VLP则表现出更强的安全性,未引入额外漏洞。这一发现为研究人员选择更安全可靠的VLP提供了重要依据。

链接: https://arxiv.org/abs/2506.00534
作者: Yudong Zhang,Ruobing Xie,Xingwu Sun,Jiansheng Chen,Zhanhui Kang,Di Wang,Yu Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, and each offering distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of structural information. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code will be released.
zh

[AI-162] M2WLLM : Multi-Modal Multi-Task Ultra-Short-term Wind Power Prediction Algorithm Based on Large Language Model

【速读】:该论文旨在解决风能接入电网时对超短期风功率预测的准确性需求,以保障电网稳定性和优化资源配置。其解决方案的关键在于提出M2WLLM模型,该模型通过融合文本信息与时间序列数值数据,克服了传统方法和深度学习方法的局限性,利用大型语言模型(Large Language Models, LLMs)的多模态数据处理能力,显著提升了风功率预测的精度。

链接: https://arxiv.org/abs/2506.00531
作者: Hang Fana,Mingxuan Lib,Zuhan Zhanga,Long Chengc,Yujian Ye,Dunnan Liua
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of wind energy into power grids necessitates accurate ultra-short-term wind power forecasting to ensure grid stability and optimize resource allocation. This study introduces M2WLLM, an innovative model that leverages the capabilities of Large Language Models (LLMs) for predicting wind power output at granular time intervals. M2WLLM overcomes the limitations of traditional and deep learning methods by seamlessly integrating textual information and temporal numerical data, significantly improving wind power forecasting accuracy through multi-modal data. Its architecture features a Prompt Embedder and a Data Embedder, enabling an effective fusion of textual prompts and numerical inputs within the LLMs framework. The Semantic Augmenter within the Data Embedder translates temporal data into a format that the LLMs can comprehend, enabling it to extract latent features and improve prediction accuracy. The empirical evaluations conducted on wind farm data from three Chinese provinces demonstrate that M2WLLM consistently outperforms existing methods, such as GPT4TS, across various datasets and prediction horizons. The results highlight LLMs’ ability to enhance accuracy and robustness in ultra-short-term forecasting and showcase their strong few-shot learning capabilities.
zh

[AI-163] Pro3D-Editor : A Progressive-Views Perspective for Consistent and Precise 3D Editing

【速读】:该论文旨在解决文本引导的3D编辑中多视角一致性不足的问题,即现有方法在编辑2D视图后将其投影回3D空间时,未能充分考虑跨视角的相互依赖关系,导致编辑结果在不同视角下不一致。其解决方案的关键在于提出一种“渐进式视角”(progressive-views)范式,通过从编辑显著视图向其他编辑稀疏视图传播编辑语义,实现更一致的3D编辑。具体而言,该方案包含三个核心模块:主视图采样器、关键视图渲染器和全视图精修器,其中关键视图渲染器利用Mixture-of-View-Experts Low-Rank Adaption (MoVE-LoRA) 技术实现语义的准确传播。

链接: https://arxiv.org/abs/2506.00512
作者: Yang Zheng,Mengqi Huang,Nan Chen,Zhendong Mao
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-guided 3D editing aims to precisely edit semantically relevant local 3D regions, which has significant potential for various practical applications ranging from 3D games to film production. Existing methods typically follow a view-indiscriminate paradigm: editing 2D views indiscriminately and projecting them back into 3D space. However, they overlook the different cross-view interdependencies, resulting in inconsistent multi-view editing. In this study, we argue that ideal consistent 3D editing can be achieved through a \textitprogressive-views paradigm, which propagates editing semantics from the editing-salient view to other editing-sparse views. Specifically, we propose \textitPro3D-Editor, a novel framework, which mainly includes Primary-view Sampler, Key-view Render, and Full-view Refiner. Primary-view Sampler dynamically samples and edits the most editing-salient view as the primary view. Key-view Render accurately propagates editing semantics from the primary view to other key views through its Mixture-of-View-Experts Low-Rank Adaption (MoVE-LoRA). Full-view Refiner edits and refines the 3D object based on the edited multi-views. Extensive experiments demonstrate that our method outperforms existing methods in editing accuracy and spatial consistency.
zh

[AI-164] Monitoring Robustness and Individual Fairness

【速读】:该论文试图解决部署中的黑箱人工智能模型在运行时的输入-输出鲁棒性监控问题,旨在通过实时监测检测相似输入导致不同输出的情况,从而提升AI决策系统的可信度。解决方案的关键在于将监控问题建模为固定半径最近邻(FRNN)搜索问题,并提出轻量级监控工具Clemont,其中部分监控器采用改进的在线FRNN算法,另一部分则基于二进制决策图(BDD)的新型算法,同时引入高效的并行化技术以降低计算时间。

链接: https://arxiv.org/abs/2506.00496
作者: Ashutosh Gupta,Thomas A. Henzinger,Konstantin Kueffner,Kaushik Mallik,David Pape
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Input-output robustness appears in various different forms in the literature, such as robustness of AI models to adversarial or semantic perturbations and individual fairness of AI models that make decisions about humans. We propose runtime monitoring of input-output robustness of deployed, black-box AI models, where the goal is to design monitors that would observe one long execution sequence of the model, and would raise an alarm whenever it is detected that two similar inputs from the past led to dissimilar outputs. This way, monitoring will complement existing offline ``robustification’’ approaches to increase the trustworthiness of AI decision-makers. We show that the monitoring problem can be cast as the fixed-radius nearest neighbor (FRNN) search problem, which, despite being well-studied, lacks suitable online solutions. We present our tool Clemont, which offers a number of lightweight monitors, some of which use upgraded online variants of existing FRNN algorithms, and one uses a novel algorithm based on binary decision diagrams – a data-structure commonly used in software and hardware verification. We have also developed an efficient parallelization technique that can substantially cut down the computation time of monitors for which the distance between input-output pairs is measured using the L_\infty norm. Using standard benchmarks from the literature of adversarial and semantic robustness and individual fairness, we perform a comparative study of different monitors in \tool, and demonstrate their effectiveness in correctly detecting robustness violations at runtime. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2506.00496 [cs.AI] (or arXiv:2506.00496v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2506.00496 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3711896.3737054 Focus to learn more DOI(s) linking to related resources Submission history From: Konstantin Kueffner [view email] [v1] Sat, 31 May 2025 10:27:54 UTC (16,695 KB) Full-text links: Access Paper: View a PDF of the paper titled Monitoring Robustness and Individual Fairness, by Ashutosh Gupta and Thomas A. Henzinger and Konstantin Kueffner and Kaushik Mallik and David PapeView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.AI prev | next new | recent | 2025-06 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[AI-165] Multi-Objective Neural Network Assisted Design Optimization of Soft Fin-Ray Grippers for Enhanced Grasping Performance

【速读】:该论文旨在解决软体Fin-Ray夹爪在设计过程中面临的非线性抓取力与变形行为建模难题,以及如何在高抓取力与精细操作之间实现平衡的问题。其解决方案的关键在于利用有限元方法(FEM)建立夹爪的形变和接触力数据集,并通过多层感知机(MLP)进行预测建模,同时采用非支配排序遗传算法(NSGA-II)进行多目标优化,从而找到在最大接触力和尖端位移之间具有最优权衡的设计方案。

链接: https://arxiv.org/abs/2506.00494
作者: Ali Ghanizadeh,Ali Ahmadi,Arash Bahrami
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Soft Fin-Ray grippers can perform delicate and careful manipulation, which has caused notable attention in different fields. These grippers can handle objects of various forms and sizes safely. The internal structure of the Fin-Ray finger plays a significant role in its adaptability and grasping performance. However, modeling the non-linear grasp force and deformation behaviors for design purposes is challenging. Moreover, when the Fin-Ray finger becomes more rigid and capable of exerting higher forces, it becomes less delicate in handling objects. The contrast between these two objectives gives rise to a multi-objective optimization problem. In this study, we employ finite element method (FEM) to estimate the deflections and contact forces of the Fin-Ray, grasping cylindrical objects. This dataset is then used to construct a multilayer perception (MLP) for prediction of the contact force and the tip displacement. The FEM dataset consists of three input and four target features. The three input features of the MLP and optimization design variables are the thickness of the front and supporting beams, the thickness of the cross beams, and the equal spacing between the cross beams. In addition, the target features are the maximum contact forces and maximum tip displacements in x- and y-directions. The magnitude of maximum contact force and magnitude of maximum tip displacement are the two objectives, showing the trade-off between force and delicate manipulation in soft Fin-Ray grippers. Furthermore, the optimized set of solutions are found using multi-objective optimal techniques. We use non-dominated sorting genetic algorithm (NSGA-II) method for this purpose. Our findings demonstrate that our methodologies can be used to improve the design and gripping performance of soft robotic grippers, helping us to choose a design not only for delicate grasping but also for high-force applications.
zh

[AI-166] It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中参数统计分布及其对初始化、训练动态和下游效率影响的研究不足问题。其关键解决方案是基于广义高斯分布(Generalized Gaussian Distribution, GGD)构建一个统一的端到端优化框架,包括基于GGD的初始化方案、一种后训练正则化方法DeepShape以及针对GGD分布初始化的8位浮点格式RF8,从而实现模型压缩与性能优化的平衡。

链接: https://arxiv.org/abs/2506.00486
作者: Jun Wu,Yirong Xiong,Jiangtao Wen,Yuxing Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters follow generalized Gaussian distributions (GGDs) better. By optimizing GG priors during training, BackSlash can reduce parameters by up to 90% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimized degradation in performance; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: this https URL.
zh

[AI-167] Comparing Traditional and Reinforcement-Learning Methods for Energy Storag e Control

【速读】:该论文试图解决传统方法与强化学习(Reinforcement Learning, RL)在能量存储管理中的权衡问题,特别是当使用生成式RL策略代替传统方法时,针对特定实例寻找最优控制策略所导致的性能损失。解决方案的关键在于基于一个简化的微电网模型(包含负载、光伏电源和储能设备),分析三种复杂度逐渐增加的应用场景,并提供每个场景的详细公式化描述及优化挑战,进而对比传统方法与RL方法的性能,探讨各自适用的场景并提出未来研究方向。

链接: https://arxiv.org/abs/2506.00459
作者: Elinor Ginzburg,Itay Segev,Yoash Levron,Sarah Keren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We aim to better understand the tradeoffs between traditional and reinforcement learning (RL) approaches for energy storage management. More specifically, we wish to better understand the performance loss incurred when using a generative RL policy instead of using a traditional approach to find optimal control policies for specific instances. Our comparison is based on a simplified micro-grid model, that includes a load component, a photovoltaic source, and a storage device. Based on this model, we examine three use cases of increasing complexity: ideal storage with convex cost functions, lossy storage devices, and lossy storage devices with convex transmission losses. With the aim of promoting the principled use RL based methods in this challenging and important domain, we provide a detailed formulation of each use case and a detailed description of the optimization challenges. We then compare the performance of traditional and RL methods, discuss settings in which it is beneficial to use each method, and suggest avenues for future investigation.
zh

[AI-168] Reinforcement Learning for Hanabi

【速读】:该论文旨在解决在合作性不完全信息环境中,如何提升强化学习(Reinforcement Learning, RL)代理性能的问题。研究的关键在于评估不同表格型和深度强化学习算法在面对不同类型对手时的表现,并探索其在不同情境下的适应性和优势。研究发现,时间差分(Temporal Difference, TD)算法在整体性能和玩法类型的平衡上优于表格型代理,其中表格型期望SARSA和深度Q学习代理表现最佳。

链接: https://arxiv.org/abs/2506.00458
作者: Nina Cohen,Kordel K. France
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Hanabi has become a popular game for research when it comes to reinforcement learning (RL) as it is one of the few cooperative card games where you have incomplete knowledge of the entire environment, thus presenting a challenge for a RL agent. We explored different tabular and deep reinforcement learning algorithms to see which had the best performance both against an agent of the same type and also against other types of agents. We establish that certain agents played their highest scoring games against specific agents while others exhibited higher scores on average by adapting to the opposing agent’s behavior. We attempted to quantify the conditions under which each algorithm provides the best advantage and identified the most interesting interactions between agents of different types. In the end, we found that temporal difference (TD) algorithms had better overall performance and balancing of play types compared to tabular agents. Specifically, tabular Expected SARSA and deep Q-Learning agents showed the best performance.
zh

[AI-169] Diffusion Models for Increasing Accuracy in Olfaction Sensors and Datasets

【速读】:该论文旨在解决机器人在复杂环境中进行气味源定位(odour source localization, OSL)时面临的模糊性问题,特别是在嗅觉数据集和传感器分辨率有限的情况下,机器人可能将气味错误地归因于不正确的物体。其解决方案的关键在于引入一种基于扩散的分子生成机器学习方法,该方法能够扩展化学空间,超越现有嗅觉数据集和视觉-语言模型(VLMs)训练数据的限制,从而生成潜在的未被记录的气味分子,并通过先进的嗅觉传感器进行更准确的验证,提升气味与正确来源之间的关联能力。

链接: https://arxiv.org/abs/2506.00455
作者: Kordel K. France,Ovidiu Daescu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robotic odour source localization (OSL) is a critical capability for autonomous systems operating in complex environments. However, current OSL methods often suffer from ambiguities, particularly when robots misattribute odours to incorrect objects due to limitations in olfactory datasets and sensor resolutions. To address this challenge, we introduce a novel machine learning method using diffusion-based molecular generation to enhance odour localization accuracy that can be used by itself or with automated olfactory dataset construction pipelines with vision-language models (VLMs) This generative process of our diffusion model expands the chemical space beyond the limitations of both current olfactory datasets and the training data of VLMs, enabling the identification of potential odourant molecules not previously documented. The generated molecules can then be more accurately validated using advanced olfactory sensors which emulate human olfactory recognition through electronic sensor arrays. By integrating visual analysis, language processing, and molecular generation, our framework enhances the ability of olfaction-vision models on robots to accurately associate odours with their correct sources, thereby improving navigation and decision-making in environments where olfactory cues are essential. Our methodology represents a foundational advancement in the field of robotic olfaction, offering a scalable solution to the challenges posed by limited olfactory data and sensor ambiguities.
zh

[AI-170] MetaNet: Topological Meta-Learning Framework for Dynamic Link Prediction ICML2025

【速读】:该论文试图解决动态图(dynamic graphs)在传统图学习中的挑战,特别是其结构变化和时间依赖性带来的问题。现有基于元学习(meta-learning)的动态图神经网络模型大多依赖于固定权重更新参数,忽视了动态演化图中本质的高阶拓扑信息。解决方案的关键在于设计一种基于Dowker复形和zigzag持久同调的高效稳定动态图持久同调表示方法——Dowker Zigzag Persistence (DZP),并在此基础上提出TMetaNet模型,通过利用高阶拓扑特征之间的距离实现跨快照的有效适应,从而提升模型在动态图分析中的性能与鲁棒性。

链接: https://arxiv.org/abs/2506.00453
作者: Hao Li,Hao Wan,Yuzhou Chen,Dongsheng Ye,Yulia Gel,Hao Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML2025

点击查看摘要

Abstract:Dynamic graphs evolve continuously, presenting challenges for traditional graph learning due to their changing structures and temporal dependencies. Recent advancements have shown potential in addressing these challenges by developing suitable meta-learning-based dynamic graph neural network models. However, most meta-learning approaches for dynamic graphs rely on fixed weight update parameters, neglecting the essential intrinsic complex high-order topological information of dynamically evolving graphs. We have designed Dowker Zigzag Persistence (DZP), an efficient and stable dynamic graph persistent homology representation method based on Dowker complex and zigzag persistence, to capture the high-order features of dynamic graphs. Armed with the DZP ideas, we propose TMetaNet, a new meta-learning parameter update model based on dynamic topological features. By utilizing the distances between high-order topological features, TMetaNet enables more effective adaptation across snapshots. Experiments on real-world datasets demonstrate TMetaNet’s state-of-the-art performance and resilience to graph noise, illustrating its high potential for meta-learning and dynamic graph analysis. Our code is available at this https URL.
zh

[AI-171] RLAE: Reinforcement Learning-Assisted Ensemble for LLM s

【速读】:该论文试图解决传统集成方法在结合大语言模型(Large Language Models, LLMs)时依赖固定权重策略,无法适应模型能力的动态和上下文相关特性的问题。其解决方案的关键在于提出一种基于强化学习的集成框架(Reinforcement Learning-Assisted Ensemble for LLMs, RLAE),该框架将LLM集成重新建模为一个马尔可夫决策过程(Markov Decision Process, MDP),并通过强化学习代理动态调整集成权重,考虑输入上下文和中间生成状态,并利用与最终输出质量直接相关的奖励进行训练。

链接: https://arxiv.org/abs/2506.00439
作者: Yuqian Fu,Yuanheng Zhu,Jiajun Chai,Guojun Yin,Wei Lin,Qichao Zhang,Dongbin Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose Reinforcement Learning-Assisted Ensemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces a RL agent that dynamically adjusts ensemble weights by considering both input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single-agent and multi-agent reinforcement learning algorithms ( \textRLAE_\textPPO and \textRLAE_\textMAPPO ), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to 3.3% accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower time latency.
zh

[AI-172] Is Your Explanation Reliable: Confidence-Aware Explanation on Graph Neural Networks KDD25 KDD

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)预测解释的可靠性问题,特别是在分布外或未知测试数据集上的解释可信度不确定。其解决方案的关键在于提出一个基于理论原则的解释框架ConfExplainer,该框架引入了置信度评分模块,其核心是广义图信息瓶颈与置信度约束(Generalized Graph Information Bottleneck with Confidence Constraint, GIB-CC),用于量化生成解释的可靠性。

链接: https://arxiv.org/abs/2506.00437
作者: Jiaxing Zhang,Xiaoou Liu,Dongsheng Luo,Hua Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD25)

点击查看摘要

Abstract:Explaining Graph Neural Networks (GNNs) has garnered significant attention due to the need for interpretability, enabling users to understand the behavior of these black-box models better and extract valuable insights from their predictions. While numerous post-hoc instance-level explanation methods have been proposed to interpret GNN predictions, the reliability of these explanations remains uncertain, particularly in the out-of-distribution or unknown test datasets. In this paper, we address this challenge by introducing an explainer framework with the confidence scoring module ( ConfExplainer), grounded in theoretical principle, which is generalized graph information bottleneck with confidence constraint (GIB-CC), that quantifies the reliability of generated explanations. Experimental results demonstrate the superiority of our approach, highlighting the effectiveness of the confidence score in enhancing the trustworthiness and robustness of GNN explanations.
zh

[AI-173] Learning from Double Positive and Unlabeled Data for Potential-Customer Identification

【速读】:该论文试图解决在目标营销中识别潜在客户的问题,特别是在仅有购买者数据的情况下,如何有效区分对产品感兴趣但对公司忠诚度较低的客户。解决方案的关键在于提出一种称为“双PU学习(double PU learning)”的方法,该方法通过单阶段优化构建一个分类器,能够同时捕捉两个目标:(i) 识别对产品感兴趣的人群,以及 (ii) 排除对公司具有强忠诚度的用户。该方法的损失函数隐含了来自标准PU学习设置的两种损失,从而实现了更高效的营销策略。

链接: https://arxiv.org/abs/2506.00436
作者: Masahiro Kato,Yuki Ikeda abd Kentaro Baba,Takashi Imai,Ryo Inokuchi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
备注: Accepted for publication in the Proceedings of IIAI AAI 2025

点击查看摘要

Abstract:In this study, we propose a method for identifying potential customers in targeted marketing by applying learning from positive and unlabeled data (PU learning). We consider a scenario in which a company sells a product and can observe only the customers who purchased it. Decision-makers seek to market products effectively based on whether people have loyalty to the company. Individuals with loyalty are those who are likely to remain interested in the company even without additional advertising. Consequently, those loyal customers would likely purchase from the company if they are interested in the product. In contrast, people with lower loyalty may overlook the product or buy similar products from other companies unless they receive marketing attention. Therefore, by focusing marketing efforts on individuals who are interested in the product but do not have strong loyalty, we can achieve more efficient marketing. To achieve this goal, we consider how to learn, from limited data, a classifier that identifies potential customers who (i) have interest in the product and (ii) do not have loyalty to the company. Although our algorithm comprises a single-stage optimization, its objective function implicitly contains two losses derived from standard PU learning settings. For this reason, we refer to our approach as double PU learning. We verify the validity of the proposed algorithm through numerical experiments, confirming that it functions appropriately for the problem at hand.
zh

[AI-174] Channel Normalization for Time Series Channel Identification ICML2025

【速读】:该论文旨在解决时间序列(Time Series, TS)建模中通道可辨识性(Channel Identifiability, CID)不足的问题,即模型无法区分不同通道的特性,导致相同输入产生相同输出。解决方案的关键在于提出一种名为通道归一化(Channel Normalization, CN)的简单而有效的归一化策略,通过为每个通道分配独立的仿射变换参数来增强通道可辨识性。

链接: https://arxiv.org/abs/2506.00432
作者: Seunghan Lee,Taeyoung Park,Kibok Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICML 2025

点击查看摘要

Abstract:Channel identifiability (CID) refers to the ability to distinguish between individual channels in time series (TS) modeling. The absence of CID often results in producing identical outputs for identical inputs, disregarding channel-specific characteristics. In this paper, we highlight the importance of CID and propose Channel Normalization (CN), a simple yet effective normalization strategy that enhances CID by assigning distinct affine transformation parameters to each channel. We further extend CN in two ways: 1) Adaptive CN (ACN) dynamically adjusts parameters based on the input TS, improving adaptability in TS models, and 2) Prototypical CN (PCN) introduces a set of learnable prototypes instead of per-channel parameters, enabling applicability to datasets with unknown or varying number of channels and facilitating use in TS foundation models. We demonstrate the effectiveness of CN and its variants by applying them to various TS models, achieving significant performance gains for both non-CID and CID models. In addition, we analyze the success of our approach from an information theory perspective. Code is available at this https URL.
zh

[AI-175] MIRROR: Cognitive Inner Monologue Between Conversational Turns for Persistent Reflection and Reasoning in Conversational LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话中面临的三个关键问题:谄媚行为(sycophancy)、对关键信息的注意力缺陷(attentional deficits)以及在冲突约束下的优先级不一致(inconsistent prioritization)。其解决方案的关键在于提出MIRROR(Modular Internal Reasoning, Reflection, Orchestration, and Response)认知架构,该架构通过分层设计实现并行推理能力,包括负责协调认知维度的Thinker模块和基于整合叙述生成上下文感知响应的Talker模块,从而提升模型在复杂、安全敏感场景下的表现。

链接: https://arxiv.org/abs/2506.00430
作者: Nicole Hsing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human intelligence relies on inner monologue to process complex information through simultaneous reflection, memory retrieval, and response formulation. We introduce MIRROR (Modular Internal Reasoning, Reflection, Orchestration, and Response), a cognitive architecture that systematically implements these parallel reasoning capabilities in large language models. MIRROR operates as a unified system with two distinct functional layers: the Thinker and the Talker. The Thinker encompasses: (1) the Inner Monologue Manager, coordinating reasoning threads across cognitive dimensions (Goals, Reasoning, and Memory); and (2) the Cognitive Controller, synthesizing these threads into a coherent internal narrative maintained across conversation turns. The Talker component then leverages this integrated narrative for context-aware responses. Evaluated on the CuRaTe benchmark–testing personalized dialogue with safety-critical constraints, conflicting preferences, and multi-turn consistency–LLMs utilizing the MIRROR architecture achieve up to 156% relative improvement in critical safety scenarios involving three persons with conflicting preferences, maintaining an average accuracy of ~80% on all scenarios. Across scenario-specific comparisons, GPT-4o, Gemini 1.5 Pro, Claude 3.7 Sonnet, Llama 4 variants, and Mistral 3 variants with the MIRROR architecture outperformed baseline models by 21% on average (15.5 percentage points absolute). MIRROR directly addresses three critical LLM failure modes: sycophancy, attentional deficits to critical information, and inconsistent prioritization of conflicting constraints. This work bridges cognitive science and AI by implementing modular internal reasoning inspired by human cognition, creating a persistent internal model that significantly enhances multi-turn conversation capabilities.
zh

[AI-176] COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning

【速读】:该论文旨在解决稀疏张量程序在专用硬件加速器上的优化难题,特别是由于稀疏输入的敏感性以及早期加速器依赖昂贵模拟器所带来的挑战。现有基于机器学习的成本模型在通用硬件上表现良好,但在早期加速器上效果不佳,因为其需要大量数据进行训练。解决方案的关键在于提出COGNATE框架,该框架利用通用硬件(如CPU)的低成本数据样本训练成本模型,并在新兴硬件上进行少量微调,从而有效减少数据需求并提升模型适应性。

链接: https://arxiv.org/abs/2506.00424
作者: Chamika Sudusinghe,Gerasimos Gerogiannis Damitha Lenadora,Charles Block,Josep Torrellas,Charith Mendis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
备注: Accepted at the 42nd International Conference on Machine Learning

点击查看摘要

Abstract:Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is challenging for two reasons: program performance is highly sensitive to variations in sparse inputs, and early-stage accelerators rely on expensive simulators. Therefore, ML-based cost models used for optimizing such programs on general-purpose hardware are often ineffective for early-stage accelerators, as they require large datasets for proper training. To this end, we introduce COGNATE, a novel framework that leverages inexpensive data samples from general-purpose hardware (e.g., CPUs) to train cost models, followed by few-shot fine-tuning on emerging hardware. COGNATE exploits the homogeneity of input features across hardware platforms while effectively mitigating heterogeneity, enabling cost model training with just 5% of the data samples needed by accelerator-specific models to achieve comparable performance. We conduct extensive experiments to demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47x (up to 5.46x) for SpMM and 1.39x (up to 4.22x) for SDDMM.
zh

[AI-177] A New Spatiotemporal Correlation Anomaly Detection Method that Integrates Contrastive Learning and Few-Shot Learning in Wireless Sensor Networks

【速读】:该论文旨在解决无线传感器网络(Wireless Sensor Networks, WSNs)中异常检测面临的挑战,包括时空相关特征提取有限、样本标签缺失、异常样本稀少以及样本分布不平衡等问题。其解决方案的关键在于提出一种考虑模型架构和双阶段训练策略的时空相关性检测模型(MTAD-RD),该模型通过引入增强的RetNet结构、多粒度特征融合模块和图注意力网络模块,有效提取节点间的时空相关特征,并结合双阶段训练策略,利用对比学习和基于缓存的样本采样方法,解决了标签缺失和样本不平衡的问题,从而显著提升了异常检测性能。

链接: https://arxiv.org/abs/2506.00420
作者: Miao Ye,Suxiao Wang,Jiaguang Han,Yong Wang,Xiaoli Wang,Jingxuan Wei,Peng Wen,Jing Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting anomalies in the data collected by WSNs can provide crucial evidence for assessing the reliability and stability of WSNs. Existing methods for WSN anomaly detection often face challenges such as the limited extraction of spatiotemporal correlation features, the absence of sample labels, few anomaly samples, and an imbalanced sample distribution. To address these issues, a spatiotemporal correlation detection model (MTAD-RD) considering both model architecture and a two-stage training strategy perspective is proposed. In terms of model structure design, the proposed MTAD-RD backbone network includes a retentive network (RetNet) enhanced by a cross-retention (CR) module, a multigranular feature fusion module, and a graph attention network module to extract internode correlation information. This proposed model can integrate the intermodal correlation features and spatial features of WSN neighbor nodes while extracting global information from time series data. Moreover, its serialized inference characteristic can remarkably reduce inference overhead. For model training, a two-stage training approach was designed. First, a contrastive learning proxy task was designed for time series data with graph structure information in WSNs, enabling the backbone network to learn transferable features from unlabeled data using unsupervised contrastive learning methods, thereby addressing the issue of missing sample labels in the dataset. Then, a caching-based sample sampler was designed to divide samples into few-shot and contrastive learning data. A specific joint loss function was developed to jointly train the dual-graph discriminator network to address the problem of sample imbalance effectively. In experiments carried out on real public datasets, the designed MTAD-RD anomaly detection method achieved an F1 score of 90.97%, outperforming existing supervised WSN anomaly detection methods.
zh

[AI-178] World Models for Cognitive Agents : Transforming Edge Intelligence in Future Networks

【速读】:该论文旨在解决在数据受限或安全关键场景下,智能体如何高效进行预测、规划与决策的问题。其解决方案的关键在于构建世界模型(world models),通过学习潜在动态来生成环境的内部表示,从而实现样本高效的学习框架。该框架特别适用于无线边缘智能优化,论文进一步提出了Wireless Dreamer,一种基于世界模型的强化学习框架,以提升低空无线网络中的学习效率和决策质量。

链接: https://arxiv.org/abs/2506.00417
作者: Changyuan Zhao,Ruichen Zhang,Jiacheng Wang,Gaosheng Zhao,Dusit Niyato,Geng Sun,Shiwen Mao,Dong In Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:World models are emerging as a transformative paradigm in artificial intelligence, enabling agents to construct internal representations of their environments for predictive reasoning, planning, and decision-making. By learning latent dynamics, world models provide a sample-efficient framework that is especially valuable in data-constrained or safety-critical scenarios. In this paper, we present a comprehensive overview of world models, highlighting their architecture, training paradigms, and applications across prediction, generation, planning, and causal reasoning. We compare and distinguish world models from related concepts such as digital twins, the metaverse, and foundation models, clarifying their unique role as embedded cognitive engines for autonomous agents. We further propose Wireless Dreamer, a novel world model-based reinforcement learning framework tailored for wireless edge intelligence optimization, particularly in low-altitude wireless networks (LAWNs). Through a weather-aware UAV trajectory planning case study, we demonstrate the effectiveness of our framework in improving learning efficiency and decision quality.
zh

[AI-179] Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety

【速读】:该论文试图解决如何使大型语言模型(Large Language Models, LLMs)与人类价值观保持对齐的问题,特别是在确保其安全性、有益性和伦理合理性方面。论文提出的解决方案关键在于引入广义反思平衡方法(Method of Wide Reflective Equilibrium, MWRE),该方法通过协调经过深思熟虑的道德判断、指导性道德原则和相关背景理论之间的连贯性,为LLM对齐提供了一个更稳健的框架。MWRE强调原则的动态双向修订过程及其程序合法性,从而增强了对齐过程的可修订性、合法性和伦理基础,相较于现有的基础主义模型或简单的输入-输出评估,能够更准确地反映LLM对齐的复杂现实。

链接: https://arxiv.org/abs/2506.00415
作者: Matthew Brophy
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 24 pages excluding references, 3 tables

点击查看摘要

Abstract:As large language models (LLMs) become more powerful and pervasive across society, ensuring these systems are beneficial, safe, and aligned with human values is crucial. Current alignment techniques, like Constitutional AI (CAI), involve complex iterative processes. This paper argues that the Method of Wide Reflective Equilibrium (MWRE) – a well-established coherentist moral methodology – offers a uniquely apt framework for understanding current LLM alignment efforts. Moreover, this methodology can substantively augment these processes by providing concrete pathways for improving their dynamic revisability, procedural legitimacy, and overall ethical grounding. Together, these enhancements can help produce more robust and ethically defensible outcomes. MWRE, emphasizing the achievement of coherence between our considered moral judgments, guiding moral principles, and relevant background theories, arguably better represents the intricate reality of LLM alignment and offers a more robust path to justification than prevailing foundationalist models or simplistic input-output evaluations. While current methods like CAI bear a structural resemblance to MWRE, they often lack its crucial emphasis on dynamic, bi-directional revision of principles and the procedural legitimacy derived from such a process. While acknowledging various disanalogies (e.g., consciousness, genuine understanding in LLMs), the paper demonstrates that MWRE serves as a valuable heuristic for critically analyzing current alignment efforts and for guiding the future development of more ethically sound and justifiably aligned AI systems.
zh

[AI-180] LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

【速读】:该论文旨在解决长时域具身任务中高阶任务规划与低阶运动控制协同不足的问题,此类任务需要多步骤解决方案以实现复杂目标。其解决方案的关键在于提出一种统一的视觉语言动作框架(LoHoVLA),该框架利用预训练的视觉语言模型(VLM)作为核心,联合生成语言和动作标记以分别完成子任务生成和机器人动作预测,从而促进跨任务的更好泛化能力。此外,LoHoVLA采用分层闭环控制机制,以减少来自高阶规划和低阶控制的误差,提升整体性能。

链接: https://arxiv.org/abs/2506.00411
作者: Yi Yang,Jiaxuan Sun,Siqi Kou,Yihan Wang,Zhijie Deng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision language action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA embraces a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator, containing 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.
zh

[AI-181] Bias as a Virtue: Rethinking Generalization under Distribution Shifts

【速读】:该论文试图解决机器学习模型在部署到与训练数据分布不同的数据时性能下降的问题,即提升模型的跨分布泛化能力。解决方案的关键在于提出自适应分布桥(Adaptive Distribution Bridge, ADB)框架,通过在训练过程中引入受控的统计多样性,使模型发展出有效的偏差特征,从而实现更好的分布外泛化能力。研究发现,更高的分布内(in-distribution, ID)偏差反而能降低分布外(out-of-distribution, OOD)误差,这一结论挑战了传统以最小化验证误差为目标的实践。

链接: https://arxiv.org/abs/2506.00407
作者: Ruixuan Chen,Wentao Li,Jiahui Xiao,Yuchen Li,Yimin Tang,Xiaonan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 14 pages

点击查看摘要

Abstract:Machine learning models often degrade when deployed on data distributions different from their training data. Challenging conventional validation paradigms, we demonstrate that higher in-distribution (ID) bias can lead to better out-of-distribution (OOD) generalization. Our Adaptive Distribution Bridge (ADB) framework implements this insight by introducing controlled statistical diversity during training, enabling models to develop bias profiles that effectively generalize across distributions. Empirically, we observe a robust negative correlation where higher ID bias corresponds to lower OOD error–a finding that contradicts standard practices focused on minimizing validation error. Evaluation on multiple datasets shows our approach significantly improves OOD generalization. ADB achieves robust mean error reductions of up to 26.8% compared to traditional cross-validation, and consistently identifies high-performing training strategies, evidenced by percentile ranks often exceeding 74.4%. Our work provides both a practical method for improving generalization and a theoretical framework for reconsidering the role of bias in robust machine learning.
zh

[AI-182] Position: Olfaction Standardization is Essential for the Advancement of Embodied Artificial Intelligence

【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)系统在感知能力上的不完整性问题,特别是对嗅觉(olfaction)感知的忽视。尽管视觉、听觉和语言等模态已得到充分研究,但嗅觉作为高带宽、进化上至关重要的感知方式,却因科学理论未明确、传感器技术异质性、缺乏标准化数据集及评估基准等问题而被长期忽略。论文指出,这一缺失阻碍了构建真正具身化(embodied)和符合伦理的超人类智能的发展。解决方案的关键在于AI领域需加大对嗅觉研究的投入,并通过跨学科合作(涵盖神经科学、机器人学、机器学习和伦理学)来建立嗅觉基准、开发多模态数据集,并定义机器理解、导航和作用于人类环境所需的感官能力。

链接: https://arxiv.org/abs/2506.00398
作者: Kordel K. France,Rohith Peddi,Nik Dennler,Ovidiu Daescu
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Despite extraordinary progress in artificial intelligence (AI), modern systems remain incomplete representations of human cognition. Vision, audition, and language have received disproportionate attention due to well-defined benchmarks, standardized datasets, and consensus-driven scientific foundations. In contrast, olfaction - a high-bandwidth, evolutionarily critical sense - has been largely overlooked. This omission presents a foundational gap in the construction of truly embodied and ethically aligned super-human intelligence. We argue that the exclusion of olfactory perception from AI architectures is not due to irrelevance but to structural challenges: unresolved scientific theories of smell, heterogeneous sensor technologies, lack of standardized olfactory datasets, absence of AI-oriented benchmarks, and difficulty in evaluating sub-perceptual signal processing. These obstacles have hindered the development of machine olfaction despite its tight coupling with memory, emotion, and contextual reasoning in biological systems. In this position paper, we assert that meaningful progress toward general and embodied intelligence requires serious investment in olfactory research by the AI community. We call for cross-disciplinary collaboration - spanning neuroscience, robotics, machine learning, and ethics - to formalize olfactory benchmarks, develop multimodal datasets, and define the sensory capabilities necessary for machines to understand, navigate, and act within human environments. Recognizing olfaction as a core modality is essential not only for scientific completeness, but for building AI systems that are ethically grounded in the full scope of the human experience.
zh

[AI-183] MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

【速读】:该论文旨在解决现有神经音频编解码器在优化重建质量时,往往牺牲了编码后标记的下游模型可操作性的问题。其解决方案的关键在于提出一种基于单层流式Transformer的音频编解码器——MagiCodec,该编解码器通过多阶段训练流程引入高斯噪声注入和潜在正则化,以增强生成代码的语义表达能力,同时保持高重建保真度。

链接: https://arxiv.org/abs/2506.00385
作者: Yakun Song,Jiawei Chen,Xiaobin Zhuang,Chenpeng Du,Ziyang Ma,Jian Wu,Jian Cong,Dongya Jia,Zhuo Chen,Yuping Wang,Yuxuan Wang,Xie Chen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 18 pages, 3 figures. The code and pre-trained models are available at this https URL

点击查看摘要

Abstract:Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce \textbfMagiCodec , a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at this https URL.
zh

[AI-184] textttAVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time ATC

【速读】:该论文旨在解决音频-视觉模型在测试时对分布偏移(distributional shifts)的鲁棒性不足的问题,现有基准主要针对单一模态,无法全面评估多模态模型的鲁棒性。其解决方案的关键在于提出一个全面的基准测试集 \textttAVROBUSTBENCH,包含四个音频-视觉数据集,每个数据集引入了75种共现且相关的双模态噪声,以模拟真实场景中的分布偏移,并通过实验验证了当前最先进的监督与自监督模型在噪声严重时鲁棒性下降的现象,同时提出了一种简单的测试时自适应(TTA)方法 \textttAV2C,通过惩罚高熵样本实现跨模态融合,从而提升模型在特定数据集上的性能。

链接: https://arxiv.org/abs/2506.00358
作者: Sarthak Kumar Maharana,Saksham Singh Kushwaha,Baoming Zhang,Adrian Rodriguez,Songtao Wei,Yapeng Tian,Yunhui Guo
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Under review. For uniformity, all TTA experiments are done with a batch size of 16

点击查看摘要

Abstract:While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur \textitsimultaneously in both audio and visual modalities, we introduce \textttAVROBUSTBENCH , a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. \textttAVROBUSTBENCH comprises four audio-visual benchmark datasets, \textttAUDIOSET-2C , \textttVGGSOUND-2C , \textttKINETICS-2C , and \textttEPICKITCHENS-2C , each incorporating 75 bimodal audio-visual corruptions that are \textitco-occurring and \textitcorrelated . Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on \textttVGGSOUND-2C and \textttKINETICS-2C , offer minimal improvements in performance under bimodal corruptions. We further propose \textttAV2C , a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on \textttVGGSOUND-2C . We hope that \textttAVROBUSTBENCH will steer the development of more effective and robust audio-visual TTA approaches. Our code is available \hrefthis https URLhere .
zh

[AI-185] Exploring the Performance of Perforated Backpropagation through Further Experiments

【速读】:该论文试图解决神经网络模型的效率与性能之间的平衡问题,即如何在不牺牲模型精度的前提下提升模型的压缩能力或增强模型的准确性。其解决方案的关键在于Perforated Backpropagation技术,该技术基于对生物神经元中树突计算重要性的现代理解,通过优化反向传播过程中的计算路径,实现模型的高效训练与部署。

链接: https://arxiv.org/abs/2506.00356
作者: Rorry Brenner,Evan Davis,Rushi Chaudhari,Rowan Morse,Jingyao Chen,Xirui Liu,Zhaoyi You,Laurent Itti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, 1 table

点击查看摘要

Abstract:Perforated Backpropagation is a neural network optimization technique based on modern understanding of the computational importance of dendrites within biological neurons. This paper explores further experiments from the original publication, generated from a hackathon held at the Carnegie Mellon Swartz Center in February 2025. Students and local Pittsburgh ML practitioners were brought together to experiment with the Perforated Backpropagation algorithm on the datasets and models which they were using for their projects. Results showed that the system could enhance their projects, with up to 90% model compression without negative impact on accuracy, or up to 16% increased accuracy of their original models.
zh

[AI-186] Enabling Secure and Ephemeral AI Workloads in Data Mesh Environments

【速读】:该论文试图解决大型企业在高度受控和复杂的ICT环境中,无法高效支持数据与AI团队快速构建和销毁自助式数据与计算基础设施的问题,从而难以实验新的数据分析工具并部署数据产品。解决方案的关键在于提出一种按需自助式数据平台基础设施,以赋能去中心化的数据团队基于中心化的模板、策略和治理构建数据产品。核心创新是采用不可变容器操作系统和基础设施即代码方法,高效地在本地和任何云环境中从零开始创建供应商中立且短暂的Kubernetes集群。

链接: https://arxiv.org/abs/2506.00352
作者: Chinkit Patel,Kee Siong Ng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 52 pages

点击查看摘要

Abstract:Many large enterprises that operate highly governed and complex ICT environments have no efficient and effective way to support their Data and AI teams in rapidly spinning up and tearing down self-service data and compute infrastructure, to experiment with new data analytic tools, and deploy data products into operational use. This paper proposes a key piece of the solution to the overall problem, in the form of an on-demand self-service data-platform infrastructure to empower de-centralised data teams to build data products on top of centralised templates, policies and governance. The core innovation is an efficient method to leverage immutable container operating systems and infrastructure-as-code methodologies for creating, from scratch, vendor-neutral and short-lived Kubernetes clusters on-premises and in any cloud environment. Our proposed approach can serve as a repeatable, portable and cost-efficient alternative or complement to commercial Platform-as-a-Service (PaaS) offerings, and this is particularly important in supporting interoperability in complex data mesh environments with a mix of modern and legacy compute infrastructure.
zh

[AI-187] BASIL: Best-Action Symbolic Interpretable Learning for Evolving Compact RL Policies

【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning)中策略不透明的问题,这限制了其在安全关键型应用中的部署。为了解决这一问题,作者提出了BASIL(Best-Action Symbolic Interpretable Learning),其关键在于通过在线进化搜索结合质量-多样性(Quality-Diversity, QD)优化,生成符号化、基于规则的策略。BASIL将策略表示为状态变量上的符号谓词有序列表,确保策略的完全可解释性,并通过QD存档促进高性能解之间的行为和结构多样性,同时利用复杂度感知的适应度函数合成紧凑表示。

链接: https://arxiv.org/abs/2506.00328
作者: Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The quest for interpretable reinforcement learning is a grand challenge for the deployment of autonomous decision-making systems in safety-critical applications. Modern deep reinforcement learning approaches, while powerful, tend to produce opaque policies that compromise verification, reduce transparency, and impede human oversight. To address this, we introduce BASIL (Best-Action Symbolic Interpretable Learning), a systematic approach for generating symbolic, rule-based policies via online evolutionary search with quality-diversity (QD) optimization. BASIL represents policies as ordered lists of symbolic predicates over state variables, ensuring full interpretability and tractable policy complexity. By using a QD archive, the methodology in the proposed study encourages behavioral and structural diversity between top-performing solutions, while a complexity-aware fitness encourages the synthesis of compact representations. The evolutionary system supports the use of exact constraints for rule count and system adaptability for balancing transparency with expressiveness. Empirical comparisons with three benchmark tasks CartPole-v1, MountainCar-v0, and Acrobot-v1 show that BASIL consistently synthesizes interpretable controllers with compact representations comparable to deep reinforcement learning baselines. Herein, this article introduces a new interpretable policy synthesis method that combines symbolic expressiveness, evolutionary diversity, and online learning through a unifying framework.
zh

[AI-188] dpmm: Differentially Private Marginal Models a Library for Synthetic Tabular Data Generation

【速读】:该论文旨在解决在生成合成数据时如何提供严格的差分隐私(Differential Privacy, DP)保障的问题。其解决方案的关键在于提出一个开源库dpmm,该库集成了三种流行的边缘模型——PrivBayes、MST和AIM,这些模型在实用性和功能丰富性方面优于其他实现,并通过采用最佳实践来确保端到端的DP保证,同时应对已知的DP相关漏洞。

链接: https://arxiv.org/abs/2506.00322
作者: Sofiane Mahiou,Amir Dizche,Reza Nazari,Xinmin Wu,Ralph Abbey,Jorge Silva,Georgi Ganev
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Theory and Practice of Differential Privacy Workshop (TPDP 2025)

点击查看摘要

Abstract:We propose dpmm, an open-source library for synthetic data generation with Differentially Private (DP) guarantees. It includes three popular marginal models – PrivBayes, MST, and AIM – that achieve superior utility and offer richer functionality compared to alternative implementations. Additionally, we adopt best practices to provide end-to-end DP guarantees and address well-known DP-related vulnerabilities. Our goal is to accommodate a wide audience with easy-to-install, highly customizable, and robust model implementations. Our codebase is available from this https URL. Comments: Accepted to the Theory and Practice of Differential Privacy Workshop (TPDP 2025) Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2506.00322 [cs.CR] (or arXiv:2506.00322v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.00322 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-189] Evaluation of LLM s for mathematical problem solving

【速读】:该论文旨在探讨大型语言模型(Large Language Models, LLMs)在解决数学问题方面的潜力,并评估其在不同复杂度数学数据集上的表现。研究的关键在于采用基于结构化思维链(Structured Chain-of-Thought, SCoT)框架的五维评估方法,从最终答案正确性、步骤完整性、步骤有效性、中间计算准确性和问题理解五个方面全面分析模型性能。通过这一系统性评估,研究揭示了不同模型在数学推理任务中的优势与局限性。

链接: https://arxiv.org/abs/2506.00309
作者: Ruonan Wang,Runxi Wang,Yunwen Shen,Chengfeng Wu,Qinglin Zhou,Rohitash Chandra
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but are still understudied for their potential to solve mathematical problems. In this study, we compare three prominent LLMs, including GPT-4o, DeepSeek-V3, and Gemini-2.0, on three mathematics datasets of varying complexities (GSM8K, MATH500, and UNSW datasets). We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results show that GPT-4o is the most stable and consistent in performance across all the datasets, but particularly it performs outstandingly in high-level questions of the UNSW dataset. DeepSeek-V3 is competitively strong in well-structured domains such as optimisation, but suffers from fluctuations in accuracy in statistical inference tasks. Gemini-2.0 shows strong linguistic understanding and clarity in well-structured problems but performs poorly in multi-step reasoning and symbolic logic. Our error analysis reveals particular deficits in each model: GPT-4o is at times lacking in sufficient explanation or precision; DeepSeek-V3 leaves out intermediate steps; and Gemini-2.0 is less flexible in mathematical reasoning in higher dimensions.
zh

[AI-190] Improving Protein Sequence Design through Designability Preference Optimization

【速读】:该论文试图解决蛋白质序列设计中设计可实现性(designability)不足的问题,即生成的序列未必能折叠成目标结构。解决方案的关键在于重新定义训练目标,通过引入直接偏好优化(Direct Preference Optimization, DPO),利用AlphaFold pLDDT分数作为偏好信号,引导序列生成向高设计可实现性方向优化。进一步地,提出残基级设计可实现性偏好优化(Residue-level Designability Preference Optimization, ResiDPO),在残基层面应用结构奖励并解耦残基间的优化过程,从而在保持高性能区域的同时提升整体设计可实现性。

链接: https://arxiv.org/abs/2506.00297
作者: Fanglei Xue,Andrew Kubaney,Zhichun Guo,Joseph K. Min,Ge Liu,Yi Yang,David Baker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Protein sequence design methods have demonstrated strong performance in sequence generation for de novo protein design. However, as the training objective was sequence recovery, it does not guarantee designability–the likelihood that a designed sequence folds into the desired structure. To bridge this gap, we redefine the training objective by steering sequence generation toward high designability. To do this, we integrate Direct Preference Optimization (DPO), using AlphaFold pLDDT scores as the preference signal, which significantly improves the in silico design success rate. To further refine sequence generation at a finer, residue-level granularity, we introduce Residue-level Designability Preference Optimization (ResiDPO), which applies residue-level structural rewards and decouples optimization across residues. This enables direct improvement in designability while preserving regions that already perform well. Using a curated dataset with residue-level annotations, we fine-tune LigandMPNN with ResiDPO to obtain EnhancedMPNN, which achieves a nearly 3-fold increase in in silico design success rate (from 6.56% to 17.57%) on a challenging enzyme design benchmark.
zh

[AI-191] Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

【速读】:该论文旨在解决在具有递归熵风险偏好(risk-parameter β0\beta \neq 0)的折扣马尔可夫决策过程(MDP)中,学习最优状态-动作值函数 QQ^* 和最优策略 π\pi^* 的样本复杂性问题。其解决方案的关键是提出一种基于模型的风险敏感 Q 值迭代方法(model-based risk-sensitive Q-value-iteration, MB-RS-QVI),该方法在存在生成模型的情况下,能够提供关于 $ |Q^-Q^k| $ 和 $ |V^-V^\pi_k| $ 的 (ϵ,δ)(\epsilon,\delta)-PAC 界,并揭示了这些界在有效时间跨度 11γ\frac{1}{1-\gamma} 和风险敏感度 β|\beta| 上的指数依赖性。

链接: https://arxiv.org/abs/2506.00286
作者: Oliver Mortensen,Mohammad Sadegh Talebi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In this paper we analyze the sample complexities of learning the optimal state-action value function Q^* and an optimal policy \pi^* in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter \beta\neq 0 and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive Q -value-iteration (MB-RS-QVI) which leads to (\epsilon,\delta) -PAC-bounds on |Q^-Q^k| , and |V^-V^\pi_k| where Q_k is the output of MB-RS-QVI after k iterations and \pi_k is the greedy policy with respect to Q_k . Both PAC-bounds have exponential dependence on the effective horizon \frac11-\gamma and the strength of this dependence grows with the learners risk-sensitivity |\beta| . We also provide two lower bounds which shows that exponential dependence on |\beta|\frac11-\gamma is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are both tight in \varepsilon and \delta and that the PAC-bound on Q -learning is tight in the number of actions A , and that the PAC-bound on policy-learning is nearly tight in A .
zh

[AI-192] Adversarial Threat Vectors and Risk Mitigation for Retrieval-Augmented Generation Systems

【速读】:该论文旨在解决Retrieval-Augmented Generation (RAG)系统在集成大型语言模型(LLMs)与外部知识源时所面临的多种对抗性攻击向量问题,包括提示注入、数据污染和对抗性查询操作。论文的关键解决方案是通过风险管理体系分析这些威胁,并提出一个优先级控制清单,其中包括输入验证、对抗训练和实时监控等风险缓解措施。

链接: https://arxiv.org/abs/2506.00281
作者: Chris M. Ward,Josh Harguess
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: SPIE DCS: Proceedings Volume Assurance and Security for AI-enabled Systems 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems, which integrate Large Language Models (LLMs) with external knowledge sources, are vulnerable to a range of adversarial attack vectors. This paper examines the importance of RAG systems through recent industry adoption trends and identifies the prominent attack vectors for RAG: prompt injection, data poisoning, and adversarial query manipulation. We analyze these threats under risk management lens, and propose robust prioritized control list that includes risk-mitigating actions like input validation, adversarial training, and real-time monitoring.
zh

[AI-193] Sleep Brain and Cardiac Activity Predict Cognitive Flexibility and Conceptual Reasoning Using Deep Learning

【速读】:该论文试图解决睡眠微结构与人类在特定认知领域表现之间关系的探索不足问题,特别是如何通过生理过程预测执行功能,如认知适应性和概念推理。其解决方案的关键在于提出CogPSGFormer模型,该模型是一种多尺度卷积-Transformer架构,能够处理多模态的多导睡眠图(polysomnographic)数据,整合单通道心电图(ECG)和脑电图(EEG)信号及提取的特征,以捕捉跨模态的互补信息,并优化对长时间睡眠信号的处理。

链接: https://arxiv.org/abs/2506.00279
作者: Boshra Khajehpiri,Eric Granger,Massimiliano de Zambotti,Fiona C. Baker,Mohamad Forouzanfar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work was accepted for publication in IEEE EMBC 2025

点击查看摘要

Abstract:Despite extensive research on the relationship between sleep and cognition, the connection between sleep microstructure and human performance across specific cognitive domains remains underexplored. This study investigates whether deep learning models can predict executive functions, particularly cognitive adaptability and conceptual reasoning from physiological processes during a night’s sleep. To address this, we introduce CogPSGFormer, a multi-scale convolutional-transformer model designed to process multi-modal polysomnographic data. This model integrates one-channel ECG and EEG signals along with extracted features, including EEG power bands and heart rate variability parameters, to capture complementary information across modalities. A thorough evaluation of the CogPSGFormer architecture was conducted to optimize the processing of extended sleep signals and identify the most effective configuration. The proposed framework was evaluated on 817 individuals from the STAGES dataset using cross-validation. The model achieved 80.3% accuracy in classifying individuals into low vs. high cognitive performance groups on unseen data based on Penn Conditional Exclusion Test (PCET) scores. These findings highlight the effectiveness of our multi-scale feature extraction and multi-modal learning approach in leveraging sleep-derived signals for cognitive performance prediction. To facilitate reproducibility, our code is publicly accessible (this https URL).
zh

[AI-194] Chances and Challenges of the Model Context Protocol in Digital Forensics and Incident Response

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在法医调查中广泛应用所面临的透明度、可解释性和可复现性不足的问题。其解决方案的关键在于引入新兴的模型上下文协议(Model Context Protocol, MCP),通过该协议在不同法医场景中实现LLMs的集成,从而增强现有法医工作流程,并拓展LLMs在法医领域的应用范围。MCP通过设计选择对模型行为进行有意约束,提升了审计性和可追溯性,为构建更透明、可复现且具有法律效力的LLM辅助法医工作流程提供了基础。

链接: https://arxiv.org/abs/2506.00274
作者: Jan-Niclas Hilgert,Carlo Jakobs,Michael Külper,Martin Lambertz,Axel Mahr,Elmar Padilla
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models hold considerable promise for supporting forensic investigations, but their widespread adoption is hindered by a lack of transparency, explainability, and reproducibility. This paper explores how the emerging Model Context Protocol can address these challenges and support the meaningful use of LLMs in digital forensics. Through a theoretical analysis, we examine how MCP can be integrated across various forensic scenarios - ranging from artifact analysis to the generation of interpretable reports. We also outline both technical and conceptual considerations for deploying an MCP server in forensic environments. Our analysis reveals a wide range of use cases in which MCP not only strengthens existing forensic workflows but also facilitates the application of LLMs to areas of forensics where their use was previously limited. Furthermore, we introduce the concept of the inference constraint level - a way of characterizing how specific MCP design choices can deliberately constrain model behavior, thereby enhancing both auditability and traceability. Our insights demonstrate that MCP has significant potential as a foundational component for developing LLM-assisted forensic workflows that are not only more transparent, reproducible, and legally defensible, but also represent a step toward increased automation in digital forensic analysis. However, we also highlight potential challenges that the adoption of MCP may pose for digital forensics in the future.
zh

[AI-195] Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在开放性、现实环境中处理隐式推理场景时存在的问题,即模型在面对缺失对象、矛盾事实、模糊指代或不可行任务时,难以识别隐含错误,导致性能下降。解决方案的关键在于通过简单的推理阶段干预,如谨慎的角色提示和尤其要求澄清性问题,以恢复模型的隐式推理能力,从而提升其在非约束环境中的可信度。

链接: https://arxiv.org/abs/2506.00258
作者: Qianqi Yan,Hongquan Li,Shan Jiang,Yang Zhao,Xinze Guan,Ching-Chen Kuo,Xin Eric Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments where inputs are messy, underspecified, and not always trustworthy. Unlike curated benchmarks, these settings frequently involve instructions that refer to missing objects or contradictory facts, rely on ambiguous references, or request infeasible actions. In such cases, success hinges not on task execution alone, but on a model’s ability to detect when something is silently wrong. This paper presents a systematic analysis of how current MLLMs handle such implicit reasoning scenarios: cases where the flaw is not explicitly stated but must be inferred from context. Using a curated diagnostic suite spanning four categories of real-world failure modes, we evaluate six MLLMs, including o3 and GPT-4o, and find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills. Explicit prompting reveals that the underlying capabilities exist but are often suppressed in favor of user compliance. We further show that simple inference-time interventions, such as cautious persona prompting and, in particular, requiring a clarifying question, can dramatically recover performance. Our findings highlight a persistent gap between reasoning competence and behavioral compliance in current MLLMs and suggest practical strategies for making these models more trustworthy in underconstrained environments.
zh

[AI-196] Designing AI Tools for Clinical Care Teams to Support Serious Illness Conversations with Older Adults in the Emergency Department

【速读】:该论文试图解决在急诊科(Emergency Department, ED)环境中,临床护理团队与患有严重、威胁生命疾病的老年人进行严重疾病对话(Serious Illness Conversations, SICs)时所面临的障碍。这些问题包括电子健康记录(Electronic Health Records, EHR)数据碎片化、时间限制、情感准备需求以及文档负担。解决方案的关键在于开发符合现有临床实践的AI工具,以支持SIC的工作流程,包括信息综合、对话支持和自动化文档,并强调在技术介入的同时保持人文关怀和临床自主性。

链接: https://arxiv.org/abs/2506.00241
作者: Menglin Zhao,Zhuorui Yong,Ruijia Guan,Kai-Wei Chang,Adrian Haimovich,Kei Ouchi,Timothy Bickmore,Bingsheng Yao,Dakuo Wang,Smit Desai
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Serious illness conversations (SICs), discussions between clinical care teams and patients with serious, life-limiting illnesses about their values, goals, and care preferences, are critical for patient-centered care. Without these conversations, patients often receive aggressive interventions that may not align with their goals. Clinical care teams face significant barriers when conducting serious illness conversations with older adult patients in Emergency Department (ED) settings, where most older adult patients lack documented treatment goals. To understand current practices and identify AI support opportunities, we conducted interviews with two domain experts and nine ED clinical care team members. Through thematic analysis, we characterized a four-phase serious illness conversation workflow (identification, preparation, conduction, documentation) and identified key needs and challenges at each stage. Clinical care teams struggle with fragmented EHR data access, time constraints, emotional preparation demands, and documentation burdens. While participants expressed interest in AI tools for information synthesis, conversational support, and automated documentation, they emphasized preserving human connection and clinical autonomy. We present design guidelines for AI tools supporting SIC workflows that fit within existing clinical practices. This work contributes empirical understanding of ED-based serious illness conversations and provides design considerations for AI in high-stakes clinical environments.
zh

[AI-197] SMELLNET: A Large-scale Dataset for Real-world Smell Recognition

【速读】:该论文试图解决在现实世界中训练和评估人工智能系统嗅觉能力缺乏大规模基准的问题(benchmark),从而推动对气味的感知与识别技术发展。其解决方案的关键在于构建SmellNet,这是首个大规模的、数字化自然世界中多种气味的数据集,包含约180,000个时间步长的50种物质数据,涵盖坚果、香料、草药、水果和蔬菜等类别。通过SmellNet,研究者训练了基于气味实时分类的AI模型,并采用序列模型、对比学习以及一种新的时间差分方法来提升模型性能,以应对复杂环境下的气味识别挑战。

链接: https://arxiv.org/abs/2506.00239
作者: Dewei Feng,Carol Li,Wei Dai,Paul Pu Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 13 figures

点击查看摘要

Abstract:The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large scale benchmarks, and therefore little progress, for training and evaluating AI systems’ ability to smell in the real world. In this paper, we use portable gas and chemical sensors to create SmellNet, the first large-scale database that digitizes a diverse range of smells in the natural world. SmellNet contains about 180,000 time steps of 50 substances (spanning nuts, spices, herbs, fruits, and vegetables) with 50 hours of data. Using SmellNet, we train AI models for real-time classification of substances based on their smell alone. Our best methods leverage sequence models, contrastive learning to integrate high-resolution Gas Chromatography-Mass Spectrometry molecular data, and a new temporal difference method that identifies sharp changes in sensor readings. Our best models achieve up to 65.35% accuracy on pre-recorded data, and generalize to real-world conditions with 10.71% accuracy on nuts and 25.38% on spices in the challenging 50-way online classification task. Despite these promising results, SmellNet highlights many technical challenges in building AI for smell, including richer feature learning, on-edge smell models, and robustness to environmental changes.
zh

[AI-198] Ethical AI: Towards Defining a Collective Evaluation Framework

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在快速集成过程中引发的伦理问题,包括数据所有权、隐私和系统性偏见等,特别是在高风险领域中透明度不足、决策不可解释以及不公平对待等问题。其解决方案的关键在于提出一种基于本体论块(ontological blocks)的模块化伦理评估框架,这些块是意义离散且可解释的单元,编码了公平性、责任性和所有权等伦理原则,并与FAIR(Findable, Accessible, Interoperable, Reusable)原则相结合,以支持可扩展、透明且符合法律要求的伦理评估。

链接: https://arxiv.org/abs/2506.00233
作者: Aasish Kumar Sharma,Dimitar Kyosev,Julian Kunkel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, accepted at 8th IEEE International Workshop on Advances in Artificial Intelligence and Machine Learning (AIML 2025): Futuristic AI and ML models Intelligent Systems

点击查看摘要

Abstract:Artificial Intelligence (AI) is transforming sectors such as healthcare, finance, and autonomous systems, offering powerful tools for innovation. Yet its rapid integration raises urgent ethical concerns related to data ownership, privacy, and systemic bias. Issues like opaque decision-making, misleading outputs, and unfair treatment in high-stakes domains underscore the need for transparent and accountable AI systems. This article addresses these challenges by proposing a modular ethical assessment framework built on ontological blocks of meaning-discrete, interpretable units that encode ethical principles such as fairness, accountability, and ownership. By integrating these blocks with FAIR (Findable, Accessible, Interoperable, Reusable) principles, the framework supports scalable, transparent, and legally aligned ethical evaluations, including compliance with the EU AI Act. Using a real-world use case in AI-powered investor profiling, the paper demonstrates how the framework enables dynamic, behavior-informed risk classification. The findings suggest that ontological blocks offer a promising path toward explainable and auditable AI ethics, though challenges remain in automation and probabilistic reasoning.
zh

[AI-199] he World As Large Language Models See It: Exploring the reliability of LLM s in representing geographical features

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在地理信息表示中的可信度问题,特别是其在地名编码(geocoding)、高程估计(elevation estimation)和反向地名编码(reverse geocoding)等关键地理空间任务中的表现。研究的解决方案之关键在于通过具体任务评估GPT-4o和Gemini 2.0 Flash模型的地理信息处理能力,并揭示其在精度、系统偏差及区域一致性方面的差异,从而为提升模型在GIScience和Geoinformatics领域的应用提供依据。

链接: https://arxiv.org/abs/2506.00203
作者: Omid Reza Abbasi,Franz Welscher,Georg Weinberger,Johannes Scholz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 9 pages, 4 figures, 2 tables

点击查看摘要

Abstract:As large language models (LLMs) continue to evolve, questions about their trustworthiness in delivering factual information have become increasingly important. This concern also applies to their ability to accurately represent the geographic world. With recent advancements in this field, it is relevant to consider whether and to what extent LLMs’ representations of the geographical world can be trusted. This study evaluates the performance of GPT-4o and Gemini 2.0 Flash in three key geospatial tasks: geocoding, elevation estimation, and reverse geocoding. In the geocoding task, both models exhibited systematic and random errors in estimating the coordinates of St. Anne’s Column in Innsbruck, Austria, with GPT-4o showing greater deviations and Gemini 2.0 Flash demonstrating more precision but a significant systematic offset. For elevation estimation, both models tended to underestimate elevations across Austria, though they captured overall topographical trends, and Gemini 2.0 Flash performed better in eastern regions. The reverse geocoding task, which involved identifying Austrian federal states from coordinates, revealed that Gemini 2.0 Flash outperformed GPT-4o in overall accuracy and F1-scores, demonstrating better consistency across regions. Despite these findings, neither model achieved an accurate reconstruction of Austria’s federal states, highlighting persistent misclassifications. The study concludes that while LLMs can approximate geographic information, their accuracy and reliability are inconsistent, underscoring the need for fine-tuning with geographical information to enhance their utility in GIScience and Geoinformatics.
zh

[AI-200] What do professional software developers need to know to succeed in an age of Artificial Intelligence?

【速读】:该论文试图解决生成式 AI(Generative AI)在软件开发中的应用所带来的生产力提升与劳动力 disruption 及技能退化之间的矛盾。其解决方案的关键在于识别并组织成功使用 AI 的开发者所需的知识与技能,将其划分为四个领域:有效使用生成式 AI、核心软件工程、相邻工程以及相邻非工程领域,并在六步任务工作流的关键节点进行部署,以通过在职学习和计算机科学教育项目,全面培养开发者的“软技能”与技术能力,从而实现再培训、技能提升及防止技能退化。

链接: https://arxiv.org/abs/2506.00202
作者: Matthew Kam,Cody Miller,Miaoxin Wang,Abey Tidwell,Irene A. Lee,Joyce Malyn-Smith,Beatriz Perez,Vikram Tiwari,Joshua Kenitzer,Andrew Macvean,Erin Barrar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, software engineering education track of the 2025 ACM international conference on the foundations of software engineering, includes supplementary material i.e. full 50-page occupational profile of the AI-enhanced software developer

点击查看摘要

Abstract:Generative AI is showing early evidence of productivity gains for software developers, but concerns persist regarding workforce disruption and deskilling. We describe our research with 21 developers at the cutting edge of using AI, summarizing 12 of their work goals we uncovered, together with 75 associated tasks and the skills knowledge for each, illustrating how developers use AI at work. From all of these, we distilled our findings in the form of 5 insights. We found that the skills knowledge to be a successful AI-enhanced developer are organized into four domains (using Generative AI effectively, core software engineering, adjacent engineering, and adjacent non-engineering) deployed at critical junctures throughout a 6-step task workflow. In order to “future proof” developers for this age of AI, on-the-job learning initiatives and computer science degree programs will need to target both “soft” skills and the technical skills knowledge in all four domains to reskill, upskill and safeguard against deskilling.
zh

[AI-201] MOFGPT : Generative Design of Metal-Organic Frameworks using Language Models

【速读】:该论文试图解决金属有机框架(Metal-Organic Frameworks, MOFs)在应用特定性质设计中的核心挑战,即由于其结构设计空间庞大且复杂,传统计算筛选技术如分子模拟和密度泛函理论(DFT)在大规模应用中计算成本过高。解决方案的关键在于提出一种基于强化学习增强的Transformer框架,其中MOFid作为化学信息化的字符串表示,能够编码连接性和拓扑结构,从而实现可扩展的生成建模。该方法通过结合生成式GPT模型、基于Transformer的性质预测器以及强化学习模块,实现了对目标功能属性的优化,推动了可合成且拓扑有效的MOFs的逆向设计。

链接: https://arxiv.org/abs/2506.00198
作者: Srivathsan Badrinarayanan,Rishikesh Magar,Akshay Antony,Radheesh Sharma Meda,Amir Barati Farimani
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 21 pages, 3 figures (in main text, without references)

点击查看摘要

Abstract:The discovery of Metal-Organic Frameworks (MOFs) with application-specific properties remains a central challenge in materials chemistry, owing to the immense size and complexity of their structural design space. Conventional computational screening techniques such as molecular simulations and density functional theory (DFT), while accurate, are computationally prohibitive at scale. Machine learning offers an exciting alternative by leveraging data-driven approaches to accelerate materials discovery. The complexity of MOFs, with their extended periodic structures and diverse topologies, creates both opportunities and challenges for generative modeling approaches. To address these challenges, we present a reinforcement learning-enhanced, transformer-based framework for the de novo design of MOFs. Central to our approach is MOFid, a chemically-informed string representation encoding both connectivity and topology, enabling scalable generative modeling. Our pipeline comprises three components: (1) a generative GPT model trained on MOFid sequences, (2) MOFormer, a transformer-based property predictor, and (3) a reinforcement learning (RL) module that optimizes generated candidates via property-guided reward functions. By integrating property feedback into sequence generation, our method drives the model toward synthesizable, topologically valid MOFs with desired functional attributes. This work demonstrates the potential of large language models, when coupled with reinforcement learning, to accelerate inverse design in reticular chemistry and unlock new frontiers in computational MOF discovery.
zh

[AI-202] Heterogeneous Graph Backdoor Attack

【速读】:该论文试图解决异构图神经网络(Heterogeneous Graph Neural Networks, HGNNs)在面对现有图后门攻击时所表现出的高攻击预算需求、低效且不可靠的后门激活以及攻击效果评估不准确等问题。其解决方案的关键在于提出一种专为HGNN设计的后门攻击方法——异构图后门攻击(Heterogeneous Graph Backdoor Attack, HGBA),该方法引入了一种基于关系的触发机制,通过后门元路径在选定的触发节点与污染节点之间建立特定连接,从而实现高效且隐蔽的后门注入,并支持通过两种灵活策略进行后门激活,同时改进了攻击效果评估协议以提高准确性。

链接: https://arxiv.org/abs/2506.00191
作者: Jiawei Chen,Lusi Li,Daniel Takabi,Masha Sosonkina,Rui Ning
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Heterogeneous Graph Neural Networks (HGNNs) excel in modeling complex, multi-typed relationships across diverse domains, yet their vulnerability to backdoor attacks remains unexplored. To address this gap, we conduct the first investigation into the susceptibility of HGNNs to existing graph backdoor attacks, revealing three critical issues: (1) high attack budget required for effective backdoor injection, (2) inefficient and unreliable backdoor activation, and (3) inaccurate attack effectiveness evaluation. To tackle these issues, we propose the Heterogeneous Graph Backdoor Attack (HGBA), the first backdoor attack specifically designed for HGNNs, introducing a novel relation-based trigger mechanism that establishes specific connections between a strategically selected trigger node and poisoned nodes via the backdoor metapath. HGBA achieves efficient and stealthy backdoor injection with minimal structural modifications and supports easy backdoor activation through two flexible strategies: Self-Node Attack and Indiscriminate Attack. Additionally, we improve the ASR measurement protocol, enabling a more accurate assessment of attack effectiveness. Extensive experiments demonstrate that HGBA far surpasses multiple state-of-the-art graph backdoor attacks in black-box settings, efficiently attacking HGNNs with low attack budgets. Ablation studies show that the strength of HBGA benefits from our trigger node selection method and backdoor metapath selection strategy. In addition, HGBA shows superior robustness against node feature perturbations and multiple types of existing graph backdoor defense mechanisms. Finally, extension experiments demonstrate that the relation-based trigger mechanism can effectively extend to tasks in homogeneous graph scenarios, thereby posing severe threats to broader security-critical domains.
zh

[AI-203] ournament of Prompts: Evolving LLM Instructions Through Structured Debates and Elo Ratings

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂任务时,由于提示工程(Prompt Engineering)所面临的瓶颈问题,尤其是针对涉及主观质量评估的任务,传统自动化提示优化方法因缺乏明确的优化目标或依赖通用模板而效果不佳。论文提出的解决方案关键在于DEEVO(DEbate-driven EVOlutionary prompt optimization)框架,其通过基于博弈评价的进化机制,利用Elo评分作为适应度代理,结合辩论反馈进行智能交叉和策略性变异操作,从而在保持语义连贯性的前提下探索离散的提示空间,并有效提升提示种群的性能与多样性。

链接: https://arxiv.org/abs/2506.00178
作者: Anirudh Nair,Adi Banerjee,Laurent Mombaerts,Matthew Hagen,Tarik Borogovac
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Prompt engineering represents a critical bottleneck to harness the full potential of Large Language Models (LLMs) for solving complex tasks, as it requires specialized expertise, significant trial-and-error, and manual intervention. This challenge is particularly pronounced for tasks involving subjective quality assessment, where defining explicit optimization objectives becomes fundamentally problematic. Existing automated prompt optimization methods falter in these scenarios, as they typically require well-defined task-specific numerical fitness functions or rely on generic templates that cannot capture the nuanced requirements of complex use cases. We introduce DEEVO (DEbate-driven EVOlutionary prompt optimization), a novel framework that guides prompt evolution through a debate-driven evaluation with an Elo-based selection. Contrary to prior work, DEEVOs approach enables exploration of the discrete prompt space while preserving semantic coherence through intelligent crossover and strategic mutation operations that incorporate debate-based feedback, combining elements from both successful and unsuccessful prompts based on identified strengths rather than arbitrary splicing. Using Elo ratings as a fitness proxy, DEEVO simultaneously drives improvement and preserves valuable diversity in the prompt population. Experimental results demonstrate that DEEVO significantly outperforms both manual prompt engineering and alternative state-of-the-art optimization approaches on open-ended tasks and close-ended tasks despite using no ground truth feedback. By connecting LLMs reasoning capabilities with adaptive optimization, DEEVO represents a significant advancement in prompt optimization research by eliminating the need of predetermined metrics to continuously improve AI systems.
zh

[AI-204] Accountability Attribution: Tracing Model Behavior to Training Processes

【速读】:该论文试图解决现代AI开发流程中模型行为归因问题,即在模型部署后出现成功或失败时,如何确定具体哪个训练阶段及其影响程度。解决方案的关键在于提出一个通用框架,用于回答关于训练阶段影响的反事实问题:如果某训练阶段的更新未被执行,模型行为会如何变化。该框架引入基于一阶近似的估计器,能够在不重新训练模型的情况下高效量化阶段影响,同时考虑训练数据和优化动态的关键方面,如学习率调度、动量和权重衰减。

链接: https://arxiv.org/abs/2506.00175
作者: Shichang Zhang,Hongzhe Du,Karim Saraipour,Jiaqi W. Ma,Himabindu Lakkaraju
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern AI development pipelines often involve multiple stages-pretraining, fine-tuning rounds, and subsequent adaptation or alignment-with numerous model update steps within each stage. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the problem of accountability attribution, which aims to trace model behavior back to specific stages of the training process. To address this, we propose a general framework that answers counterfactual questions about stage effects: how would the model behavior have changed if the updates from a training stage had not been executed?. Within this framework, we introduce estimators based on first-order approximations that efficiently quantify the stage effects without retraining. Our estimators account for both the training data and key aspects of optimization dynamics, including learning rate schedules, momentum, and weight decay. Empirically, we demonstrate that our approach identifies training stages accountable for specific behaviors, offering a practical tool for model analysis and a step toward more accountable AI development.
zh

[AI-205] Utilizing AI for Aviation Post-Accident Analysis Classification

【速读】:该论文旨在解决航空安全报告中海量文本数据带来的及时且准确分析难题,其核心问题是通过自动化手段提取有价值的信息以提升航空安全。解决方案的关键在于利用人工智能(AI)技术,特别是自然语言处理(NLP)和深度学习方法,对航空安全报告进行分类与主题建模(Topic Modeling, TM),从而识别事故损伤等级、飞行阶段及潜在的安全改进领域。研究还对比了不同深度学习模型和TM技术在多个数据集上的表现,强调了数据集规模与来源对分析准确性的重要影响。

链接: https://arxiv.org/abs/2506.00169
作者: Aziida Nanyonga,Graham Wild
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The volume of textual data available in aviation safety reports presents a challenge for timely and accurate analysis. This paper examines how Artificial Intelligence (AI) and, specifically, Natural Language Processing (NLP) can automate the process of extracting valuable insights from this data, ultimately enhancing aviation safety. The paper reviews ongoing efforts focused on the application of NLP and deep learning to aviation safety reports, with the goal of classifying the level of damage to an aircraft and identifying the phase of flight during which safety occurrences happen. Additionally, the paper explores the use of Topic Modeling ™ to uncover latent thematic structures within aviation incident reports, aiming to identify recurring patterns and potential areas for safety improvement. The paper compares and contrasts the performance of various deep learning models and TM techniques applied to datasets from the National Transportation Safety Board (NTSB) and the Australian Transport Safety Bureau (ATSB), as well as the Aviation Safety Network (ASN), discussing the impact of dataset size and source on the accuracy of the analysis. The findings demonstrate that both NLP and deep learning, as well as TM, can significantly improve the efficiency and accuracy of aviation safety analysis, paving the way for more proactive safety management and risk mitigation strategies.
zh

[AI-206] Supporting architecture evaluation for ATAM scenarios with LLM s

【速读】:该论文试图解决软件架构评估中因竞争性质量属性导致的场景选择与优先级确定困难问题,传统方法依赖人工进行长时间的头脑风暴以确定最合适的质量场景。解决方案的关键在于利用生成式 AI (Generative AI) 部分自动化评估活动,通过分析学生在软件架构课程中提出的质量场景,并与 LLM 提供的评估结果进行比较,验证其在风险、敏感点和权衡分析方面的有效性,从而提升架构评估的效率与准确性。

链接: https://arxiv.org/abs/2506.00150
作者: Rafael Capilla,J. Andrés Díaz-Pace,Yamid Ramírez,Jennifer Pérez,Vanessa Rodríguez-Horcajo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Architecture evaluation methods have long been used to evaluate software designs. Several evaluation methods have been proposed and used to analyze tradeoffs between different quality attributes. Having competing qualities leads to conflicts for selecting which quality-attribute scenarios are the most suitable ones that an architecture should tackle and for prioritizing the scenarios required by the stakeholders. In this context, architecture evaluation is carried out manually, often involving long brainstorming sessions to decide which are the most adequate quality scenarios. To reduce this effort and make the assessment and selection of scenarios more efficient, we suggest the usage of LLMs to partially automate evaluation activities. As a first step to validate this hypothesis, this work studies MS Copilot as an LLM tool to analyze quality scenarios suggested by students in a software architecture course and compares the students’ results with the assessment provided by the LLM. Our initial study reveals that the LLM produces in most cases better and more accurate results regarding the risks, sensitivity points and tradeoff analysis of the quality scenarios. Overall, the use of generative AI has the potential to partially automate and support the architecture evaluation tasks, improving the human decision-making process.
zh

[AI-207] Balancing Profit and Fairness in Risk-Based Pricing Markets

【速读】:该论文试图解决动态风险定价机制可能将弱势消费者群体系统性地排除在健康保险和消费信贷等关键资源之外的问题,即如何在竞争性市场中实现公平与社会福利的平衡。解决方案的关键在于引入一种可学习且可解释的税收方案,通过强化学习的社会规划者(SP)来制定分档公平税,同时利用L1\mathcal{L}_1正则化保持政策的简洁性与透明度,从而在不需明确协调的情况下提升市场公平性并优化社会福利。

链接: https://arxiv.org/abs/2506.00140
作者: Jesse Thibodeau,Hadi Nekoei,Afaf Taïk,Janarthanan Rajendran,Golnoosh Farnadi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Dynamic, risk-based pricing can systematically exclude vulnerable consumer groups from essential resources such as health insurance and consumer credit. We show that a regulator can realign private incentives with social objectives through a learned, interpretable tax schedule. First, we provide a formal proposition that bounding each firm’s \emphlocal demographic gap implicitly bounds the \emphglobal opt-out disparity, motivating firm-level penalties. Building on this insight we introduce \textttMarketSim – an open-source, scalable simulator of heterogeneous consumers and profit-maximizing firms – and train a reinforcement learning (RL) social planner (SP) that selects a bracketed fairness-tax while remaining close to a simple linear prior via an \mathcalL_1 regularizer. The learned policy is thus both transparent and easily interpretable. In two empirically calibrated markets, i.e., U.S. health-insurance and consumer-credit, our planner simultaneously raises demand-fairness by up to 16% relative to unregulated Free Market while outperforming a fixed linear schedule in terms of social welfare without explicit coordination. These results illustrate how AI-assisted regulation can convert a competitive social dilemma into a win-win equilibrium, providing a principled and practical framework for fairness-aware market oversight.
zh

[AI-208] A Reinforcement Learning-Based Telematic Routing Protocol for the Internet of Underwater Things

【速读】:该论文旨在解决水下物联网(Internet of Underwater Things, IoUT)中存在的低带宽、高延迟、节点移动性和有限能量资源等挑战,传统路由协议如RPL在水下环境中表现不佳。其解决方案的关键在于引入RL-RPL-UA,一种基于强化学习(Reinforcement Learning)的路由协议,每个节点包含一个轻量级强化学习代理,根据局部信息(如数据包投递率、缓冲区状态、链路质量和剩余能量)选择最佳父节点,并通过动态目标函数支持实时决策,从而提升网络性能。

链接: https://arxiv.org/abs/2506.00133
作者: Mohammadhossein Homaei,Mehran Tarif,Agustin Di Bartolo,Oscar Mogollon Gutierrez,Mar Avila
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 Pages, 10 Figures, 2 Tables

点击查看摘要

Abstract:The Internet of Underwater Things (IoUT) faces major challenges such as low bandwidth, high latency, mobility, and limited energy resources. Traditional routing protocols like RPL, which were designed for land-based networks, do not perform well in these underwater conditions. This paper introduces RL-RPL-UA, a new routing protocol that uses reinforcement learning to improve performance in underwater environments. Each node includes a lightweight RL agent that selects the best parent node based on local information such as packet delivery ratio, buffer level, link quality, and remaining energy. RL-RPL-UA keeps full compatibility with standard RPL messages and adds a dynamic objective function to support real-time decision-making. Simulations using Aqua-Sim show that RL-RPL-UA increases packet delivery by up to 9.2%, reduces energy use per packet by 14.8%, and extends network lifetime by 80 seconds compared to traditional methods. These results suggest that RL-RPL-UA is a promising and energy-efficient routing solution for underwater networks.
zh

[AI-209] Adapting Offline Reinforcement Learning with Online Delays

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)代理在从离线到在线部署过程中面临的两个关键问题:模拟到现实(sim-to-real)的差距以及交互(interaction)差距。其中,模拟到现实的差距源于真实系统中存在延迟和其他仿真中不存在的不完美因素,而交互差距则是因为纯离线训练的策略在在线执行时会遇到分布外状态,由于收集新交互数据的成本或风险较高,导致这一问题难以解决。该研究提出的解决方案是DT-CORL(Delay-Transformer belief policy Constrained Offline RL),其关键在于利用基于Transformer的信念预测器生成对延迟具有鲁棒性的动作,且在训练过程中并未接触延迟观测,同时相比传统的历史增强基线方法表现出更高的样本效率。

链接: https://arxiv.org/abs/2506.00131
作者: Simon Sinong Zhan,Qingyuan Wu,Frank Yang,Xiangyu Shi,Chao Huang,Qi Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations during training, and (ii) is markedly more sample-efficient than naïve history-augmentation baselines. Experiments on D4RL benchmarks with several delay settings show that DT-CORL consistently outperforms both history-augmentation and vanilla belief-based methods, narrowing the sim-to-real latency gap while preserving data efficiency.
zh

[AI-210] Gated Multimodal Graph Learning for Personalized Recommendation

【速读】:该论文旨在解决协同过滤中的冷启动和稀疏性问题,通过引入丰富的多模态内容信息(如产品图像和文本描述)来提升推荐效果。其解决方案的关键在于提出一种轻量且模块化的推荐框架RLMultimodalRec,该框架结合了基于图的用户建模与自适应多模态物品编码,采用门控融合模块动态平衡视觉与文本模态的贡献,从而实现细粒度且内容感知的物品表示,同时利用两层LightGCN编码器捕捉高阶协同信号,无需依赖非线性变换,提升了模型的可扩展性和可解释性。

链接: https://arxiv.org/abs/2506.00107
作者: Sibei Liu,Yuanzhe Zhang,Xiang Li,Yunbo Liu,Chengwei Feng,Hao Yang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal recommendation has emerged as a promising solution to alleviate the cold-start and sparsity problems in collaborative filtering by incorporating rich content information, such as product images and textual descriptions. However, effectively integrating heterogeneous modalities into a unified recommendation framework remains a challenge. Existing approaches often rely on fixed fusion strategies or complex architectures , which may fail to adapt to modality quality variance or introduce unnecessary computational overhead. In this work, we propose RLMultimodalRec, a lightweight and modular recommendation framework that combines graph-based user modeling with adaptive multimodal item encoding. The model employs a gated fusion module to dynamically balance the contribution of visual and textual modalities, enabling fine-grained and content-aware item representations. Meanwhile, a two-layer LightGCN encoder captures high-order collaborative signals by propagating embeddings over the user-item interaction graph without relying on nonlinear transformations. We evaluate our model on a real-world dataset from the Amazon product domain. Experimental results demonstrate that RLMultimodalRec consistently outperforms several competitive baselines, including collaborative filtering, visual-aware, and multimodal GNN-based methods. The proposed approach achieves significant improvements in top-K recommendation metrics while maintaining scalability and interpretability, making it suitable for practical deployment. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.00107 [cs.IR] (or arXiv:2506.00107v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2506.00107 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-211] Feeling Guilty Being a c(ai)borg: Navigating the Tensions Between Guilt and Empowerment in AI Use

【速读】:该论文试图解决将人工智能(Artificial Intelligence, AI)整合到个人和职业工作流程中所带来的情感、伦理和实践问题,特别是探讨作为“c(ai)borg”——即被AI增强的人类——所产生的内疚感。其解决方案的关键在于通过提升基础学术技能、高级AI素养以及对AI结果的诚实互动,实现从最初的内疚和抵触到通过技能构建和透明性带来的赋权转变。研究倡导一种开放接纳AI作为协作伙伴的未来愿景,以促进创新与公平,并解决获取与自主性问题。

链接: https://arxiv.org/abs/2506.00094
作者: Konstantin Aal,Tanja Aal,Vasil Navumau,David Unbehaun,Claudia Müller,Volker Wulf,Sarah Rüller
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 16 pages,

点击查看摘要

Abstract:This paper explores the emotional, ethical and practical dimensions of integrating Artificial Intelligence (AI) into personal and professional workflows, focusing on the concept of feeling guilty as a ‘c(ai)borg’ - a human augmented by AI. Inspired by Donna Haraway’s Cyborg Manifesto, the study explores how AI challenges traditional notions of creativity, originality and intellectual labour. Using an autoethnographic approach, the authors reflect on their year-long experiences with AI tools, revealing a transition from initial guilt and reluctance to empowerment through skill-building and transparency. Key findings highlight the importance of basic academic skills, advanced AI literacy and honest engagement with AI results. The c(ai)borg vision advocates for a future where AI is openly embraced as a collaborative partner, fostering innovation and equity while addressing issues of access and agency. By reframing guilt as growth, the paper calls for a thoughtful and inclusive approach to AI integration.
zh

[AI-212] RAPDOC: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents

【速读】:该论文试图解决用户对大语言模型(Large Language Models, LLMs)的过度依赖问题,这种依赖表现为用户在处理作业、任务或敏感文档时缺乏实质性参与。解决方案的关键在于引入不可察觉的幻影标记(phantom tokens)到文档中,使得LLMs生成看似合理但实际错误的输出,从而促使用户重新审视和参与内容的生成与编辑过程。该方法被集成到TRAPDOC框架中,旨在减少用户对LLMs的盲目信任并促进更负责任的使用方式。

链接: https://arxiv.org/abs/2506.00089
作者: Hyundong Jin,Sicheol Sung,Shinwoo Park,SeungYeop Baik,Yo-Sub Han
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reasoning, writing, text-editing, and retrieval capabilities of proprietary large language models (LLMs) have advanced rapidly, providing users with an ever-expanding set of functionalities. However, this growing utility has also led to a serious societal concern: the over-reliance on LLMs. In particular, users increasingly delegate tasks such as homework, assignments, or the processing of sensitive documents to LLMs without meaningful engagement. This form of over-reliance and misuse is emerging as a significant social issue. In order to mitigate these issues, we propose a method injecting imperceptible phantom tokens into documents, which causes LLMs to generate outputs that appear plausible to users but are in fact incorrect. Based on this technique, we introduce TRAPDOC, a framework designed to deceive over-reliant LLM users. Through empirical evaluation, we demonstrate the effectiveness of our framework on proprietary LLMs, comparing its impact against several baselines. TRAPDOC serves as a strong foundation for promoting more responsible and thoughtful engagement with language models. Our code is available at this https URL.
zh

[AI-213] Hi-Dyna Graph: Hierarchical Dynamic Scene Graph for Robotic Autonomy in Human-Centric Environments

【速读】:该论文旨在解决服务机器人在以人为中心的场景中实现自主操作的问题,这一问题主要源于对动态环境的理解和上下文感知决策的需求。现有方法如拓扑地图虽能提供高效的时空先验,但无法建模瞬时物体关系,而密集神经表示(如NeRF)则计算成本过高。论文提出的解决方案是Hi-Dyna Graph,其关键在于构建一个分层动态场景图架构,将持久的全局布局与局部动态语义相结合,通过语义和空间约束将动态子图锚定到全局拓扑中,从而实现环境演变时的无缝更新,并利用大型语言模型(LLMs)解析统一图结构,推断潜在任务触发器并生成基于机器人能力的可执行指令。

链接: https://arxiv.org/abs/2506.00083
作者: Jiawei Hou,Xiangyang Xue,Taiping Zeng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous operation of service robotics in human-centric scenes remains challenging due to the need for understanding of changing environments and context-aware decision-making. While existing approaches like topological maps offer efficient spatial priors, they fail to model transient object relationships, whereas dense neural representations (e.g., NeRF) incur prohibitive computational costs. Inspired by the hierarchical scene representation and video scene graph generation works, we propose Hi-Dyna Graph, a hierarchical dynamic scene graph architecture that integrates persistent global layouts with localized dynamic semantics for embodied robotic autonomy. Our framework constructs a global topological graph from posed RGB-D inputs, encoding room-scale connectivity and large static objects (e.g., furniture), while environmental and egocentric cameras populate dynamic subgraphs with object position relations and human-object interaction patterns. A hybrid architecture is conducted by anchoring these subgraphs to the global topology using semantic and spatial constraints, enabling seamless updates as the environment evolves. An agent powered by large language models (LLMs) is employed to interpret the unified graph, infer latent task triggers, and generate executable instructions grounded in robotic affordances. We conduct complex experiments to demonstrate Hi-Dyna Grap’s superior scene representation effectiveness. Real-world deployments validate the system’s practicality with a mobile manipulator: robotics autonomously complete complex tasks with no further training or complex rewarding in a dynamic scene as cafeteria assistant. See this https URL for video demonstration and more details.
zh

[AI-214] Who Gets the Kidney? Human-AI Alignment Indecision and Moral Values

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在涉及道德/伦理的高风险决策场景中(如肾脏分配)与人类道德价值观不一致的问题。研究发现,LLMs在优先级排序上表现出与人类价值观显著偏离的行为,并且在面对不确定性时倾向于做出确定性决策,而非像人类那样表现出犹豫。解决方案的关键在于采用少量样本的低秩监督微调方法,以提高决策一致性并校准犹豫建模能力。

链接: https://arxiv.org/abs/2506.00079
作者: John P. Dickerson,Hadi Hosseini,Samarth Khanna,Leona Pierce
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid integration of Large Language Models (LLMs) in high-stakes decision-making – such as allocating scarce resources like donor organs – raises critical questions about their alignment with human moral values. We systematically evaluate the behavior of several prominent LLMs against human preferences in kidney allocation scenarios and show that LLMs: i) exhibit stark deviations from human values in prioritizing various attributes, and ii) in contrast to humans, LLMs rarely express indecision, opting for deterministic decisions even when alternative indecision mechanisms (e.g., coin flipping) are provided. Nonetheless, we show that low-rank supervised fine-tuning with few samples is often effective in improving both decision consistency and calibrating indecision modeling. These findings illustrate the necessity of explicit alignment strategies for LLMs in moral/ethical domains.
zh

[AI-215] Reducing Latency in LLM -Based Natural Language Commands Processing for Robot Navigation

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在工业机器人中应用时存在的请求与响应延迟问题,该问题源于模型的计算复杂性和规模。解决方案的关键在于将ChatGPT自然语言模型与Robot Operating System 2(ROS 2)进行集成,从而减少交互延迟并提升机器人系统的控制性能。该研究提出了一种无需中间件传输平台的架构,通过实验验证了该集成方法能够平均降低7.01%的通信延迟,进而提高人机交互的执行速度、可用性和可访问性。

链接: https://arxiv.org/abs/2506.00075
作者: Diego Pollini,Bruna V. Guterres,Rodrigo S. Guerra,Ricardo B. Grando
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to the 23rd IEEE International Conference on Industrial Informatics (INDIN)

点击查看摘要

Abstract:The integration of Large Language Models (LLMs), such as GPT, in industrial robotics enhances operational efficiency and human-robot collaboration. However, the computational complexity and size of these models often provide latency problems in request and response times. This study explores the integration of the ChatGPT natural language model with the Robot Operating System 2 (ROS 2) to mitigate interaction latency and improve robotic system control within a simulated Gazebo environment. We present an architecture that integrates these technologies without requiring a middleware transport platform, detailing how a simulated mobile robot responds to text and voice commands. Experimental results demonstrate that this integration improves execution speed, usability, and accessibility of the human-robot interaction by decreasing the communication latency by 7.01% on average. Such improvements facilitate smoother, real-time robot operations, which are crucial for industrial automation and precision tasks.
zh

[AI-216] Whose Name Comes Up? Auditing LLM -Based Scholar Recommendations

【速读】:该论文试图解决生成式 AI (Generative AI) 在物理领域专家推荐任务中的性能与偏差问题,具体包括推荐一致性、事实准确性以及性别、种族、学术知名度和学者相似性等方面的偏见。其解决方案的关键在于利用来自美国物理学会和OpenAlex的真实学术数据建立基准,通过对比模型输出与实际学术记录来评估模型表现,并揭示模型在不同任务中的不一致性和偏见特征。

链接: https://arxiv.org/abs/2506.00074
作者: Daniele Barolo,Chiara Valentin,Fariba Karimi,Luis Galárraga,Gonzalo G. Méndez,Lisette Espín-Noboa
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
备注: 39 pages: 10 main (incl. 9 figures), 3 references, and 26 appendix. Paper under-review

点击查看摘要

Abstract:This paper evaluates the performance of six open-weight LLMs (llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, llama3.1-70b) in recommending experts in physics across five tasks: top-k experts by field, influential scientists by discipline, epoch, seniority, and scholar counterparts. The evaluation examines consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity. Using ground-truth data from the American Physical Society and OpenAlex, we establish scholarly benchmarks by comparing model outputs to real-world academic records. Our analysis reveals inconsistencies and biases across all models. mixtral-8x7b produces the most stable outputs, while llama3.1-70b shows the highest variability. Many models exhibit duplication, and some, particularly gemma2-9b and llama3.1-8b, struggle with formatting errors. LLMs generally recommend real scientists, but accuracy drops in field-, epoch-, and seniority-specific queries, consistently favoring senior scholars. Representation biases persist, replicating gender imbalances (reflecting male predominance), under-representing Asian scientists, and over-representing White scholars. Despite some diversity in institutional and collaboration networks, models favor highly cited and productive scholars, reinforcing the rich-getricher effect while offering limited geographical representation. These findings highlight the need to improve LLMs for more reliable and equitable scholarly recommendations.
zh

[AI-217] Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

【速读】:该论文试图解决传统监督微调(Supervised Fine-Tuning, SFT)在机器人控制任务中存在的一些局限性,如数据集构建的启发式性质、灾难性遗忘以及泛化性能下降等问题。解决方案的关键在于提出一种名为Robot-R1的新框架,该框架通过强化学习增强具身推理能力,具体表现为基于当前场景图像和环境元数据预测完成任务所需的下一个关键点状态,并通过采样基于推理的响应并强化那些能带来更准确预测的响应来优化模型性能。

链接: https://arxiv.org/abs/2506.00070
作者: Dongyoung Kim,Sumin Park,Huiwon Jang,Jinwoo Shin,Jaehyung Kim,Younggyo Seo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 26 pages, 14 figures

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and primitive movement reasoning.
zh

[AI-218] Literature Review Of Multi-Agent Debate For Problem-Solving

【速读】:该论文试图解决多智能体大语言模型(MA-LLMs)领域中缺乏直接比较的问题,旨在揭示可扩展性、通信结构和决策过程等因素如何影响MA-LLM的性能。其解决方案的关键在于综合传统多智能体系统与最新MA-LLM研究,通过分析常见实践和当前挑战,为开发高效且稳健的多智能体AI方案提供指导。

链接: https://arxiv.org/abs/2506.00066
作者: Arne Tillmann
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Multi-agent large language models (MA-LLMs) are a rapidly growing research area that leverages multiple interacting language agents to tackle complex tasks, outperforming single-agent large language models. This literature review synthesizes the latest research on agent profiles, communication structures, and decision-making processes, drawing insights from both traditional multi-agent systems and state-of-the-art MA-LLM studies. In doing so, it aims to address the lack of direct comparisons in the field, illustrating how factors like scalability, communication structure, and decision-making processes influence MA-LLM performance. By examining frequent practices and outlining current challenges, the review reveals that multi-agent approaches can yield superior results but also face elevated computational costs and under-explored challenges unique to MA-LLM. Overall, these findings provide researchers and practitioners with a roadmap for developing robust and efficient multi-agent AI solutions.
zh

[AI-219] Prompt Engineer: Analyzing Skill Requirements in the AI Job Market

【速读】:该论文试图解决的问题是:随着大型语言模型(Large Language Models, LLMs)的兴起,新兴职业“提示工程师(Prompt Engineer)”所需的技能及其岗位普及程度尚不明确。论文通过分析LinkedIn上的20,662份职位招聘信息(其中包括72个提示工程师职位),揭示了该职业的独特技能需求。其解决方案的关键在于识别出提示工程师所需的核心技能组合,包括人工智能知识(22.8%)、提示设计能力(18.7%)、良好的沟通能力(21.9%)以及创造性问题解决能力(15.8%),并对比传统岗位如数据科学家和机器学习工程师的技能要求,证明提示工程正在发展为一个独立的职业领域。

链接: https://arxiv.org/abs/2506.00058
作者: An Vu,Jonas Oppenlaender
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 42 pages, 8 figures

点击查看摘要

Abstract:The rise of large language models (LLMs) has created a new job role: the Prompt Engineer. Despite growing interest in this position, we still do not fully understand what skills this new job role requires or how common these jobs are. We analyzed 20,662 job postings on LinkedIn, including 72 prompt engineer positions, to learn more about this emerging role. We found that prompt engineering is still rare (less than 0.5% of sampled job postings) but has a unique skill profile. Prompt engineers need AI knowledge (22.8%), prompt design skills (18.7%), good communication (21.9%), and creative problem-solving (15.8%) skills. These requirements significantly differ from those of established roles, such as data scientists and machine learning engineers, showing that prompt engineering is becoming its own profession. Our findings help job seekers, employers, and educational institutions in better understanding the emerging field of prompt engineering.
zh

[AI-220] oward Knowledge-Guided AI for Inverse Design in Manufacturing: A Perspective on Domain Physics and Human-AI Synergy

【速读】:该论文试图解决在制造领域中,传统数据驱动方法在面对稀疏数据、高维设计空间和复杂物理约束时所遇到的性能瓶颈问题。其解决方案的关键在于构建新一代设计系统,该系统通过整合领域知识、物理信息学习以及直观的人机交互界面,超越传统的黑箱建模方法。具体而言,解决方案强调专家引导的采样策略以提高数据效率和模型泛化能力,利用物理信息机器学习实现数据稀缺条件下的物理一致性建模,并借助大语言模型作为交互式设计代理,连接用户意图与仿真工具、优化流程及协作工作流。

链接: https://arxiv.org/abs/2506.00056
作者: Hugon Lee,Hyeonbin Moon,Junhyeong Lee,Seunghwa RYu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 26 pages, 4 figures

点击查看摘要

Abstract:Artificial intelligence (AI) is reshaping inverse design across manufacturing domain, enabling high-performance discovery in materials, products, and processes. However, purely data-driven approaches often struggle in realistic settings characterized by sparse data, high-dimensional design spaces, and nontrivial physical constraints. This perspective argues for a new generation of design systems that transcend black-box modeling by integrating domain knowledge, physics-informed learning, and intuitive human-AI interfaces. We first demonstrate how expert-guided sampling strategies enhance data efficiency and model generalization. Next, we discuss how physics-informed machine learning enables physically consistent modeling in data-scarce regimes. Finally, we explore how large language models emerge as interactive design agents connecting user intent with simulation tools, optimization pipelines, and collaborative workflows. Through illustrative examples and conceptual frameworks, we advocate that inverse design in manufacturing should evolve into a unified ecosystem, where domain knowledge, physical priors, and adaptive reasoning collectively enable scalable, interpretable, and accessible AI-driven design systems.
zh

[AI-221] Rethinking Hybrid Retrieval: When Small Embeddings and LLM Re-ranking Beat Bigger Models

【速读】:该论文试图解决 Retrieval-Augmented Generation (RAG) 系统中多模态嵌入模型融合的优化问题,旨在提升检索质量与生成效果。其解决方案的关键在于将密集语义、稀疏词汇和基于图的嵌入进行融合,并选择适合与大语言模型(LLM)重排序相结合的嵌入模型。研究发现,尽管BGE-Large模型规模更大,但MiniLM-v6在与LLM重排序结合时表现出更优性能,这表明嵌入模型的选择应优先考虑与多信号融合及LLM对齐的兼容性,而非单纯依赖模型大小。这一方法在保持计算效率的同时提升了检索准确性和系统整体性能。

链接: https://arxiv.org/abs/2506.00049
作者: Arjun Rao,Hanieh Alipour,Nick Pendar
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a comparison of embedding models in tri-modal hybrid retrieval for Retrieval-Augmented Generation (RAG) systems. We investigate the fusion of dense semantic, sparse lexical, and graph-based embeddings, focusing on the performance of the MiniLM-v6 and BGE-Large architectures. Contrary to conventional assumptions, our results show that the compact MiniLM-v6 outperforms the larger BGE-Large when integrated with LLM-based re-ranking within our tri-modal hybrid framework. Experiments conducted on the SciFact, FIQA, and NFCorpus datasets demonstrate significant improvements in retrieval quality with the MiniLM-v6 configuration. The performance difference is particularly pronounced in agentic re-ranking scenarios, indicating better alignment between MiniLM-v6’s embedding space and LLM reasoning. Our findings suggest that embedding model selection for RAG systems should prioritize compatibility with multi-signal fusion and LLM alignment, rather than relying solely on larger models. This approach may reduce computational requirements while improving retrieval accuracy and efficiency.
zh

[AI-222] Risks of AI-driven product development and strategies for their mitigation

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在产品开发中日益增长的使用所带来的风险问题,包括技术风险和社会技术风险。其解决方案的关键在于提出一套安全的AI驱动产品开发原则,强调人类监督、责任归属和可解释性设计,以在促进技术创新的同时有效管理潜在风险。

链接: https://arxiv.org/abs/2506.00047
作者: Jan Göpfert,Jann M. Weinand,Patrick Kuckertz,Noah Pflugradt,Jochen Linßen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Humanity is progressing towards automated product development, a trend that promises faster creation of better products and thus the acceleration of technological progress. However, increasing reliance on non-human agents for this process introduces many risks. This perspective aims to initiate a discussion on these risks and appropriate mitigation strategies. To this end, we outline a set of principles for safer AI-driven product development which emphasize human oversight, accountability, and explainable design, among others. The risk assessment covers both technical risks which affect product quality and safety, and sociotechnical risks which affect society. While AI-driven product development is still in its early stages, this discussion will help balance its opportunities and risks without delaying essential progress in understanding, norm-setting, and regulation.
zh

[AI-223] he Folly of AI for Age Verification AAAI

【速读】:该论文试图解决将生成式 AI (Generative AI) 用于年龄验证的可行性问题,指出此类系统在实际部署中存在易被规避以及对少数群体和低社会经济地位用户产生不成比例的误分类问题。解决方案的关键在于揭示当前 AI 模型及其运行的物理硬件存在技术局限性,这些局限性难以在不增加成本的情况下克服,因此基于政府身份证件的年龄验证仍是更可靠的选择。

链接: https://arxiv.org/abs/2506.00038
作者: Reid McIlroy-Young
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: From AI for Public Missions Workshop at AAAI

点击查看摘要

Abstract:In the near future a governmental body will be asked to allow companies to use AI for age verification. If they allow it the resulting system will both be easily circumvented and disproportionately misclassify minorities and low socioeconomic status users. This is predictable by showing that other very similar systems (facial recognition and remote proctoring software) have similar issues despite years of efforts to mitigate their biases. These biases are due to technical limitations both of the AI models themselves and the physical hardware they are running on that will be difficult to overcome below the cost of government ID-based age verification. Thus in, the near future, deploying an AI system for age verification is folly.
zh

[AI-224] Rapid yet accurate Tile-circuit and device modeling for Analog In-Memory Computing

【速读】:该论文试图解决模拟存内计算(Analog In-Memory Compute, AIMC)中由于模拟器件和电路非理想性导致的神经网络任务精度下降问题。其解决方案的关键在于建立一个能够准确捕捉瞬时电流IR降和ADC量化效应的数学模型,并将其集成到基于PyTorch的框架中,以评估BERT和ALBERT等Transformer网络的精度影响。该模型能够快速预测矩阵-向量乘法(MVM)模块的输出,从而为硬件感知的微调提供依据,其中简单高斯噪声在应对ADC量化和PCM读取噪声方面有效,但在应对IR降方面效果有限,这表明需要更复杂的训练方法来提升大规模神经网络在AIMC硬件上的鲁棒性。

链接: https://arxiv.org/abs/2506.00004
作者: J. Luquin,C. Mackin,S. Ambrogio,A. Chen,F. Baldi,G. Miralles,M.J. Rasch,J. Büchel,M. Lalwani,W. Ponghiran,P. Solomon,H. Tsai,G.W. Burr,P. Narayanan
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Analog In-Memory Compute (AIMC) can improve the energy efficiency of Deep Learning by orders of magnitude. Yet analog-domain device and circuit non-idealities – within the analog ``Tiles’’ performing Matrix-Vector Multiply (MVM) operations – can degrade neural-network task accuracy. We quantify the impact of low-level distortions and noise, and develop a mathematical model for Multiply-ACcumulate (MAC) operations mapped to analog tiles. Instantaneous-current IR-drop (the most significant circuit non-ideality), and ADC quantization effects are fully captured by this model, which can predict MVM tile-outputs both rapidly and accurately, as compared to much slower rigorous circuit simulations. A statistical model of PCM read noise at nanosecond timescales is derived from – and matched against – experimental measurements. We integrate these (statistical) device and (deterministic) circuit effects into a PyTorch-based framework to assess the accuracy impact on the BERT and ALBERT Transformer networks. We show that hardware-aware fine-tuning using simple Gaussian noise provides resilience against ADC quantization and PCM read noise effects, but is less effective against IR-drop. This is because IR-drop – although deterministic – is non-linear, is changing significantly during the time-integration window, and is ultimately dependent on all the excitations being introduced in parallel into the analog tile. The apparent inability of simple Gaussian noise applied during training to properly prepare a DNN network for IR-drop during inference implies that more complex training approaches – incorporating advances such as the Tile-circuit model introduced here – will be critical for resilient deployment of large neural networks onto AIMC hardware.
zh

[AI-225] Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization

【速读】:该论文旨在解决生成式 AI 在硬件设计生成中的三个关键挑战:数据可用性有限、数据质量参差不齐以及推理阶段效率不足。其解决方案的关键在于提出一个两阶段框架,通过探索去中心化训练和个性化推理来提升性能。第一阶段利用分层去中心化训练机制挖掘私有领域设计资源,并通过用户定义的指标优化模型聚合以缓解低质量数据的影响;第二阶段则聚焦于客户端个性化,引入新的度量标准 Trueput 来分析效率,并通过个性化推理加速和定制化采样策略优化 Trueput,从而提升生成速度与质量。

链接: https://arxiv.org/abs/2506.00002
作者: Hao Mark Chen,Zehuan Zhang,Wanru Zhao,Nicholas Lane,Hongxiang Fan
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent years have witnessed a significant increase in the adoption of AI techniques to enhance electronic design automation. In particular, the emergence of Large Language Models (LLMs) has sparked significant interest in LLM-assisted hardware design generation, spanning applications from classical digital circuits to quantum computing. Despite substantial progress in this direction, the quality of LLM-generated hardware design still cannot meet the requirements for practical deployment. In this work, we identify three critical challenges hindering the development of LLM-assisted hardware design generation: 1) limited data availability, 2) varied data quality, 3) inadequate inference-time efficiency. To address these fundamental challenges, this paper introduces a two-stage framework for AI-assisted hardware design by exploring decentralized training and personalized inference. In the first stage, we propose to harness private domain design sources through a hierarchical decentralized training mechanism that addresses data-sharing constraints. To mitigate the impact of low-quality data, we identify optimization opportunities in hardware generation tasks, using user-defined metrics for model aggregation. The second stage focuses on client personalization to enhance both speed and quality. We introduce a new metric, Trueput, to analyze LLM-assisted hardware generation efficiency. To optimize Trueput, we implement personalized inference-time acceleration and customized sampling strategies. Evaluating both classical and quantum benchmarks, our experimental results demonstrate that the proposed two-stage framework can significantly improve the model capability for hardware design generation. As orthogonal enhancements to existing methods, our framework can achieve 33% \sim 50% semantic accuracy improvement and 2.3 times speedup, depending on the difficulty of the generation tasks.
zh

[AI-226] A Quantum Information Theoretic Approach to Tractable Probabilistic Models

【速读】:该论文试图解决传统概率电路在建模复杂分布时的局限性,其解决方案的关键在于引入正则单位电路(Positive Unital Circuits, PUnCs),通过将电路计算从正实值概率推广到正半定矩阵,从而扩展了概率电路的表达能力,并严格涵盖了包括PSD电路在内的最新电路类别。

链接: https://arxiv.org/abs/2506.01824
作者: Pedro Zuidberg Dos Martires
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:By recursively nesting sums and products, probabilistic circuits have emerged in recent years as an attractive class of generative models as they enjoy, for instance, polytime marginalization of random variables. In this work we study these machine learning models using the framework of quantum information theory, leading to the introduction of positive unital circuits (PUnCs), which generalize circuit evaluations over positive real-valued probabilities to circuit evaluations over positive semi-definite matrices. As a consequence, PUnCs strictly generalize probabilistic circuits as well as recently introduced circuit classes such as PSD circuits.
zh

[AI-227] Overcoming Data Scarcity in Scanning Tunnelling Microscopy Image Segmentation

【速读】:该论文旨在解决扫描隧道显微镜(Scanning Tunnelling Microscopy, STM)图像分析中手动标记特征耗时费力的问题。传统方法依赖大量人工标注的数据集,限制了其灵活性和适应性。本文提出的解决方案关键在于结合少量样本学习(few-shot learning)和无监督学习,从而无需依赖大规模人工标注数据,提高了模型在未见过表面的泛化能力,并且在仅需少量额外标注数据的情况下仍能保持高精度。

链接: https://arxiv.org/abs/2506.01678
作者: Nikola L. Kolev,Max Trouton,Filippo Federici Canova,Geoff Thornton,David Z. Gao,Neil J. Curson,Taylor J. Z. Stock
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scanning tunnelling microscopy (STM) is a powerful technique for imaging surfaces with atomic resolution, providing insight into physical and chemical processes at the level of single atoms and molecules. A regular task of STM image analysis is the identification and labelling of features of interest against a uniform background. Performing this manually is a labour-intensive task, requiring significant human effort. To reduce this burden, we propose an automated approach to the segmentation of STM images that uses both few-shot learning and unsupervised learning. Our technique offers greater flexibility compared to previous supervised methods; it removes the requirement for large manually annotated datasets and is thus easier to adapt to an unseen surface while still maintaining a high accuracy. We demonstrate the effectiveness of our approach by using it to recognise atomic features on three distinct surfaces: Si(001), Ge(001), and TiO _2 (110), including adsorbed AsH _3 molecules on the silicon and germanium surfaces. Our model exhibits strong generalisation capabilities, and following initial training, can be adapted to unseen surfaces with as few as one additional labelled data point. This work is a significant step towards efficient and material-agnostic, automatic segmentation of STM images.
zh

[AI-228] Synthesis of discrete-continuous quantum circuits with multimodal diffusion models

【速读】:该论文试图解决量子操作高效编译的问题,这是量子计算扩展中的主要瓶颈。当前的先进方法通过结合搜索算法与基于梯度的参数优化来实现低编译误差,但其运行时间长且需要多次调用量子硬件或昂贵的经典模拟,导致难以扩展。该论文提出的关键解决方案是一种多模态去噪扩散模型,该模型同时生成电路结构和连续参数以编译目标酉矩阵,其核心在于利用两个独立的扩散过程分别处理离散门选择和参数预测。

链接: https://arxiv.org/abs/2506.01666
作者: Florian Fürrutter,Zohim Chandani,Ikko Hamamura,Hans J. Briegel,Gorka Muñoz-Gil
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Main Text: 10 pages and 5 figures; Appendix: 17 pages, 7 figures and 1 table. Code available at: this https URL

点击查看摘要

Abstract:Efficiently compiling quantum operations remains a major bottleneck in scaling quantum computing. Today’s state-of-the-art methods achieve low compilation error by combining search algorithms with gradient-based parameter optimization, but they incur long runtimes and require multiple calls to quantum hardware or expensive classical simulations, making their scaling prohibitive. Recently, machine-learning models have emerged as an alternative, though they are currently restricted to discrete gate sets. Here, we introduce a multimodal denoising diffusion model that simultaneously generates a circuit’s structure and its continuous parameters for compiling a target unitary. It leverages two independent diffusion processes, one for discrete gate selection and one for parameter prediction. We benchmark the model over different experiments, analyzing the method’s accuracy across varying qubit counts, circuit depths, and proportions of parameterized gates. Finally, by exploiting its rapid circuit generation, we create large datasets of circuits for particular operations and use these to extract valuable heuristics that can help us discover new insights into quantum circuit synthesis.
zh

[AI-229] Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech INTERSPEECH2025

【速读】:该论文试图解决自动语音识别(ASR)系统在处理构音障碍(dysarthric)语音时表现不佳的问题,主要原因是构音障碍语音具有较高的说话人间差异性和较慢的语速。解决方案的关键在于提出一种基于音节的节奏建模方法,以改进构音障碍语音到健康语音的转换,从而提升ASR性能。该方法扩展了现有的Rhythm and Voice (RnV)转换框架,并通过训练LF-MMI模型和微调Whisper在转换后的语音上进行评估,结果显示该方法在严重构音障碍情况下显著降低了词错误率。

链接: https://arxiv.org/abs/2506.01618
作者: Karl El Hajal,Enno Hermann,Sevada Hovsepyan,Mathew Magimai.-Doss
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems struggle with dysarthric speech due to high inter-speaker variability and slow speaking rates. To address this, we explore dysarthric-to-healthy speech conversion for improved ASR performance. Our approach extends the Rhythm and Voice (RnV) conversion framework by introducing a syllable-based rhythm modeling method suited for dysarthric speech. We assess its impact on ASR by training LF-MMI models and fine-tuning Whisper on converted speech. Experiments on the Torgo corpus reveal that LF-MMI achieves significant word error rate reductions, especially for more severe cases of dysarthria, while fine-tuning Whisper on converted data has minimal effect on its performance. These results highlight the potential of unsupervised rhythm and voice conversion for dysarthric ASR. Code available at: this https URL
zh

[AI-230] Advanced Nanostructured Topical Therapeutics for Psoriasis: Strategic Synthesis Multimodal Characterization and Preliminary Pharmacodynamic Profiling

【速读】:该论文试图解决银屑病(psoriasis)这一长期炎症性皮肤疾病难以治疗的问题,其解决方案的关键在于将金属氧化物纳米颗粒(如二氧化铈、氧化锌和银)与天然植物提取物结合,制备一种基于鱼胶原蛋白和琼脂的凝胶型外用制剂。通过表征纳米颗粒的物理化学性质,并利用植物来源的抗氧化剂增强治疗效果,实验结果显示该制剂在动物模型中表现出更快的伤口愈合和显著的抗炎作用,表明该组合策略可能为银屑病提供一种有前景的新治疗方法。

链接: https://arxiv.org/abs/2506.01572
作者: Iqra Yousaf,Aqsa Yousaf
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph)
备注: 24 pages

点击查看摘要

Abstract:Psoriasis is a long-term inflammatory skin disease that remains difficult to treat. In this study, we developed a new topical treatment by combining metal oxide nanoparticles: cerium oxide (CeO2), zinc oxide (ZnO), and silver (Ag), with natural plant extracts in a gel made from fish collagen and agar. The nanoparticles were characterized using UV-Vis spectroscopy, dynamic light scattering (DLS), Fourier-transform infrared spectroscopy (FTIR), and scanning electron microscopy (SEM), showing good stability and a uniform particle size distribution (ZnO averaged 66 nm). To enhance therapeutic potential, the gel was enriched with plant-derived antioxidants from bitter melon, ginger, and neem. This formulation was tested on an animal model of psoriasis. The treated group exhibited faster wound healing and reduced inflammation compared to both placebo and untreated groups, with statistically significant results (p 0.01 to p 0.001) observed from Day 3, becoming more pronounced by Day 14. These results indicate that the combination of nanoparticles with plant-based components in a topical gel may provide a promising new approach to psoriasis treatment. Further studies are recommended to evaluate long-term safety and therapeutic effectiveness. Comments: 24 pages Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph) MSC classes: 92C50 ACMclasses: J.3; I.4.5; J.2 Cite as: arXiv:2506.01572 [physics.med-ph] (or arXiv:2506.01572v1 [physics.med-ph] for this version) https://doi.org/10.48550/arXiv.2506.01572 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-231] GenDMR: A dynamic multimodal role-swapping network for identifying risk gene phenotypes

【速读】:该论文旨在解决当前深度学习方法在整合多模态数据(如影像学特征与遗传特征)进行阿尔茨海默病(Alzheimer’s disease, AD)病因分析和预测诊断时存在的关键问题,包括遗传信息的选择与编码不足,以及影像学特征因分类价值较高而掩盖了遗传特征的独特价值。其解决方案的关键在于提出动态多模态角色互换网络(GenDMR),通过新颖的单核苷酸多态性(SNP)空间组织编码方法增强基因组背景表示,并引入多实例注意力模块以自适应量化SNP和脑区的疾病风险,同时结合主导模态选择模块与对比自蒸馏模块,实现基于主导与辅助模态的教师-学生角色动态交换机制,从而促进不同模态数据的双向协同更新。

链接: https://arxiv.org/abs/2506.01456
作者: Lina Qin,Cheng Zhu,Chuqi Zhou,Yukun Huang,Jiayi Zhu,Ping Liang,Jinju Wang,Yixing Huang,Cheng Luo,Dezhong Yao,Ying Tan
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 31 pages, 9 figures

点击查看摘要

Abstract:Recent studies have shown that integrating multimodal data fusion techniques for imaging and genetic features is beneficial for the etiological analysis and predictive diagnosis of Alzheimer’s disease (AD). However, there are several critical flaws in current deep learning methods. Firstly, there has been insufficient discussion and exploration regarding the selection and encoding of genetic information. Secondly, due to the significantly superior classification value of AD imaging features compared to genetic features, many studies in multimodal fusion emphasize the strengths of imaging features, actively mitigating the influence of weaker features, thereby diminishing the learning of the unique value of genetic features. To address this issue, this study proposes the dynamic multimodal role-swapping network (GenDMR). In GenDMR, we develop a novel approach to encode the spatial organization of single nucleotide polymorphisms (SNPs), enhancing the representation of their genomic context. Additionally, to adaptively quantify the disease risk of SNPs and brain region, we propose a multi-instance attention module to enhance model interpretability. Furthermore, we introduce a dominant modality selection module and a contrastive self-distillation module, combining them to achieve a dynamic teacher-student role exchange mechanism based on dominant and auxiliary modalities for bidirectional co-updating of different modal data. Finally, GenDMR achieves state-of-the-art performance on the ADNI public dataset and visualizes attention to different SNPs, focusing on confirming 12 potential high-risk genes related to AD, including the most classic APOE and recently highlighted significant risk genes. This demonstrates GenDMR’s interpretable analytical capability in exploring AD genetic features, providing new insights and perspectives for the development of multimodal data fusion techniques.
zh

[AI-232] From Initial Data to Boundary Layers: Neural Networks for Nonlinear Hyperbolic Conservation Laws

【速读】:该论文试图解决非线性严格双曲守恒律初边值问题的熵解近似问题(entropy solutions to initial-boundary value problems for nonlinear strictly hyperbolic conservation laws)。其解决方案的关键在于引入一种通用且系统化的框架,用于设计高效且可靠的学习算法,该框架结合了训练过程中的快速收敛与预测的高精度。

链接: https://arxiv.org/abs/2506.01453
作者: Igor Ciril,Khalil Haddaoui,Yohann Tendero
机构: 未知
类目: Analysis of PDEs (math.AP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We address the approximation of entropy solutions to initial-boundary value problems for nonlinear strictly hyperbolic conservation laws using neural networks. A general and systematic framework is introduced for the design of efficient and reliable learning algorithms, combining fast convergence during training with accurate predictions. The methodology is assessed through a series of one-dimensional scalar test cases, highlighting its potential applicability to more complex industrial scenarios.
zh

[AI-233] Can AI Master Econometrics? Evidence from Econometrics AI Agent on Expert-Level Tasks

【速读】:该论文试图解决如何使人工智能(AI)有效执行传统上需要人类专业知识的复杂计量经济学分析的问题。其解决方案的关键在于开发了一个基于开源MetaGPT框架的“计量经济学AI代理”(Econometrics AI Agent),该代理在战略规划计量经济任务、生成和执行代码、基于错误的反思以提高稳健性以及通过多轮对话进行迭代优化方面表现出色。通过构建来自学术课程材料和已发表研究论文的两个数据集,验证了该代理在现实世界挑战中的卓越性能,并证明其在计量经济学领域的专业能力显著优于基准大型语言模型(LLM)和通用AI代理。

链接: https://arxiv.org/abs/2506.00856
作者: Qiang Chen,Tianyang Han,Jin Li,Ye Luo,Yuxiao Wu,Xiaowei Zhang,Tuo Zhou
机构: 未知
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Can AI effectively perform complex econometric analysis traditionally requiring human expertise? This paper evaluates an agentic AI’s capability to master econometrics, focusing on empirical analysis performance. We develop an ``Econometrics AI Agent’’ built on the open-source MetaGPT framework. This agent exhibits outstanding performance in: (1) planning econometric tasks strategically, (2) generating and executing code, (3) employing error-based reflection for improved robustness, and (4) allowing iterative refinement through multi-round conversations. We construct two datasets from academic coursework materials and published research papers to evaluate performance against real-world challenges. Comparative testing shows our domain-specialized agent significantly outperforms both benchmark large language models (LLMs) and general-purpose AI agents. This work establishes a testbed for exploring AI’s impact on social science research and enables cost-effective integration of domain expertise, making advanced econometric methods accessible to users with minimal coding expertise. Furthermore, our agent enhances research reproducibility and offers promising pedagogical applications for econometrics teaching.
zh

[AI-234] Attention-Aided MMSE for OFDM Channel Estimation: Learning Linear Filters with Attention

【速读】:该论文旨在解决正交频分复用(Orthogonal Frequency Division Multiplexing, OFDM)系统中准确信道估计的问题,传统基于信号处理的方法如最小均方误差(Minimum Mean-Squared Error, MMSE)估计需要难以获取的二阶统计量,而现有的深度神经网络方法则存在计算复杂度高的问题。论文提出的解决方案是Attention-aided MMSE(A-MMSE),其关键在于通过Attention Transformer学习最优MMSE滤波器,并在推理阶段仅通过一次线性操作进行信道估计,从而消除非线性激活函数,降低计算复杂度。

链接: https://arxiv.org/abs/2506.00452
作者: TaeJun Ha,Chaehyun Jung,Hyeonuk Kim,Jeongwoo Park,Jeonghun Park
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 13 pages, 12 figures

点击查看摘要

Abstract:In orthogonal frequency division multiplexing (OFDM), accurate channel estimation is crucial. Classical signal processing based approaches, such as minimum mean-squared error (MMSE) estimation, often require second-order statistics that are difficult to obtain in practice. Recent deep neural networks based methods have been introduced to address this; yet they often suffer from high complexity. This paper proposes an Attention-aided MMSE (A-MMSE), a novel model-based DNN framework that learns the optimal MMSE filter via the Attention Transformer. Once trained, the A-MMSE estimates the channel through a single linear operation for channel estimation, eliminating nonlinear activations during inference and thus reducing computational complexity. To enhance the learning efficiency of the A-MMSE, we develop a two-stage Attention encoder, designed to effectively capture the channel correlation structure. Additionally, a rank-adaptive extension of the proposed A-MMSE allows flexible trade-offs between complexity and channel estimation accuracy. Extensive simulations with 3GPP TDL channel models demonstrate that the proposed A-MMSE consistently outperforms other baseline methods in terms of normalized MSE across a wide range of SNR conditions. In particular, the A-MMSE and its rank-adaptive extension establish a new frontier in the performance complexity trade-off, redefining the standard for practical channel estimation methods.
zh

[AI-235] Neural Network-based Information-Theoretic Transceivers for High-Order Modulation Schemes

【速读】:该论文旨在解决传统端到端(E2E)通信系统中计算效率与性能平衡的问题,特别是在高阶调制方案下如何提升系统性能。其解决方案的关键在于提出一种基于神经网络(NN)的比特级接收机,该接收机在保持与基准解映射器相当性能的同时提高了计算效率,并进一步引入了一种基于符号级自编码器(AE)的E2E系统,通过物理层联合优化发射机和接收机,从而实现更优的系统性能。

链接: https://arxiv.org/abs/2506.00368
作者: Ngoc Long Pham,Tri Nhu Do
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural network (NN)-based end-to-end (E2E) communication systems, in which each system component may consist of a portion of a neural network, have been investigated as potential tools for developing artificial intelligence (Al)-native E2E systems. In this paper, we propose an NN-based bitwise receiver that improves computational efficiency while maintaining performance comparable to baseline demappers. Building on this foundation, we introduce a novel symbol-wise autoencoder (AE)-based E2E system that jointly optimizes the transmitter and receiver at the physical layer. We evaluate the proposed NN-based receiver using bit-error rate (BER) analysis to confirm that the numerical BER achieved by NN-based receivers or transceivers is accurate. Results demonstrate that the AE-based system outperforms baseline architectures, particularly for higher-order modulation schemes. We further show that the training signal-to-noise ratio (SNR) significantly affects the performance of the systems when inference is conducted at different SNR levels.
zh

[AI-236] Beyond Winning: Margin of Victory Relative to Expectation Unlocks Accurate Skill Ratings

【速读】:该论文试图解决传统评分系统(如ELO)在竞技环境中仅依赖二元结果而忽略比赛表现幅度的问题,从而导致信息丢失。其解决方案的关键在于引入一种名为“Margin of Victory Differential Analysis (MOVDA)”的框架,该框架通过比较实际比赛得分差(MOV)与模型预测的MOV之间的偏差,来提供更细致且加权的评分更新信号。MOVDA利用一个领域特定的非线性函数(缩放双曲正切函数)来预测MOV,并通过真实MOV与预期MOV的差异,提升评分系统的准确性与收敛速度。

链接: https://arxiv.org/abs/2506.00348
作者: Shivam Shorewala,Zihao Yang
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge of accurate relative skills in any competitive system is essential, but foundational approaches such as ELO discard extremely relevant performance data by concentrating exclusively on binary outcomes. While margin of victory (MOV) extensions exist, they often lack a definitive method for incorporating this information. We introduce Margin of Victory Differential Analysis (MOVDA), a framework that enhances traditional rating systems by using the deviation between the true MOV and a \textitmodeled expectation . MOVDA learns a domain-specific, non-linear function (a scaled hyperbolic tangent that captures saturation effects and home advantage) to predict expected MOV based on rating differentials. Crucially, the \textitdifference between the true and expected MOV provides a subtle and weighted signal for rating updates, highlighting informative deviations in all levels of contests. Extensive experiments on professional NBA basketball data (from 2013 to 2023, with 13,619 games) show that MOVDA significantly outperforms standard ELO and Bayesian baselines. MOVDA reduces Brier score prediction error by 1.54% compared to TrueSkill, increases outcome accuracy by 0.58% , and most importantly accelerates rating convergence by 13.5% , while maintaining the computational efficiency of the original ELO updates. MOVDA offers a theoretically motivated, empirically superior, and computationally lean approach to integrating performance magnitude into skill rating for competitive environments like the NBA.
zh

[AI-237] Recover Experimental Data with Selection Bias using Counterfactual Logic

【速读】:该论文旨在解决选择偏差(selection bias)对因果推断有效性的影响问题,特别是在实验数据中如何恢复无偏的因果分布。其解决方案的关键在于通过结构因果模型(Structural Causal Models, SCMs)显式构建反事实世界,并分析观测世界中的选择机制如何传播到反事实领域,从而确定实验分布是否受选择偏差影响。研究提出了图形化和理论化的完整标准,以及利用部分无偏观测数据从有偏实验数据中恢复目标因果分布的方法,为实际因果推断中的选择偏差缓解提供了实用指导。

链接: https://arxiv.org/abs/2506.00335
作者: Jingyang He,Shuai Wang,Ang Li
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Selection bias, arising from the systematic inclusion or exclusion of certain samples, poses a significant challenge to the validity of causal inference. While Bareinboim et al. introduced methods for recovering unbiased observational and interventional distributions from biased data using partial external information, the complexity of the backdoor adjustment and the method’s strong reliance on observational data limit its applicability in many practical settings. In this paper, we formally discover the recoverability of P(Y^_x^) under selection bias with experimental data. By explicitly constructing counterfactual worlds via Structural Causal Models (SCMs), we analyze how selection mechanisms in the observational world propagate to the counterfactual domain. We derive a complete set of graphical and theoretical criteria to determine that the experimental distribution remain unaffected by selection bias. Furthermore, we propose principled methods for leveraging partially unbiased observational data to recover P(Y^_x^) from biased experimental datasets. Simulation studies replicating realistic research scenarios demonstrate the practical utility of our approach, offering concrete guidance for mitigating selection bias in applied causal inference.
zh

[AI-238] Diff-SPORT: Diffusion-based Sensor Placement Optimization and Reconstruction of Turbulent flows in urban environments

【速读】:该论文旨在解决城市环境中湍流风场监测的准确性与效率问题,特别是在传统稀疏重构和传感器布置策略在实际约束下面临精度下降的情况下。其解决方案的关键在于提出Diff-SPORT框架,该框架结合了生成式扩散模型(Generative Diffusion Model)、最大后验(Maximum A Posteriori, MAP)推断方案以及Shapley值归因框架,实现了高保真流场重构与最优传感器布置的可扩展且可解释的解决方案。

链接: https://arxiv.org/abs/2506.00214
作者: Abhijeet Vishwasrao,Sai Bharath Chandra Gutha,Andres Cremades,Klas Wijk,Aakash Patil,Catherine Gorle,Beverley J McKeon,Hossein Azizpour,Ricardo Vinuesa
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rapid urbanization demands accurate and efficient monitoring of turbulent wind patterns to support air quality, climate resilience and infrastructure design. Traditional sparse reconstruction and sensor placement strategies face major accuracy degradations under practical constraints. Here, we introduce Diff-SPORT, a diffusion-based framework for high-fidelity flow reconstruction and optimal sensor placement in urban environments. Diff-SPORT combines a generative diffusion model with a maximum a posteriori (MAP) inference scheme and a Shapley-value attribution framework to propose a scalable and interpretable solution. Compared to traditional numerical methods, Diff-SPORT achieves significant speedups while maintaining both statistical and instantaneous flow fidelity. Our approach offers a modular, zero-shot alternative to retraining-intensive strategies, supporting fast and reliable urban flow monitoring under extreme sparsity. Diff-SPORT paves the way for integrating generative modeling and explainability in sustainable urban intelligence.
zh

[AI-239] Autonomous Behavior and Whole-Brain Dynamics Emerge in Embodied Zebrafish Agents with Model-based Intrinsic Motivation

【速读】:该论文试图解决在稀疏奖励和无奖励环境中,现有强化学习方法在探索模式上表现不一致,无法生成类似动物的稳健自主行为的问题,同时指出系统神经科学对自主性的神经基础研究不足。解决方案的关键在于提出一种基于模型的内在驱动机制(3M-Progress),通过跟踪智能体当前世界模型与生态学先验之间的差异来激发自然行为,从而实现对自主探索的有效建模。

链接: https://arxiv.org/abs/2506.00138
作者: Reece Keller,Alyn Tornell,Felix Pei,Xaq Pitkow,Leo Kozachkov,Aran Nayebi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:Autonomy is a hallmark of animal intelligence, enabling adaptive and intelligent behavior in complex environments without relying on external reward or task structure. Existing reinforcement learning approaches to exploration in sparse reward and reward-free environments, including class of methods known as intrinsic motivation, exhibit inconsistent exploration patterns and thus fail to produce robust autonomous behaviors observed in animals. Moreover, systems neuroscience has largely overlooked the neural basis of autonomy, focusing instead on experimental paradigms where animals are motivated by external reward rather than engaging in unconstrained, naturalistic and task-independent behavior. To bridge these gaps, we introduce a novel model-based intrinsic drive explicitly designed to capture robust autonomous exploration observed in animals. Our method (3M-Progress) motivates naturalistic behavior by tracking divergence between the agent’s current world model and an ethological prior. We demonstrate that artificial embodied agents trained with 3M-Progress capture the explainable variance in behavioral patterns and whole-brain neural-glial dynamics recorded from autonomously-behaving larval zebrafish, introducing the first goal-driven, population-level model of neural-glial computation. Our findings establish a computational framework connecting model-based intrinsic motivation to naturalistic behavior, providing a foundation for building artificial agents with animal-like autonomy.
zh

[AI-240] PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset NIPS2025

【速读】:该论文试图解决在肺癌中准确预测基因突变、突变亚型及其外显子的问题,以支持个性化治疗计划和预后评估。面对医疗资源区域差异和基因组检测的高成本,利用人工智能从常规组织病理学图像中推断这些突变和外显子变异成为一种可行方案。解决方案的关键在于构建了PathGene数据集,该数据集包含来自中南大学湘雅二医院的1,576例患者和TCGA-LUAD数据库中的448例患者的组织病理学图像与下一代测序报告,将全切片图像与驱动基因突变状态、突变亚型、外显子及肿瘤突变负荷(TMB)状态相链接,从而为早期遗传筛查和精准肿瘤学提供病理图像预测突变、亚型、外显子位置和TMB的可能。

链接: https://arxiv.org/abs/2506.00096
作者: Liangrui Pan,Qingchun Liang,Shen Zhao,Songqing Fan,Shaoliang Peng
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: Submit to NIPS2025

点击查看摘要

Abstract:Accurately predicting gene mutations, mutation subtypes and their exons in lung cancer is critical for personalized treatment planning and prognostic assessment. Faced with regional disparities in medical resources and the high cost of genomic assays, using artificial intelligence to infer these mutations and exon variants from routine histopathology images could greatly facilitate precision therapy. Although some prior studies have shown that deep learning can accelerate the prediction of key gene mutations from lung cancer pathology slides, their performance remains suboptimal and has so far been limited mainly to early screening tasks. To address these limitations, we have assembled PathGene, which comprises histopathology images paired with next-generation sequencing reports from 1,576 patients at the Second Xiangya Hospital, Central South University, and 448 TCGA-LUAD patients. This multi-center dataset links whole-slide images to driver gene mutation status, mutation subtypes, exon, and tumor mutational burden (TMB) status, with the goal of leveraging pathology images to predict mutations, subtypes, exon locations, and TMB for early genetic screening and to advance precision oncology. Unlike existing datasets, we provide molecular-level information related to histopathology images in PathGene to facilitate the development of biomarker prediction models. We benchmarked 11 multiple-instance learning methods on PathGene for mutation, subtype, exon, and TMB prediction tasks. These experimental methods provide valuable alternatives for early genetic screening of lung cancer patients and assisting clinicians to quickly develop personalized precision targeted treatment plans for patients. Code and data are available at this https URL.
zh

[AI-241] Artificial Empathy: AI based Mental Health

【速读】:该论文试图解决心理健康问题患者难以获得专业帮助或心理卫生护理的问题,提出利用AI聊天机器人作为替代性支持工具。解决方案的关键在于通过用户调研和基于场景的测试,评估大型语言模型(LLM)聊天机器人的实际应用效果,以优化其在情感支持和危机干预中的表现。

链接: https://arxiv.org/abs/2506.00081
作者: Aditya Naik,Jovi Thomas,Teja Sree,Himavant Reddy
机构: 未知
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Many people suffer from mental health problems but not everyone seeks professional help or has access to mental health care. AI chatbots have increasingly become a go-to for individuals who either have mental disorders or simply want someone to talk to. This paper presents a study on participants who have previously used chatbots and a scenario-based testing of large language model (LLM) chatbots. Our findings indicate that AI chatbots were primarily utilized as a “Five minute therapist” or as a non-judgmental companion. Participants appreciated the anonymity and lack of judgment from chatbots. However, there were concerns about privacy and the security of sensitive information. The scenario-based testing of LLM chatbots highlighted additional issues. Some chatbots were consistently reassuring, used emojis and names to add a personal touch, and were quick to suggest seeking professional help. However, there were limitations such as inconsistent tone, occasional inappropriate responses (e.g., casual or romantic), and a lack of crisis sensitivity, particularly in recognizing red flag language and escalating responses appropriately. These findings can inform both the technology and mental health care industries on how to better utilize AI chatbots to support individuals during challenging emotional periods.
zh

[AI-242] Human sensory-musculoskeletal modeling and control of whole-body movements

【速读】:该论文旨在解决人类运动控制中多感官输入整合、感觉运动转换及运动执行的复杂性问题,其核心挑战在于如何构建能够准确模拟人体感觉-肌骨系统动态行为的模型。解决方案的关键在于提出了一种称为SMS-Human的人体感觉-肌骨模型,该模型结合了精确的骨骼、关节和肌肉-肌腱单元解剖结构,并融合了视觉、前庭、本体感觉和触觉等多种感官输入,同时采用分阶段的分层深度强化学习框架,以应对肌肉骨骼系统在集成多模态感知信息下的高维控制难题。

链接: https://arxiv.org/abs/2506.00071
作者: Chenhui Zuo,Guohao Lin,Chen Zhang,Shanning Zhuang,Yanan Sui
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Coordinated human movement depends on the integration of multisensory inputs, sensorimotor transformation, and motor execution, as well as sensory feedback resulting from body-environment interaction. Building dynamic models of the sensory-musculoskeletal system is essential for understanding movement control and investigating human behaviours. Here, we report a human sensory-musculoskeletal model, termed SMS-Human, that integrates precise anatomical representations of bones, joints, and muscle-tendon units with multimodal sensory inputs involving visual, vestibular, proprioceptive, and tactile components. A stage-wise hierarchical deep reinforcement learning framework was developed to address the inherent challenges of high-dimensional control in musculoskeletal systems with integrated multisensory information. Using this framework, we demonstrated the simulation of three representative movement tasks, including bipedal locomotion, vision-guided object manipulation, and human-machine interaction during bicycling. Our results showed a close resemblance between natural and simulated human motor behaviours. The simulation also revealed musculoskeletal dynamics that could not be directly measured. This work sheds deeper insights into the sensorimotor dynamics of human movements, facilitates quantitative understanding of human behaviours in interactive contexts, and informs the design of systems with embodied intelligence.
zh

[AI-243] Improving statistical learning methods via features selection without replacement sampling and random projection

【速读】:该论文试图解决高维微阵列数据在分类模型中的“小样本、大特征”(small n, large p)问题,该问题导致模型过拟合。其解决方案的关键在于提出一种结合无放回特征选择(Feature Selection Without Re-placement, FSWOR)和投影方法的机器学习方法,并通过Kendall统计检验筛选显著基因,从而降低特征空间维度;同时采用集成分类器与线性判别分析(LDA)投影及朴素贝叶斯相结合的模型,提升了分类准确性,最终在测试集上达到了96%的准确率,优于现有方法。

链接: https://arxiv.org/abs/2506.00053
作者: Sulaiman khan,Muhammad Ahmad,Fida Ullah,Carlos Aguilar Ibañez,José Eduardo Valdez Rodriguez
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Cancer is fundamentally a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression, leading to uncontrolled cell growth and metastasis. High-dimensional microarray datasets pose challenges for classification models due to the “small n, large p” problem, resulting in overfitting. This study makes three different key contributions: 1) we propose a machine learning-based approach integrating the Feature Selection Without Re-placement (FSWOR) technique and a projection method to improve classification accuracy. 2) We apply the Kendall statistical test to identify the most significant genes from the brain cancer mi-croarray dataset (GSE50161), reducing the feature space from 54,675 to 20,890 genes.3) we apply machine learning models using k-fold cross validation techniques in which our model incorpo-rates ensemble classifiers with LDA projection and Naïve Bayes, achieving a test score of 96%, outperforming existing methods by 9.09%. The results demonstrate the effectiveness of our ap-proach in high-dimensional gene expression analysis, improving classification accuracy while mitigating overfitting. This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
zh

[AI-244] Using LLM s to Advance the Cognitive Science of Collectives

【速读】:该论文试图解决集体认知(collective cognition)研究中因复杂性而受到的阻碍问题,提出生成式 AI (Generative AI) 可能成为应对这一挑战的解决方案。其关键在于利用大语言模型(LLMs)处理和模拟复杂集体行为的能力,从而推动对集体认知机制的理解,同时指出需要新的方法来应对由此带来的潜在风险。

链接: https://arxiv.org/abs/2506.00052
作者: Ilia Sucholutsky,Katherine M. Collins,Nori Jacoby,Bill D. Thompson,Robert D. Hawkins
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:LLMs are already transforming the study of individual cognition, but their application to studying collective cognition has been underexplored. We lay out how LLMs may be able to address the complexity that has hindered the study of collectives and raise possible risks that warrant new methods.
zh

[AI-245] MolTextNet: A Two-Million Molecule-Text Dataset for Multimodal Molecular Learning

【速读】:该论文旨在解决现有分子-文本数据集在规模和信息量上的局限性,从而限制了通用多模态模型的训练。其解决方案的关键在于提出了一种合成文本生成流程,该流程整合了结构特征、计算属性、生物活性数据和合成复杂性,利用GPT-4o-mini为250万种来自ChEMBL35的分子生成结构化描述,文本长度超过以往数据集的10倍,构建了MolTextNet数据集。

链接: https://arxiv.org/abs/2506.00009
作者: Yihan Zhu,Gang Liu,Eric Inae,Meng Jiang
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注: 21 pages, 13 figures, 10 tables

点击查看摘要

Abstract:Small molecules are essential to drug discovery, and graph-language models hold promise for learning molecular properties and functions from text. However, existing molecule-text datasets are limited in scale and informativeness, restricting the training of generalizable multimodal models. We present MolTextNet, a dataset of 2.5 million high-quality molecule-text pairs designed to overcome these limitations. To construct it, we propose a synthetic text generation pipeline that integrates structural features, computed properties, bioactivity data, and synthetic complexity. Using GPT-4o-mini, we create structured descriptions for 2.5 million molecules from ChEMBL35, with text over 10 times longer than prior datasets. MolTextNet supports diverse downstream tasks, including property prediction and structure retrieval. Pretraining CLIP-style models with Graph Neural Networks and ModernBERT on MolTextNet yields improved performance, highlighting its potential for advancing foundational multimodal modeling in molecular science. Our dataset is available at this https URL.
zh

机器学习

[LG-0] Should Decision-Makers Reveal Classifiers in Online Strategic Classification?

链接: https://arxiv.org/abs/2506.01936
作者: Han Shao,Shuo Xie,Kunhe Yang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Strategic classification addresses a learning problem where a decision-maker implements a classifier over agents who may manipulate their features in order to receive favorable predictions. In the standard model of online strategic classification, in each round, the decision-maker implements and publicly reveals a classifier, after which agents perfectly best respond based on this knowledge. However, in practice, whether to disclose the classifier is often debated – some decision-makers believe that hiding the classifier can prevent misclassification errors caused by manipulation. In this paper, we formally examine how limiting the agents’ access to the current classifier affects the decision-maker’s performance. Specifically, we consider an extended online strategic classification setting where agents lack direct knowledge about the current classifier and instead manipulate based on a weighted average of historically implemented classifiers. Our main result shows that in this setting, the decision-maker incurs (1-\gamma)^-1 or k_\textin times more mistakes compared to the full-knowledge setting, where k_\textin is the maximum in-degree of the manipulation graph (representing how many distinct feature vectors can be manipulated to appear as a single one), and \gamma is the discount factor indicating agents’ memory of past classifiers. Our results demonstrate how withholding access to the classifier can backfire and degrade the decision-maker’s performance in online strategic classification. Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2506.01936 [cs.GT] (or arXiv:2506.01936v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2506.01936 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-1] Generalized Gradient Norm Clipping Non-Euclidean (L_0L_1)-Smoothness

链接: https://arxiv.org/abs/2506.01913
作者: Thomas Pethick,Wanyun Xie,Mete Erdogan,Kimon Antonakopoulos,Tony Silveti-Falls,Volkan Cevher
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work introduces a hybrid non-Euclidean optimization method which generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds by establishing a descent property under a generalized notion of ( L_0 , L_1 )-smoothness. Weight decay is incorporated in a principled manner by identifying a connection to the Frank-Wolfe short step. In the stochastic case, we show an order optimal O(n^-1/4) convergence rate by leveraging a momentum based gradient estimator. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.

[LG-2] SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data

链接: https://arxiv.org/abs/2506.01907
作者: Yan Zhou,Bradley Malin,Murat Kantarcioglu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, however, not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.

[LG-3] MLorc: Momentum Low-rank Compression for Large Language Model Adaptation

链接: https://arxiv.org/abs/2506.01897
作者: Wei Shen,Yaxiang Zhang,Minhui Huang,Mengfan Xu,Jiawei Zhang,Cong Shen
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). By directly compressing and reconstructing momentum rather than gradients, MLorc avoids imposing a fixed-rank constraint on weight update matrices and better preserves the training dynamics of full-parameter fine-tuning, in contrast to existing low-rank approaches such as LoRA and GaLore. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning with a small rank (e.g., r=4 ), and generalizes well across different optimizers – all while not compromising time or memory efficiency. Furthermore, we provide a theoretical guarantee for its convergence under reasonable assumptions.

[LG-4] NepTrain and NepTrainKit: Automated Active Learning and Visualization Toolkit for Neuroevolution Potentials

链接: https://arxiv.org/abs/2506.01868
作者: Chengbing Chen,Yutong Li,Rui Zhao,Zhoulin Liu,Zheyong Fan,Gang Tang,Zhiyong Wang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:As a machine-learned potential, the neuroevolution potential (NEP) method features exceptional computational efficiency and has been successfully applied in materials science. Constructing high-quality training datasets is crucial for developing accurate NEP models. However, the preparation and screening of NEP training datasets remain a bottleneck for broader applications due to their time-consuming, labor-intensive, and resource-intensive nature. In this work, we have developed NepTrain and NepTrainKit, which are dedicated to initializing and managing training datasets to generate high-quality training sets while automating NEP model training. NepTrain is an open-source Python package that features a bond length filtering method to effectively identify and remove non-physical structures from molecular dynamics trajectories, thereby ensuring high-quality training datasets. NepTrainKit is a graphical user interface (GUI) software designed specifically for NEP training datasets, providing functionalities for data editing, visualization, and interactive exploration. It integrates key features such as outlier identification, farthest-point sampling, non-physical structure detection, and configuration type selection. The combination of these tools enables users to process datasets more efficiently and conveniently. Using \rm CsPbI_3 as a case study, we demonstrate the complete workflow for training NEP models with NepTrain and further validate the models through materials property predictions. We believe this toolkit will greatly benefit researchers working with machine learning interatomic potentials.

[LG-5] rade-offs in Data Memorization via Strong Data Processing Inequalities COLT2025

链接: https://arxiv.org/abs/2506.01855
作者: Vitaly Feldman,Guy Kornowski,Xin Lyu
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: To appear in COLT 2025

点击查看摘要

Abstract:Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization’s role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, \Omega(d) bits of information about the training data need to be memorized when O(1) d -dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.

[LG-6] rojan Horse Hunt in Time Series Forecasting for Space Operations

链接: https://arxiv.org/abs/2506.01849
作者: Krzysztof Kotowski,Ramez Shendy,Jakub Nalepa,Przemysław Biecek,Piotr Wilczyński,Agata Kaczmarek,Dawid Płudowski,Artur Janicki,Evridiki Ntagiou
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This competition hosted on Kaggle (this https URL) is the first part of a series of follow-up competitions and hackathons related to the “Assurance for Space Domain AI Applications” project funded by the European Space Agency (this https URL). The competition idea is based on one of the real-life AI security threats identified within the project – the adversarial poisoning of continuously fine-tuned satellite telemetry forecasting models. The task is to develop methods for finding and reconstructing triggers (trojans) in advanced models for satellite telemetry forecasting used in safety-critical space operations. Participants are provided with 1) a large public dataset of real-life multivariate satellite telemetry (without triggers), 2) a reference model trained on the clean data, 3) a set of poisoned neural hierarchical interpolation (N-HiTS) models for time series forecasting trained on the dataset with injected triggers, and 4) Jupyter notebook with the training pipeline and baseline algorithm (the latter will be published in the last month of the competition). The main task of the competition is to reconstruct a set of 45 triggers (i.e., short multivariate time series segments) injected into the training data of the corresponding set of 45 poisoned models. The exact characteristics (i.e., shape, amplitude, and duration) of these triggers must be identified by participants. The popular Neural Cleanse method is adopted as a baseline, but it is not designed for time series analysis and new approaches are necessary for the task. The impact of the competition is not limited to the space domain, but also to many other safety-critical applications of advanced time series analysis where model poisoning may lead to serious consequences.

[LG-7] SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

链接: https://arxiv.org/abs/2506.01844
作者: Mustafa Shukor,Dana Aubakirova,Francesco Capuano,Pepijn Kooijmans,Steven Palma,Adil Zouitine,Michel Aractingi,Caroline Pascal,Martino Russi,Andres Marafioti,Simon Alibert,Matthieu Cord,Thomas Wolf,Remi Cadene
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 24 pages. Code and assets: this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive–often with billions of parameters–leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.

[LG-8] SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model ICML2025

链接: https://arxiv.org/abs/2506.01833
作者: Zhao Yang,Jiwei Zhu,Bing Su
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our \textbfS pecies- \textbfP rofile \textbfA daptive \textbfC ollaborative \textbfE xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at this https URL.

[LG-9] Memory Access Characterization of Large Language Models in CPU Environment and its Potential Impacts

链接: https://arxiv.org/abs/2506.01827
作者: Spencer Banasik
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 34 pages, 14 figures

点击查看摘要

Abstract:As machine learning algorithms are shown to be an increasingly valuable tool, the demand for their access has grown accordingly. Oftentimes, it is infeasible to run inference with larger models without an accelerator, which may be unavailable in environments that have constraints such as energy consumption, security, or cost. To increase the availability of these models, we aim to im- prove the LLM inference speed on a CPU-only environment by modifying the cache architecture. To determine what improvements could be made, we conducted two experiments using this http URL and the QWEN model: running various cache configurations and evaluating their performance, and outputting a trace of the memory footprint. Using these experiments, we investigate the memory access patterns and performance characteristics to identify potential optimizations.

[LG-10] Efficient Learning of Balanced Signed Graphs via Sparse Linear Programming

链接: https://arxiv.org/abs/2506.01826
作者: Haruki Yokota,Hiroshi Higashi,Yuichi Tanaka,Gene Cheung
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, submitted to IEEE Transactions on Signal Processing

点击查看摘要

Abstract:Signed graphs are equipped with both positive and negative edge weights, encoding pairwise correlations as well as anti-correlations in data. A balanced signed graph is a signed graph with no cycles containing an odd number of negative edges. Laplacian of a balanced signed graph has eigenvectors that map via a simple linear transform to ones in a corresponding positive graph Laplacian, thus enabling reuse of spectral filtering tools designed for positive graphs. We propose an efficient method to learn a balanced signed graph Laplacian directly from data. Specifically, extending a previous linear programming (LP) based sparse inverse covariance estimation method called CLIME, we formulate a new LP problem for each Laplacian column i , where the linear constraints restrict weight signs of edges stemming from node i , so that nodes of same / different polarities are connected by positive / negative edges. Towards optimal model selection, we derive a suitable CLIME parameter \rho based on a combination of the Hannan-Quinn information criterion and a minimum feasibility criterion. We solve the LP problem efficiently by tailoring a sparse LP method based on ADMM. We theoretically prove local solution convergence of our proposed iterative algorithm. Extensive experimental results on synthetic and real-world datasets show that our balanced graph learning method outperforms competing methods and enables reuse of spectral filters, wavelets, and graph convolutional nets (GCN) constructed for positive graphs.

[LG-11] Path Signatures for Feature Extraction. An Introduction to the Mathematics Underpinning an Efficient Machine Learning Technique

链接: https://arxiv.org/abs/2506.01815
作者: Stephan Sturm
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:We provide an introduction to the topic of path signatures as means of feature extraction for machine learning from data streams. The article stresses the mathematical theory underlying the signature methodology, highlighting the conceptual character without plunging into the technical details of rigorous proofs. These notes are based on an introductory presentation given to students of the Research Experience for Undergraduates in Industrial Mathematics and Statistics at Worcester Polytechnic Institute in June 2024.

[LG-12] IF-GUIDE: Influence Function-Guided Detoxification of LLM s

链接: https://arxiv.org/abs/2506.01790
作者: Zachary Coalson,Juhan Bae,Nicholas Carlini,Sanghyun Hong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Pre-print

点击查看摘要

Abstract:We study how training data contributes to the emergence of toxic behaviors in large-language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach - IF-Guide - which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity - by up to 10 \times compared to uncensored models, and up to 3 \times compared to baseline alignment methods, e.g., DPO and RAD - across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model - with 7.5 \times fewer parameters - can effectively serve as a proxy for identifying harmful data.

[LG-13] Federated Gaussian Mixture Models

链接: https://arxiv.org/abs/2506.01780
作者: Sophia Zhang Pettersson,Kuo-Yun Liang,Juan Carlos Andresen
类目: Machine Learning (cs.LG)
*备注: 19 pages, 6 figures. Submitted to ACM

点击查看摘要

Abstract:This paper introduces FedGenGMM, a novel one-shot federated learning approach for Gaussian Mixture Models (GMM) tailored for unsupervised learning scenarios. In federated learning (FL), where multiple decentralized clients collaboratively train models without sharing raw data, significant challenges include statistical heterogeneity, high communication costs, and privacy concerns. FedGenGMM addresses these issues by allowing local GMM models, trained independently on client devices, to be aggregated through a single communication round. This approach leverages the generative property of GMMs, enabling the creation of a synthetic dataset on the server side to train a global model efficiently. Evaluation across diverse datasets covering image, tabular, and time series data demonstrates that FedGenGMM consistently achieves performance comparable to non-federated and iterative federated methods, even under significant data heterogeneity. Additionally, FedGenGMM significantly reduces communication overhead, maintains robust performance in anomaly detection tasks, and offers flexibility in local model complexities, making it particularly suitable for edge computing environments.

[LG-14] DRAUN: An Algorithm-Agnostic Data Reconstruction Attack on Federated Unlearning Systems

链接: https://arxiv.org/abs/2506.01777
作者: Hithem Lamri,Manaar Alam,Haiyan Jiang,Michail Maniatakos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Unlearning (FU) enables clients to remove the influence of specific data from a collaboratively trained shared global model, addressing regulatory requirements such as GDPR and CCPA. However, this unlearning process introduces a new privacy risk: A malicious server may exploit unlearning updates to reconstruct the data requested for removal, a form of Data Reconstruction Attack (DRA). While DRAs for machine unlearning have been studied extensively in centralized Machine Learning-as-a-Service (MLaaS) settings, their applicability to FU remains unclear due to the decentralized, client-driven nature of FU. This work presents DRAUN, the first attack framework to reconstruct unlearned data in FU systems. DRAUN targets optimization-based unlearning methods, which are widely adopted for their efficiency. We theoretically demonstrate why existing DRAs targeting machine unlearning in MLaaS fail in FU and show how DRAUN overcomes these limitations. We validate our approach through extensive experiments on four datasets and four model architectures, evaluating its performance against five popular unlearning methods, effectively demonstrating that state-of-the-art FU methods remain vulnerable to DRAs.

[LG-15] Data-assimilated model-informed reinforcement learning

链接: https://arxiv.org/abs/2506.01755
作者: Defne E. Ozan,Andrea Nóvoa,Georgios Rigas,Luca Magri
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The control of spatio-temporally chaos is challenging because of high dimensionality and unpredictability. Model-free reinforcement learning (RL) discovers optimal control policies by interacting with the system, typically requiring observations of the full physical this http URL practice, sensors often provide only partial and noisy measurements (observations) of the system. The objective of this paper is to develop a framework that enables the control of chaotic systems with partial and noisy observability. The proposed method, data-assimilated model-informed reinforcement learning (DA-MIRL), integrates (i) low-order models to approximate high-dimensional dynamics; (ii) sequential data assimilation to correct the model prediction when observations become available; and (iii) an off-policy actor-critic RL algorithm to adaptively learn an optimal control strategy based on the corrected state estimates. We test DA-MIRL on the spatiotemporally chaotic solutions of the Kuramoto-Sivashinsky equation. We estimate the full state of the environment with (i) a physics-based model, here, a coarse-grained model; and (ii) a data-driven model, here, the control-aware echo state network, which is proposed in this paper. We show that DA-MIRL successfully estimates and suppresses the chaotic dynamics of the environment in real time from partial observations and approximate models. This work opens opportunities for the control of partially observable chaotic systems.

[LG-16] Automated Manifold Learning for Reduced Order Modeling

链接: https://arxiv.org/abs/2506.01741
作者: Imran Nasim,Melanie Weber
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:The problem of identifying geometric structure in data is a cornerstone of (unsupervised) learning. As a result, Geometric Representation Learning has been widely applied across scientific and engineering domains. In this work, we investigate the use of Geometric Representation Learning for the data-driven discovery of system dynamics from spatial-temporal data. We propose to encode similarity structure in such data in a spatial-temporal proximity graph, to which we apply a range of classical and deep learning-based manifold learning approaches to learn reduced order dynamics. We observe that while manifold learning is generally capable of recovering reduced order dynamics, the quality of the learned representations varies substantially across different algorithms and hyperparameter choices. This is indicative of high sensitivity to the inherent geometric assumptions of the respective approaches and suggests a need for careful hyperparameter tuning, which can be expensive in practise. To overcome these challenges, we propose a framework for Automated Manifold Learning, which selects a manifold learning approach and corresponding hyperparameter choices based on representative subsamples of the input graph. We demonstrate that the proposed framework leads to performance gains both in scalability and in the learned representations’ accuracy in capturing local and global geometric features of the underlying system dynamics.

[LG-17] When Lower-Order Terms Dominate: Adaptive Expert Algorithms for Heavy-Tailed Losses

链接: https://arxiv.org/abs/2506.01722
作者: Antoine Moulin,Emmanuel Esposito,Dirk van der Hoeven
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem setting of prediction with expert advice with possibly heavy-tailed losses, i.e.\ the only assumption on the losses is an upper bound on their second moments, denoted by \theta . We develop adaptive algorithms that do not require any prior knowledge about the range or the second moment of the losses. Existing adaptive algorithms have what is typically considered a lower-order term in their regret guarantees. We show that this lower-order term, which is often the maximum of the losses, can actually dominate the regret bound in our setting. Specifically, we show that even with small constant \theta , this lower-order term can scale as \sqrtKT , where K is the number of experts and T is the time horizon. We propose adaptive algorithms with improved regret bounds that avoid the dependence on such a lower-order term and guarantee \mathcalO(\sqrt\theta T\log(K)) regret in the worst case, and \mathcalO(\theta \log(KT)/\Delta_\min) regret when the losses are sampled i.i.d.\ from some fixed distribution, where \Delta_\min is the difference between the mean losses of the second best expert and the best expert. Additionally, when the loss function is the squared loss, our algorithm also guarantees improved regret bounds over prior results.

[LG-18] Geometry Meets Incentives: Sample-Efficient Incentivized Exploration with Linear Contexts

链接: https://arxiv.org/abs/2506.01685
作者: Benjamin Schiffer,Mark Sellke
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the incentivized exploration model, a principal aims to explore and learn over time by interacting with a sequence of self-interested agents. It has been recently understood that the main challenge in designing incentive-compatible algorithms for this problem is to gather a moderate amount of initial data, after which one can obtain near-optimal regret via posterior sampling. With high-dimensional contexts, however, this \emphinitial exploration phase requires exponential sample complexity in some cases, which prevents efficient learning unless initial data can be acquired exogenously. We show that these barriers to exploration disappear under mild geometric conditions on the set of available actions, in which case incentive-compatibility does not preclude regret-optimality. Namely, we consider the linear bandit model with actions in the Euclidean unit ball, and give an incentive-compatible exploration algorithm with sample complexity that scales polynomially with the dimension and other parameters.

[LG-19] Minimal Impact ControlNet: Advancing Multi-ControlNet Integration ICLR2025

链接: https://arxiv.org/abs/2506.01672
作者: Shikun Sun,Min Zhou,Zixuan Wang,Xubin Li,Tiezheng Ge,Zijie Ye,Xiaoyu Qin,Junliang Xing,Bo Zheng,Jia Jia
类目: Machine Learning (cs.LG)
*备注: ICLR 2025

点击查看摘要

Abstract:With the advancement of diffusion models, there is a growing demand for high-quality, controllable image generation, particularly through methods that utilize one or multiple control signals based on ControlNet. However, in current ControlNet training, each control is designed to influence all areas of an image, which can lead to conflicts when different control signals are expected to manage different parts of the image in practical applications. This issue is especially pronounced with edge-type control conditions, where regions lacking boundary information often represent low-frequency signals, referred to as silent control signals. When combining multiple ControlNets, these silent control signals can suppress the generation of textures in related areas, resulting in suboptimal outcomes. To address this problem, we propose Minimal Impact ControlNet. Our approach mitigates conflicts through three key strategies: constructing a balanced dataset, combining and injecting feature signals in a balanced manner, and addressing the asymmetry in the score function’s Jacobian matrix induced by ControlNet. These improvements enhance the compatibility of control signals, allowing for freer and more harmonious generation in areas with silent control signals.

[LG-20] Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning

链接: https://arxiv.org/abs/2506.01656
作者: Ryotaro Kawata,Kohsei Matsutani,Yuri Kinoshita,Naoki Nishikawa,Taiji Suzuki
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical understanding of this architecture is falling behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE following the stochastic gradient descent (SGD) when learning a regression task with an underlying cluster structure of single index models. On the one hand, we prove that a vanilla neural network fails in detecting such a latent organization as it can only process the problem as a whole. This is intrinsically related to the concept of information exponent which is low for each cluster, but increases when we consider the entire task. On the other hand, we show that a MoE succeeds in dividing this problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.

[LG-21] Interpretable reinforcement learning for heat pump control through asymmetric differentiable decision trees

链接: https://arxiv.org/abs/2506.01641
作者: Toon Van Puyvelde,Mehran Zareh,Chris Develder
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, conference

点击查看摘要

Abstract:In recent years, deep reinforcement learning (DRL) algorithms have gained traction in home energy management systems. However, their adoption by energy management companies remains limited due to the black-box nature of DRL, which fails to provide transparent decision-making feedback. To address this, explainable reinforcement learning (XRL) techniques have emerged, aiming to make DRL decisions more transparent. Among these, soft differential decision tree (DDT) distillation provides a promising approach due to the clear decision rules they are based on, which can be efficiently computed. However, achieving high performance often requires deep, and completely full, trees, which reduces interpretability. To overcome this, we propose a novel asymmetric soft DDT construction method. Unlike traditional soft DDTs, our approach adaptively constructs trees by expanding nodes only when necessary. This improves the efficient use of decision nodes, which require a predetermined depth to construct full symmetric trees, enhancing both interpretability and performance. We demonstrate the potential of asymmetric DDTs to provide transparent, efficient, and high-performing decision-making in home energy management systems.

[LG-22] Riemannian Time Warping: Multiple Sequence Alignment in Curved Spaces

链接: https://arxiv.org/abs/2506.01635
作者: Julian Richter,Christopher Erdös,Christian Scheurer,Jochen J. Steil,Niels Dehio
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal alignment of multiple signals through time warping is crucial in many fields, such as classification within speech recognition or robot motion learning. Almost all related works are limited to data in Euclidean space. Although an attempt was made in 2011 to adapt this concept to unit quaternions, a general extension to Riemannian manifolds remains absent. Given its importance for numerous applications in robotics and beyond, we introduce Riemannian Time Warping~(RTW). This novel approach efficiently aligns multiple signals by considering the geometric structure of the Riemannian manifold in which the data is embedded. Extensive experiments on synthetic and real-world data, including tests with an LBR iiwa robot, demonstrate that RTW consistently outperforms state-of-the-art baselines in both averaging and classification tasks.

[LG-23] Connecting Neural Models Latent Geometries with Relative Geodesic Representations

链接: https://arxiv.org/abs/2506.01599
作者: Hanlin Yu,Berfin Inal,Georgios Arvanitidis,Soren Hauberg,Francesco Locatello,Marco Fumero
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural models learn representations of high-dimensional data on low-dimensional manifolds. Multiple factors, including stochasticities in the training process, model architectures, and additional inductive biases, may induce different representations, even when learning the same task on the same data. However, it has recently been shown that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions. Building on this idea, we demonstrate that exploiting the differential-geometric structure of latent spaces of neural models, it is possible to capture precisely the transformations between representational spaces trained on similar data distributions. Specifically, we assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric that captures the intrinsic structure of the latent space, while scaling efficiently to large models. We validate experimentally our method on model stitching and retrieval tasks, covering autoencoders and vision foundation discriminative models, across diverse architectures, datasets, and pretraining schemes.

[LG-24] PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations

链接: https://arxiv.org/abs/2506.01598
作者: Jin Song,Kenji Kawaguchi,Zhenya Yan
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 27 pages, 12 figures

点击查看摘要

Abstract:Neural operators, which aim to approximate mappings between infinite-dimensional function spaces, have been widely applied in the simulation and prediction of physical systems. However, the limited representational capacity of network architectures, combined with their heavy reliance on large-scale data, often hinder effective training and result in poor extrapolation performance. In this paper, inspired by traditional numerical methods, we propose a novel physics guided multi-step neural operator (PMNO) architecture to address these challenges in long-horizon prediction of complex physical systems. Distinct from general operator learning methods, the PMNO framework replaces the single-step input with multi-step historical data in the forward pass and introduces an implicit time-stepping scheme based on the Backward Differentiation Formula (BDF) during backpropagation. This design not only strengthens the model’s extrapolation capacity but also facilitates more efficient and stable training with fewer data samples, especially for long-term predictions. Meanwhile, a causal training strategy is employed to circumvent the need for multi-stage training and to ensure efficient end-to-end optimization. The neural operator architecture possesses resolution-invariant properties, enabling the trained model to perform fast extrapolation on arbitrary spatial resolutions. We demonstrate the superior predictive performance of PMNO predictor across a diverse range of physical systems, including 2D linear system, modeling over irregular domain, complex-valued wave dynamics, and reaction-diffusion processes. Depending on the specific problem setting, various neural operator architectures, including FNO, DeepONet, and their variants, can be seamlessly integrated into the PMNO framework.

[LG-25] Selecting for Less Discriminatory Algorithms: A Relational Search Framework for Navigating Fairness-Accuracy Trade-offs in Practice

链接: https://arxiv.org/abs/2506.01594
作者: Hana Samad,Michael Akinwumi,Jameel Khan,Christoph Mügge-Durum,Emmanuel O. Ogundimu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 40 pages, 5 figures. Introduces a horizontal LDA search framework for relationally navigating fairness-accuracy trade-offs using 2021 HMDA data

点击查看摘要

Abstract:As machine learning models are increasingly embedded into society through high-stakes decision-making, selecting the right algorithm for a given task, audience, and sector presents a critical challenge, particularly in the context of fairness. Traditional assessments of model fairness have often framed fairness as an objective mathematical property, treating model selection as an optimization problem under idealized informational conditions. This overlooks model multiplicity as a consideration–that multiple models can deliver similar performance while exhibiting different fairness characteristics. Legal scholars have engaged this challenge through the concept of Less Discriminatory Algorithms (LDAs), which frames model selection as a civil rights obligation. In real-world deployment, this normative challenge is bounded by constraints on fairness experimentation, e.g., regulatory standards, institutional priorities, and resource capacity. Against these considerations, the paper revisits Lee and Floridi (2021)'s relational fairness approach using updated 2021 Home Mortgage Disclosure Act (HMDA) data, and proposes an expansion of the scope of the LDA search process. We argue that extending the LDA search horizontally, considering fairness across model families themselves, provides a lightweight complement, or alternative, to within-model hyperparameter optimization, when operationalizing fairness in non-experimental, resource constrained settings. Fairness metrics alone offer useful, but insufficient signals to accurately evaluate candidate LDAs. Rather, by using a horizontal LDA search approach with the relational trade-off framework, we demonstrate a responsible minimum viable LDA search on real-world lending outcomes. Organizations can modify this approach to systematically compare, evaluate, and select LDAs that optimize fairness and accuracy in a sector-based contextualized manner. Comments: 40 pages, 5 figures. Introduces a horizontal LDA search framework for relationally navigating fairness-accuracy trade-offs using 2021 HMDA data Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY) ACMclasses: I.2.1; I.2.6; K.4.1; K.5.2 Cite as: arXiv:2506.01594 [cs.LG] (or arXiv:2506.01594v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.01594 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-26] Bayes optimal learning of attention-indexed models

链接: https://arxiv.org/abs/2506.01582
作者: Fabrizio Boncoraglio,Emanuele Troiani,Vittorio Erba,Lenka Zdeborová
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in modern attention architectures.

[LG-27] Latent Space Topology Evolution in Multilayer Perceptrons

链接: https://arxiv.org/abs/2506.01569
作者: Eduardo Paluzo-Hidalgo
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:This paper introduces a topological framework for interpreting the internal representations of Multilayer Perceptrons (MLPs). We construct a simplicial tower, a sequence of simplicial complexes connected by simplicial maps, that captures how data topology evolves across network layers. Our approach enables bi-persistence analysis: layer persistence tracks topological features within each layer across scales, while MLP persistence reveals how these features transform through the network. We prove stability theorems for our topological descriptors and establish that linear separability in latent spaces is related to disconnected components in the nerve complexes. To make our framework practical, we develop a combinatorial algorithm for computing MLP persistence and introduce trajectory-based visualisations that track data flow through the network. Experiments on synthetic and real-world medical data demonstrate our method’s ability to identify redundant layers, reveal critical topological transitions, and provide interpretable insights into how MLPs progressively organise data for classification.

[LG-28] rajectory First: A Curriculum for Discovering Diverse Policies

链接: https://arxiv.org/abs/2506.01568
作者: Cornelius V. Braun,Sayantan Auddy,Marc Toussaint
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has emerged as a powerful reinforcement learning (RL) framework to train a diverse set of agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robotic manipulation, leading to a lack in policy diversity. To improve diversity optimization in RL, we therefore propose a curriculum that first explores at the trajectory level before learning step-based policies. In our empirical evaluation, we provide novel insights into the shortcoming of skill-based diversity optimization, and demonstrate empirically that our curriculum improves the diversity of the learned skills.

[LG-29] Unpacking Softmax: How Temperature Drives Representation Collapse Compression and Generalization

链接: https://arxiv.org/abs/2506.01562
作者: Wojciech Masarczyk,Mateusz Ostaszewski,Tin Sum Cheng,Tomasz Trzciński,Aurelien Lucchi,Razvan Pascanu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model’s representation. We introduce the concept of rank deficit bias - a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function’s logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.

[LG-30] o Each Metric Its Decoding: Post-Hoc Optimal Decision Rules of Probabilistic Hierarchical Classifiers ICML2025

链接: https://arxiv.org/abs/2506.01552
作者: Roman Plaud,Alexandre Perez-Lebel,Matthieu Labeau,Antoine Saillenfest,Thomas Bonald
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICML 2025

点击查看摘要

Abstract:Hierarchical classification offers an approach to incorporate the concept of mistake severity by leveraging a structured, labeled hierarchy. However, decoding in such settings frequently relies on heuristic decision rules, which may not align with task-specific evaluation metrics. In this work, we propose a framework for the optimal decoding of an output probability distribution with respect to a target metric. We derive optimal decision rules for increasingly complex prediction settings, providing universal algorithms when candidates are limited to the set of nodes. In the most general case of predicting a subset of nodes, we focus on rules dedicated to the hierarchical hF_\beta scores, tailored to hierarchical settings. To demonstrate the practical utility of our approach, we conduct extensive empirical evaluations, showcasing the superiority of our proposed optimal strategies, particularly in underdetermined scenarios. These results highlight the potential of our methods to enhance the performance and reliability of hierarchical classifiers in real-world applications. The code is available at this https URL

[LG-31] Class Incremental Learning for Algorithm Selection GECCO2025

链接: https://arxiv.org/abs/2506.01545
作者: Mate Botond Nemeth,Emma Hart,Kevin Sim,Quentin Renau
类目: Machine Learning (cs.LG)
*备注: This paper was accepted at GECCO 2025. 4 pages, 2 figures

点击查看摘要

Abstract:Algorithm selection is commonly used to predict the best solver from a portfolio per per-instance. In many real scenarios, instances arrive in a stream: new instances become available over time, while the number of class labels can also grow as new data distributions arrive downstream. As a result, the classification model needs to be periodically updated to reflect additional solvers without catastrophic forgetting of past data. In machine-learning (ML), this is referred to as Class Incremental Learning (CIL). While commonly addressed in ML settings, its relevance to algorithm-selection in optimisation has not been previously studied. Using a bin-packing dataset, we benchmark 8 continual learning methods with respect to their ability to withstand catastrophic forgetting. We find that rehearsal-based methods significantly outperform other CIL methods. While there is evidence of forgetting, the loss is small at around 7%. Hence, these methods appear to be a viable approach to continual learning in streaming optimisation scenarios.

[LG-32] mporal Variational Implicit Neural Representations

链接: https://arxiv.org/abs/2506.01544
作者: Batuhan Koyuncu,Rachael DeVries,Ole Winther,Isabel Valera
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Temporal Variational Implicit Neural Representations (TV-INRs), a probabilistic framework for modeling irregular multivariate time series that enables efficient individualized imputation and forecasting. By integrating implicit neural representations with latent variable models, TV-INRs learn distributions over time-continuous generator functions conditioned on signal-specific covariates. Unlike existing approaches that require extensive training, fine-tuning or meta-learning, our method achieves accurate individualized predictions through a single forward pass. Our experiments demonstrate that with a single TV-INRs instance, we can accurately solve diverse imputation and forecasting tasks, offering a computationally efficient and scalable solution for real-world applications. TV-INRs excel especially in low-data regimes, where it outperforms existing methods by an order of magnitude in mean squared error for imputation task.

[LG-33] Adaptive Destruction Processes for Diffusion Samplers

链接: https://arxiv.org/abs/2506.01541
作者: Timofei Gritsaev,Nikita Morozov,Kirill Tamogashev,Daniil Tiapkin,Sergey Samsonov,Alexey Naumov,Dmitry Vetrov,Nikolay Malkin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper explores the challenges and benefits of a trainable destruction process in diffusion samplers – diffusion-based generative models trained to sample an unnormalised density without access to data samples. Contrary to the majority of work that views diffusion samplers as approximations to an underlying continuous-time model, we view diffusion models as discrete-time policies trained to produce samples in very few generation steps. We propose to trade some of the elegance of the underlying theory for flexibility in the definition of the generative and destruction policies. In particular, we decouple the generation and destruction variances, enabling both transition kernels to be learned as unconstrained Gaussian densities. We show that, when the number of steps is limited, training both generation and destruction processes results in faster convergence and improved sampling quality on various benchmarks. Through a robust ablation study, we investigate the design choices necessary to facilitate stable training. Finally, we show the scalability of our approach through experiments on GAN latent space sampling for conditional image generation.

[LG-34] Learning Abstract World Models with a Group-Structured Latent Space

链接: https://arxiv.org/abs/2506.01529
作者: Thomas Delliaux,Nguyen-Khanh Vu,Vincent François-Lavet,Elise van der Pol,Emmanuel Rachelson
类目: Machine Learning (cs.LG)
*备注: 20 pages, 18 figures

点击查看摘要

Abstract:Learning meaningful abstract models of Markov Decision Processes (MDPs) is crucial for improving generalization from limited data. In this work, we show how geometric priors can be imposed on the low-dimensional representation manifold of a learned transition model. We incorporate known symmetric structures via appropriate choices of the latent space and the associated group actions, which encode prior knowledge about invariances in the environment. In addition, our framework allows the embedding of additional unstructured information alongside these symmetries. We show experimentally that this leads to better predictions of the latent transition model than fully unstructured approaches, as well as better learning on downstream RL tasks, in environments with rotational and translational features, including in first-person views of 3D environments. Additionally, our experiments show that this leads to simpler and more disentangled representations. The full code is available on GitHub to ensure reproducibility.

[LG-35] Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

链接: https://arxiv.org/abs/2506.01523
作者: Jihun Yun,Juno Kim,Jongho Park,Junhyuck Kim,Jongha Jon Ryu,Jaewoong Cho,Kwang-Sung Jun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 7 tables

点击查看摘要

Abstract:Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as `loss + regularization,’ the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Policy Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as \emphdistribution learning from pairwise preference feedback by explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic O(1/n) convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performances of RLHF and DPO across various tasks and models.

[LG-36] Beyond Diagonal Covariance: Flexible Posterior VAEs via Free-Form Injective Flows

链接: https://arxiv.org/abs/2506.01522
作者: Peter Sorrenson,Lukas Lührs,Hans Olischläger,Ullrich Köthe
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs) are powerful generative models widely used for learning interpretable latent spaces, quantifying uncertainty, and compressing data for downstream generative tasks. VAEs typically rely on diagonal Gaussian posteriors due to computational constraints. Using arguments grounded in differential geometry, we demonstrate inherent limitations in the representational capacity of diagonal covariance VAEs, as illustrated by explicit low-dimensional examples. In response, we show that a regularized variant of the recently introduced Free-form Injective Flow (FIF) can be interpreted as a VAE featuring a highly flexible, implicitly defined posterior. Crucially, this regularization yields a posterior equivalent to a full Gaussian covariance distribution, yet maintains computational costs comparable to standard diagonal covariance VAEs. Experiments on image datasets validate our approach, demonstrating that incorporating full covariance substantially improves model likelihood.

[LG-37] Analyzing the Importance of Blank for CTC-Based Knowledge Distillation INTERSPEECH2025

链接: https://arxiv.org/abs/2506.01503
作者: Benedikt Hilmes,Nick Rossenbach,Ralf Schlüter
类目: Machine Learning (cs.LG)
*备注: Accepted for Interspeech 2025

点击查看摘要

Abstract:With the rise of large pre-trained foundation models for automatic speech recognition new challenges appear. While the performance of these models is good, runtime and cost of inference increases. One approach to make use of their strength while retaining efficiency is to distill their knowledge to smaller models during training. In this work, we explore different CTC-based distillation variants, focusing on blank token handling. We show that common approaches like blank elimination do not always work off the shelf. We explore new blank selection patterns as a potential sweet spot between standard knowledge distillation and blank elimination mechanisms. Through the introduction of a symmetric selection method, we are able to remove the CTC loss during knowledge distillation with minimal to no performance degradation. With this, we make the training independent from target labels, potentially allowing for distillation on untranscribed audio data.

[LG-38] SpiceMixer - Netlist-Level Circuit Evolution

链接: https://arxiv.org/abs/2506.01497
作者: Stefan Uhlich,Andrea Bonetti,Arun Venkitaraman,Chia-Yu Hsieh,Mustafa Emre Gürsoy,Ryoga Matsuo,Lorenzo Servadei
类目: Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces SpiceMixer, a genetic algorithm developed to synthesize novel analog circuits by evolving SPICE netlists. Unlike conventional methods, SpiceMixer operates directly on netlist lines, enabling compatibility with any component or subcircuit type and supporting general-purpose genetic operations. By using a normalized netlist format, the algorithm enhances the effectiveness of its genetic operators: crossover, mutation, and pruning. We show that SpiceMixer achieves superior performance in synthesizing standard cells (inverter, two-input NAND, and latch) and in designing an analog classifier circuit for the Iris dataset, reaching an accuracy of 89% on the test set. Across all evaluated tasks, SpiceMixer consistently outperforms existing synthesis methods.

[LG-39] Confidence-Aware Self-Distillation for Multimodal Sentiment Analysis with Incomplete Modalities

链接: https://arxiv.org/abs/2506.01490
作者: Yanxi Luo,Shijin Wang,Zhongxing Xu,Yulong Li,Feilong Tang,Jionglong Su
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal sentiment analysis (MSA) aims to understand human sentiment through multimodal data. In real-world scenarios, practical factors often lead to uncertain modality missingness. Existing methods for handling modality missingness are based on data reconstruction or common subspace projections. However, these methods neglect the confidence in multimodal combinations and impose constraints on intra-class representation, hindering the capture of modality-specific information and resulting in suboptimal performance. To address these challenges, we propose a Confidence-Aware Self-Distillation (CASD) strategy that effectively incorporates multimodal probabilistic embeddings via a mixture of Student’s t -distributions, enhancing its robustness by incorporating confidence and accommodating heavy-tailed properties. This strategy estimates joint distributions with uncertainty scores and reduces uncertainty in the student network by consistency distillation. Furthermore, we introduce a reparameterization representation module that facilitates CASD in robust multimodal learning by sampling embeddings from the joint distribution for the prediction module to calculate the task loss. As a result, the directional constraint from the loss minimization is alleviated by the sampled representation. Experimental results on three benchmark datasets demonstrate that our method achieves state-of-the-art performance.

[LG-40] Model-agnostic Mitigation Strategies of Data Imbalance for Regression

链接: https://arxiv.org/abs/2506.01486
作者: Jelke Wibbeke,Sebastian Rohjans,Andreas Rauh
类目: Machine Learning (cs.LG)
*备注: 34 pages, 11 figures, to be submitted to Springer Nature Machine Learning

点击查看摘要

Abstract:Data imbalance persists as a pervasive challenge in regression tasks, introducing bias in model performance and undermining predictive reliability. This is particularly detrimental in applications aimed at predicting rare events that fall outside the domain of the bulk of the training data. In this study, we review the current state-of-the-art regarding sampling-based methods and cost-sensitive learning. Additionally, we propose novel approaches to mitigate model bias. To better asses the importance of data, we introduce the density-distance and density-ratio relevance functions, which effectively integrate empirical frequency of data with domain-specific preferences, offering enhanced interpretability for end-users. Furthermore, we present advanced mitigation techniques (cSMOGN and crbSMOGN), which build upon and improve existing sampling methods. In a comprehensive quantitative evaluation, we benchmark state-of-the-art methods on 10 synthetic and 42 real-world datasets, using neural networks, XGBoosting trees and Random Forest models. Our analysis reveals that while most strategies improve performance on rare samples, they often degrade it on frequent ones. We demonstrate that constructing an ensemble of models – one trained with imbalance mitigation and another without – can significantly reduce these negative effects. The key findings underscore the superior performance of our novel crbSMOGN sampling technique with the density-ratio relevance function for neural networks, outperforming state-of-the-art methods.

[LG-41] Feature-aware Hypergraph Generation via Next-Scale Prediction

链接: https://arxiv.org/abs/2506.01467
作者: Dorian Gailhard,Enzo Tartaglione,Lirida Naviner,Jhony H. Giraldo
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:

点击查看摘要

Abstract:Hypergraphs generalize traditional graphs by allowing hyperedges to connect multiple nodes, making them well-suited for modeling complex structures with higher-order relationships, such as 3D meshes, molecular systems, and electronic circuits. While topology is central to hypergraph structure, many real-world applications also require node and hyperedge features. Existing hypergraph generation methods focus solely on topology, often overlooking feature modeling. In this work, we introduce FAHNES (feature-aware hypergraph generation via next-scale prediction), a hierarchical approach that jointly generates hypergraph topology and features. FAHNES builds a multi-scale representation through node coarsening, then learns to reconstruct finer levels via localized expansion and refinement, guided by a new node budget mechanism that controls cluster splitting. We evaluate FAHNES on synthetic hypergraphs, 3D meshes, and molecular datasets. FAHNES achieves competitive results in reconstructing topology and features, establishing a foundation for future research in featured hypergraph generative modeling.

[LG-42] Self-supervised Latent Space Optimization with Nebula Variational Coding

链接: https://arxiv.org/abs/2506.01414
作者: Yida Wang,David Joseph Tan,Nassir Navab,Federico Tombari
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Deep learning approaches process data in a layer-by-layer way with intermediate (or latent) features. We aim at designing a general solution to optimize the latent manifolds to improve the performance on classification, segmentation, completion and/or reconstruction through probabilistic models. This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called \textbfnebula anchors, that guide the latent variables to form clusters during training. To prevent the anchors from clustering among themselves, we employ the variational constraint that enforces the latent features within an anchor to form a Gaussian distribution, resulting in a generative model we refer as Nebula Variational Coding (NVC). Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit. As a consequence, the latent variables of our variational coder form clusters which adapt to the generated semantic of the training data, \textite.g. the categorical labels of each sample. We demonstrate experimentally that it can be used within different architectures designed to solve different problems including text sequence, images, 3D point clouds and volumetric data, validating the advantage of our proposed method.

[LG-43] SOC-DGL: Social Interaction Behavior Inspired Dual Graph Learning Framework for Drug-Target Interaction Identification

链接: https://arxiv.org/abs/2506.01405
作者: Xiang Zhao,Ruijie Li,Qiao Ning,Shikai Guo,Hui Li,Qian Ma
类目: Machine Learning (cs.LG)
*备注: 14 pages, 17 figures (including subfigures), 4 tables. Xiang Zhao and Ruijie Li contributed equally to this work and should be considered co-first authors. The source code and datasets are available at this https URL

点击查看摘要

Abstract:The identification of drug-target interactions (DTI) is crucial for drug discovery and repositioning, as it reveals potential uses of existing drugs, aiding in the acceleration of the drug development process and reducing associated costs. Despite the similarity information in DTI is important, most models are limited to mining direct similarity information within homogeneous graphs, overlooking the potential yet rich similarity information in heterogeneous graphs. Inspired by real-world social interaction behaviors, we propose SOC-DGL, which comprises two specialized modules: the Affinity-Driven Graph Learning (ADGL) module and the Equilibrium-Driven Graph Learning (EDGL) module. The ADGL module adopts a comprehensive social interaction strategy, leveraging an affinity-enhanced global drug-target graph to learn both global DTI and the individual similarity information of drugs and targets. In contrast, the EDGL module employs a higher-order social interaction strategy, amplifying the influence of even-hop neighbors through an even-polynomial graph filter grounded in balance theory, enabling the indirect mining of higher-order homogeneous information. This dual approach enables SOC-DGL to effectively and comprehensively capture similarity information across diverse interaction scales within the affinity matrices and drug-target association matrices, significantly enhancing the model’s generalization capability and predictive accuracy in DTI tasks. To address the issue of imbalance in drug-target interaction datasets, this paper proposes an adjustable imbalance loss function that mitigates the impact of sample imbalance by adjusting the weight of negative samples and a parameter. Extensive experiments on four benchmark datasets demonstrate significant accuracy improvements achieved by SOC-DGL, particularly in scenarios involving data imbalance and unseen drugs or targets.

[LG-44] Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs ICASSP ICASSP49660

链接: https://arxiv.org/abs/2506.01404
作者: Xue Xian Zheng,Weihang Liu,Xin Lou,Stefan Vlaski,Tareq Al-Naffouri
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: Journal Paper from ICASSP https://doi.org/10.1109/ICASSP49660.2025.10888821

点击查看摘要

Abstract:This paper introduces an innovative error feedback framework designed to mitigate quantization noise in distributed graph filtering, where communications are constrained to quantized messages. It comes from error spectrum shaping techniques from state-space digital filters, and therefore establishes connections between quantized filtering processes over different domains. In contrast to existing error compensation methods, our framework quantitatively feeds back the quantization noise for exact compensation. We examine the framework under three key scenarios: (i) deterministic graph filtering, (ii) graph filtering over random graphs, and (iii) graph filtering with random node-asynchronous updates. Rigorous theoretical analysis demonstrates that the proposed framework significantly reduces the effect of quantization noise, and we provide closed-form solutions for the optimal error feedback coefficients. Moreover, this quantitative error feedback mechanism can be seamlessly integrated into communication-efficient decentralized optimization frameworks, enabling lower error floors. Numerical experiments validate the theoretical results, consistently showing that our method outperforms conventional quantization strategies in terms of both accuracy and robustness.

[LG-45] Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping NEURIPS2025

链接: https://arxiv.org/abs/2506.01396
作者: Linzh Zhao(1),Aki Rehn(1),Mikko A. Heikkilä(1),Razane Tajeddine(2),Antti Honkela(1) ((1) Department of Computer Science, University of Helsinki, Finland, (2) Department of Electrical and Computer Engineering, American University of Beirut, Lebanon)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: NeurIPS 2025 under review. 22 pages, 8 figures

点击查看摘要

Abstract:Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves the accuracy of the worst-performing class on average over 10 percentage points on skewed MNIST and Fashion MNIST compared to the unbounded adaptive clipping, and over 5 percentage points over constant clipping.

[LG-46] Improved Regret Bounds for Gaussian Process Upper Confidence Bound in Bayesian Optimization

链接: https://arxiv.org/abs/2506.01393
作者: Shogo Iwazaki
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 37 pages

点击查看摘要

Abstract:This paper addresses the Bayesian optimization problem (also referred to as the Bayesian setting of the Gaussian process bandit), where the learner seeks to minimize the regret under a function drawn from a known Gaussian process (GP). Under a Matérn kernel with a certain degree of smoothness, we show that the Gaussian process upper confidence bound (GP-UCB) algorithm achieves \tildeO(\sqrtT) cumulative regret with high probability. Furthermore, our analysis yields O(\sqrtT \ln^4 T) regret under a squared exponential kernel. These results fill the gap between the existing regret upper bound for GP-UCB and the best-known bound provided by Scarlett (2018). The key idea in our proof is to capture the concentration behavior of the input sequence realized by GP-UCB, enabling a more refined analysis of the GP’s information gain.

[LG-47] Multi Part Deployment of Neural Network

链接: https://arxiv.org/abs/2506.01387
作者: Paritosh Ranjan,Surajit Majumder,Prodip Roy
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 7 pages, 1 figures

点击查看摘要

Abstract:The increasing scale of modern neural networks, exemplified by architectures from IBM (530 billion neurons) and Google (500 billion parameters), presents significant challenges in terms of computational cost and infrastructure requirements. As deep neural networks continue to grow, traditional training paradigms relying on monolithic GPU clusters become increasingly unsustainable. This paper proposes a distributed system architecture that partitions a neural network across multiple servers, each responsible for a subset of neurons. Neurons are classified as local or remote, with inter-server connections managed via a metadata-driven lookup mechanism. A Multi-Part Neural Network Execution Engine facilitates seamless execution and training across distributed partitions by dynamically resolving and invoking remote neurons using stored metadata. All servers share a unified model through a network file system (NFS), ensuring consistency during parallel updates. A Neuron Distributor module enables flexible partitioning strategies based on neuron count, percentage, identifiers, or network layers. This architecture enables cost-effective, scalable deployment of deep learning models on cloud infrastructure, reducing dependency on high-performance centralized compute resources.

[LG-48] hinkEval: Practical Evaluation of Knowledge Preservation and Consistency in LLM Editing with Thought-based Knowledge Graphs

链接: https://arxiv.org/abs/2506.01386
作者: Manit Baser,Dinil Mon Divakaran,Mohan Gurusamy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model editing has become an important tool for addressing privacy, bias, and misinformation in large language models (LLMs) by enabling updates to knowledge without the need for retraining from scratch. However, existing editing techniques often target isolated facts, ignoring ripple effects on related knowledge, allowing edited facts to remain deducible and compromising broader contextual integrity. For example, changing Harry Potter’s school from Hogwarts to Ilvermorny requires reassigning his house from Gryffindor to a suitable alternative while preserving Gryffindor’s relationship with Hogwarts. In this work, we present a new model-editing setting, deep editing, to show: (1) how editing techniques fail to handle connected facts, evaluating how original knowledge sneaks through unchanged causal links, and (2) their impact on broader contextual knowledge. We introduce ThinkEval, a framework to systematically evaluate model- editing techniques by building model-specific knowledge graphs to analyze pre- and post-edit effects on fact persistence and catastrophic forgetting. We present KnowGIC, a benchmark created with ThinkEval, consisting of sequentially linked queries to measure these effects. We evaluate five editing techniques: AlphaEdit, RECT, ROME, MEMIT, and PRUNE across multiple LLMs. We find that these techniques struggle to balance indirect fact suppression with the preservation of related knowledge. Our dataset is available at: this https URL.

[LG-49] Modeling All-Atom Glycan Structures via Hierarchical Message Passing and Multi-Scale Pre-training ICML2025

链接: https://arxiv.org/abs/2506.01376
作者: Minghao Xu,Jiaze Song,Keming Wu,Xiangxin Zhou,Bin Cui,Wentao Zhang
类目: Machine Learning (cs.LG)
*备注: Published at ICML 2025. All code and data are released

点击查看摘要

Abstract:Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while they neglected the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. We fill this blank by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture from local atomic-level interactions to global monosaccharide-level interactions. To further enhance model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset, deriving the PreGlycanAA model. We design a multi-scale mask prediction algorithm to endow the model about different levels of dependencies in a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA. We maintain all resources at this https URL

[LG-50] meGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery KDD2025

链接: https://arxiv.org/abs/2506.01361
作者: Muhammad Hasan Ferdous,Emam Hossain,Md Osman Gani
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注: 11 pages, 4 figures, accepted at KDD 2025 (Datasets and Benchmarks Track)

点击查看摘要

Abstract:Robust causal discovery in time series datasets depends on reliable benchmark datasets with known ground-truth causal relationships. However, such datasets remain scarce, and existing synthetic alternatives often overlook critical temporal properties inherent in real-world data, including nonstationarity driven by trends and seasonality, irregular sampling intervals, and the presence of unobserved confounders. To address these challenges, we introduce TimeGraph, a comprehensive suite of synthetic time-series benchmark datasets that systematically incorporates both linear and nonlinear dependencies while modeling key temporal characteristics such as trends, seasonal effects, and heterogeneous noise patterns. Each dataset is accompanied by a fully specified causal graph featuring varying densities and diverse noise distributions and is provided in two versions: one including unobserved confounders and one without, thereby offering extensive coverage of real-world complexity while preserving methodological neutrality. We further demonstrate the utility of TimeGraph through systematic evaluations of state-of-the-art causal discovery algorithms including PCMCI+, LPCMCI, and FGES across a diverse array of configurations and metrics. Our experiments reveal significant variations in algorithmic performance under realistic temporal conditions, underscoring the need for robust synthetic benchmarks in the fair and transparent assessment of causal discovery methods. The complete TimeGraph suite, including dataset generation scripts, evaluation metrics, and recommended experimental protocols, is freely available to facilitate reproducible research and foster community-driven advancements in time-series causal discovery.

[LG-51] RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases

链接: https://arxiv.org/abs/2506.01360
作者: Dongwon Choi,Sunwoo Kim,Juyeon Kim,Kyungho Kim,Geon Lee,Shinhwan Kang,Myunghwan Kim,Kijung Shin
类目: Machine Learning (cs.LG)
*备注: Code and datasets are in this https URL

点击查看摘要

Abstract:Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultimately leading to enhanced performance across diverse tasks. However, there are numerous ways to model RDBs as graphs, and performance varies significantly depending on the chosen graph model. In our analysis, applying a common heuristic rule for graph modeling leads to up to a 10% drop in performance compared to the best-performing graph model, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 9 automatic RDB-to-graph modeling methods on the 12 tasks over 600x faster than on-the-fly evaluation, which requires repeated model training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.

[LG-52] wo-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion

链接: https://arxiv.org/abs/2506.01356
作者: Haoyu Li,Xiangru Zhong,Bin Hu,Huan Zhang
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimations of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their framework. In this work, we propose a novel two-stage training framework to jointly synthesize the controller and Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training data sampling strategy and a domain updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend state-of-the-art neural network verifier \alpha,!\beta -CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield region of attractions with volume 5 - 1.5\cdot 10^5 times larger compared to the baselines, and our verification on continuous systems can be up to 40-10000 times faster compared to the traditional SMT solver dReal. Our code is available at this https URL.

[LG-53] AH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

链接: https://arxiv.org/abs/2506.01352
作者: Guangxin He,Yuan Cao,Yutong He,Tianyi Bai,Kun Yuan,Binhang Yuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants but faces significant network communication bottlenecks, particularly in pipeline-parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. Existing activation compression methods, such as AQ-SGD, mitigate quantization-induced errors through error compensation but impose prohibitive memory overhead by requiring storage of previous activations. To address these issues, we introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers. We further provide a theoretical analysis, proving that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of \mathcalO(1/\sqrtT) , matching that of vanilla stochastic gradient descent. Extensive experiments on diverse LLM tasks demonstrate that TAH-Quant achieves aggressive activation quantization (3-4 bits) ratio, which provides up to 4.3 \times end-to-end speedup without compromising training convergence, matches state-of-the-art methods, incurs no extra memory overhead, and generalizes well across different training scenarios.

[LG-54] Variational Adaptive Noise and Dropout towards Stable Recurrent Neural Networks

链接: https://arxiv.org/abs/2506.01350
作者: Taisuke Kobayashi,Shingo Murata
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 6 figures (accepted in ICDL2025)

点击查看摘要

Abstract:This paper proposes a novel stable learning theory for recurrent neural networks (RNNs), so-called variational adaptive noise and dropout (VAND). As stabilizing factors for RNNs, noise and dropout on the internal state of RNNs have been separately confirmed in previous studies. We reinterpret the optimization problem of RNNs as variational inference, showing that noise and dropout can be derived simultaneously by transforming the explicit regularization term arising in the optimization problem into implicit regularization. Their scale and ratio can also be adjusted appropriately to optimize the main objective of RNNs, respectively. In an imitation learning scenario with a mobile manipulator, only VAND is able to imitate sequential and periodic behaviors as instructed. this https URL

[LG-55] Distributionally Robust Learning in Survival Analysis

链接: https://arxiv.org/abs/2506.01348
作者: Yeping Jin,Lauren Wise,Ioannis Paschalidis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce an innovative approach that incorporates a Distributionally Robust Learning (DRL) approach into Cox regression to enhance the robustness and accuracy of survival predictions. By formulating a DRL framework with a Wasserstein distance-based ambiguity set, we develop a variant Cox model that is less sensitive to assumptions about the underlying data distribution and more resilient to model misspecification and data perturbations. By leveraging Wasserstein duality, we reformulate the original min-max DRL problem into a tractable regularized empirical risk minimization problem, which can be computed by exponential conic programming. We provide guarantees on the finite sample behavior of our DRL-Cox model. Moreover, through extensive simulations and real world case studies, we demonstrate that our regression model achieves superior performance in terms of prediction accuracy and robustness compared with traditional methods.

[LG-56] Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning ICML2025

链接: https://arxiv.org/abs/2506.01339
作者: Changsheng Wang,Yihua Zhang,Jinghan Jia,Parikshit Ram,Dennis Wei,Yuguang Yao,Soumyadeep Pal,Nathalie Baracaldo,Sijia Liu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2025

点击查看摘要

Abstract:Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs) by selectively removing targeted knowledge while preserving utility. However, current methods are highly sensitive to downstream fine-tuning, which can quickly recover forgotten information-even from unrelated tasks. To address this, we introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). Building on this principle, we propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness. Notably, ILU generalizes well to diverse fine-tuning tasks, even when trained using a single dataset. A task vector analysis is also provided to further elucidate the rationale behind ILU’s effectiveness. Extensive experiments on the WMDP and MUSE benchmark, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.

[LG-57] Energy Considerations for Large Pretrained Neural Networks

链接: https://arxiv.org/abs/2506.01311
作者: Leo Mei,Mark Stamp
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Increasingly complex neural network architectures have achieved phenomenal performance. However, these complex models require massive computational resources that consume substantial amounts of electricity, which highlights the potential environmental impact of such models. Previous studies have demonstrated that substantial redundancies exist in large pre-trained models. However, previous work has primarily focused on compressing models while retaining comparable model performance, and the direct impact on electricity consumption appears to have received relatively little attention. By quantifying the energy usage associated with both uncompressed and compressed models, we investigate compression as a means of reducing electricity consumption. We consider nine different pre-trained models, ranging in size from 8M parameters to 138M parameters. To establish a baseline, we first train each model without compression and record the electricity usage and time required during training, along with other relevant statistics. We then apply three compression techniques: Steganographic capacity reduction, pruning, and low-rank factorization. In each of the resulting cases, we again measure the electricity usage, training time, model accuracy, and so on. We find that pruning and low-rank factorization offer no significant improvements with respect to energy usage or other related statistics, while steganographic capacity reduction provides major benefits in almost every case. We discuss the significance of these findings.

[LG-58] Latent Structured Hopfield Network for Semantic Association and Retrieval

链接: https://arxiv.org/abs/2506.01303
作者: Chong Li,Xiangyang Xue,Jianfeng Feng,Taiping Zeng
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms.

[LG-59] Recent Developments in GNNs for Drug Discovery

链接: https://arxiv.org/abs/2506.01302
作者: Zhengyu Fang,Xiaoge Zhang,Anyin Zhao,Xiao Li,Huiyuan Chen,Jing Li
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:In this paper, we review recent developments and the role of Graph Neural Networks (GNNs) in computational drug discovery, including molecule generation, molecular property prediction, and drug-drug interaction prediction. By summarizing the most recent developments in this area, we underscore the capabilities of GNNs to comprehend intricate molecular patterns, while exploring both their current and prospective applications. We initiate our discussion by examining various molecular representations, followed by detailed discussions and categorization of existing GNN models based on their input types and downstream application tasks. We also collect a list of commonly used benchmark datasets for a variety of applications. We conclude the paper with brief discussions and summarize common trends in this important research area.

[LG-60] he Actor-Critic Update Order Matters for PPO in Federated Reinforcement Learning

链接: https://arxiv.org/abs/2506.01261
作者: Zhijie Xie,Shenghui Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the context of Federated Reinforcement Learning (FRL), applying Proximal Policy Optimization (PPO) faces challenges related to the update order of its actor and critic due to the aggregation step occurring between successive iterations. In particular, when local actors are updated based on local critic estimations, the algorithm becomes vulnerable to data heterogeneity. As a result, the conventional update order in PPO (critic first, then actor) may cause heterogeneous gradient directions among clients, hindering convergence to a globally optimal policy. To address this issue, we propose FedRAC, which reverses the update order (actor first, then critic) to eliminate the divergence of critics from different clients. Theoretical analysis shows that the convergence bound of FedRAC is immune to data heterogeneity under mild conditions, i.e., bounded level of heterogeneity and accurate policy evaluation. Empirical results indicate that the proposed algorithm obtains higher cumulative rewards and converges more rapidly in five experiments, including three classical RL environments and a highly heterogeneous autonomous driving scenario using the SUMO traffic simulator.

[LG-61] Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

链接: https://arxiv.org/abs/2506.01260
作者: Sameera Ramasinghe,Thalaiyasingam Ajanthan,Gil Avraham,Yan Zuo,Alexander Long
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel requires compressing activations and activation gradients as they propagate through layers, accumulating compression errors. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation with negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method achieves up to 100x improvement in communication efficiency and enables training billion-parameter-scale models over low-end GPUs connected via consumer-grade internet speeds as low as 80Mbps, matching the convergence of centralized datacenter systems with 100Gbps connections with model parallel.

[LG-62] Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration

链接: https://arxiv.org/abs/2506.01250
作者: Youngmin Oh,Jinje Park,Taejin Paik,Jaemin Park
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we address the contextual dueling bandit problem by proposing variance-aware algorithms that leverage neural networks to approximate nonlinear utility functions. Our approach employs a \textitvariance-aware exploration strategy, which adaptively accounts for uncertainty in pairwise comparisons while relying only on the gradients with respect to the learnable parameters of the last layer. This design effectively balances the exploration–exploitation tradeoff under both the Upper Confidence Bound (UCB) and Thompson Sampling (TS) frameworks. As a result, under standard assumptions, we establish theoretical guarantees showing that our algorithms achieve sublinear cumulative average regret of order \bigol\lt(d \sqrt\sum_t=1^T \sigma_t^2 + \sqrtdT\rt), for sufficiently wide neural networks, where d is the contextual dimension, \sigma_t^2 the variance of comparisons at round t , and T the total number of rounds. We also empirically validate that our approach offers reasonable computational efficiency and achieves sublinear regret on both synthetic tasks with nonlinear utilities and real-world tasks, outperforming existing methods.

[LG-63] Stress-Testing ML Pipelines with Adversarial Data Corruption

链接: https://arxiv.org/abs/2506.01230
作者: Jiongli Zhu,Geyang Xu,Felipe Lorenzi,Boris Glavic,Babak Salimi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structured data-quality issues, such as missing values correlated with demographics, culturally biased labels, or systemic selection biases, routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce SAVAGE, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. SAVAGE employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5 %) of structured corruptions identified by SAVAGE severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, SAVAGE provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.

[LG-64] React to Surprises: Stable-by-Design Neural Feedback Control and the Youla-REN

链接: https://arxiv.org/abs/2506.01226
作者: Nicholas H. Barbara,Ruigang Wang,Alexandre Megretski,Ian R. Manchester
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study parameterizations of stabilizing nonlinear policies for learning-based control. We propose a structure based on a nonlinear version of the Youla-Kučera parameterization combined with robust neural networks such as the recurrent equilibrium network (REN). The resulting parameterizations are unconstrained, and hence can be searched over with first-order optimization methods, while always ensuring closed-loop stability by construction. We study the combination of (a) nonlinear dynamics, (b) partial observation, and © incremental closed-loop stability requirements (contraction and Lipschitzness). We find that with any two of these three difficulties, a contracting and Lipschitz Youla parameter always leads to contracting and Lipschitz closed loops. However, if all three hold, then incremental stability can be lost with exogenous disturbances. Instead, a weaker condition is maintained, which we call d-tube contraction and Lipschitzness. We further obtain converse results showing that the proposed parameterization covers all contracting and Lipschitz closed loops for certain classes of nonlinear systems. Numerical experiments illustrate the utility of our parameterization when learning controllers with built-in stability certificates for: i) ``economic’’ rewards without stabilizing effects; ii) short training horizons; and iii) uncertain systems.

[LG-65] Self-Refining Training for Amortized Density Functional Theory

链接: https://arxiv.org/abs/2506.01225
作者: Majdi Hassan,Cristian Gabellini,Hatem Helal,Dominique Beaini,Kirill Neklyudov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Density Functional Theory (DFT) allows for predicting all the chemical and physical properties of molecular systems from first principles by finding an approximate solution to the many-body Schrödinger equation. However, the cost of these predictions becomes infeasible when increasing the scale of the energy evaluations, e.g., when calculating the ground-state energy for simulating molecular dynamics. Recent works have demonstrated that, for substantially large datasets of molecular conformations, Deep Learning-based models can predict the outputs of the classical DFT solvers by amortizing the corresponding optimization problems. In this paper, we propose a novel method that reduces the dependency of amortized DFT solvers on large pre-collected datasets by introducing a self-refining training strategy. Namely, we propose an efficient method that simultaneously trains a deep-learning model to predict the DFT outputs and samples molecular conformations that are used as training data for the model. We derive our method as a minimization of the variational upper bound on the KL-divergence measuring the discrepancy between the generated samples and the target Boltzmann distribution defined by the ground state energy. To demonstrate the utility of the proposed scheme, we perform an extensive empirical study comparing it with the models trained on the pre-collected datasets. Finally, we open-source our implementation of the proposed algorithm, optimized with asynchronous training and sampling stages, which enables simultaneous sampling and training. Code is available at this https URL.

[LG-66] On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective

链接: https://arxiv.org/abs/2506.01213
作者: Ning Zhang,Henry Kenlay,Li Zhang,Mihai Cucuringu,Xiaowen Dong
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.

[LG-67] Dynamic Modes as Time Representation for Spatiotemporal Forecasting

链接: https://arxiv.org/abs/2506.01212
作者: Menglin Kong,Vincent Zhihao Zheng,Xudong Wang,Lijun Sun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces a data-driven time embedding method for modeling long-range seasonal dependencies in spatiotemporal forecasting tasks. The proposed approach employs Dynamic Mode Decomposition (DMD) to extract temporal modes directly from observed data, eliminating the need for explicit timestamps or hand-crafted time features. These temporal modes serve as time representations that can be seamlessly integrated into deep spatiotemporal forecasting models. Unlike conventional embeddings such as time-of-day indicators or sinusoidal functions, our method captures complex multi-scale periodicity through spectral analysis of spatiotemporal data. Extensive experiments on urban mobility, highway traffic, and climate datasets demonstrate that the DMD-based embedding consistently improves long-horizon forecasting accuracy, reduces residual correlation, and enhances temporal generalization. The method is lightweight, model-agnostic, and compatible with any architecture that incorporates time covariates.

[LG-68] Multiresolution Analysis and Statistical Thresholding on Dynamic Networks

链接: https://arxiv.org/abs/2506.01208
作者: Raphaël Romero,Tijl De Bie,Nick Heard,Alexander Modell
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting structural change in dynamic network data has wide-ranging applications. Existing approaches typically divide the data into time bins, extract network features within each bin, and then compare these features over time. This introduces an inherent tradeoff between temporal resolution and the statistical stability of the extracted features. Despite this tradeoff, reminiscent of time-frequency tradeoffs in signal processing, most methods rely on a fixed temporal resolution. Choosing an appropriate resolution parameter is typically difficult and can be especially problematic in domains like cybersecurity, where anomalous behavior may emerge at multiple time scales. We address this challenge by proposing ANIE (Adaptive Network Intensity Estimation), a multi-resolution framework designed to automatically identify the time scales at which network structure evolves, enabling the joint detection of both rapid and gradual changes. Modeling interactions as Poisson processes, our method proceeds in two steps: (1) estimating a low-dimensional subspace of node behavior, and (2) deriving a set of novel empirical affinity coefficients that quantify change in interaction intensity between latent factors and support statistical testing for structural change across time scales. We provide theoretical guarantees for subspace estimation and the asymptotic behavior of the affinity coefficients, enabling model-based change detection. Experiments on synthetic networks show that ANIE adapts to the appropriate time resolution and is able to capture sharp structural changes while remaining robust to noise. Furthermore, applications to real-world data showcase the practical benefits of ANIE’s multiresolution approach to detecting structural change over fixed resolution methods.

[LG-69] FedRPCA: Enhancing Federated LoRA Aggregation Using Robust PCA

链接: https://arxiv.org/abs/2506.01194
作者: Divyansh Jhunjhunwala,Arian Raje,Madan Ravi Ganesh,Chaithanya Kumar Mummadi,Chaoqun Dong,Jiawei Zhou,Wan-Yi Lin,Gauri Joshi,Zhenzhen Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:LoRA has emerged as one of the most promising fine-tuning techniques, especially for federated learning (FL), since it significantly reduces communication and computation costs at resource-constrained clients. However, data heterogeneity remains a significant challenge for LoRA-based FL, and the conventional aggregation strategy based on FedAvg suffers from slow convergence and suboptimal accuracy. Motivated by recent advances in model merging, particularly Task Arithmetic, we explore the idea of aggregating client LoRA parameters using scaled averaging. We first observe that a naive application of Task Arithmetic is ineffective due to the high cosine similarity between client updates, indicating significant common knowledge in the updates across clients. To address this issue, we propose decomposing client LoRA updates via Robust Principal Component Analysis (Robust-PCA) into a common low-rank component and client-specific sparse components. Our proposed algorithm FedRPCA aggregates the low-rank components through averaging, consolidating common knowledge, and applies scaled averaging to the sparse components to amplify client-specific knowledge. We evaluate our approach across a variety of vision and language tasks and demonstrate that it achieves higher final accuracy and faster convergence compared to competing baselines.

[LG-70] SIFBench: An Extensive Benchmark for Fatigue Analysis

链接: https://arxiv.org/abs/2506.01173
作者: Tushar Gautam,Robert M. Kirby,Jacob Hochhalter,Shandian Zhe
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fatigue-induced crack growth is a leading cause of structural failure across critical industries such as aerospace, civil engineering, automotive, and energy. Accurate prediction of stress intensity factors (SIFs) – the key parameters governing crack propagation in linear elastic fracture mechanics – is essential for assessing fatigue life and ensuring structural integrity. While machine learning (ML) has shown great promise in SIF prediction, its advancement has been severely limited by the lack of rich, transparent, well-organized, and high-quality datasets. To address this gap, we introduce SIFBench, an open-source, large-scale benchmark database designed to support ML-based SIF prediction. SIFBench contains over 5 million different crack and component geometries derived from high-fidelity finite element simulations across 37 distinct scenarios, and provides a unified Python interface for seamless data access and customization. We report baseline results using a range of popular ML models – including random forests, support vector machines, feedforward neural networks, and Fourier neural operators – alongside comprehensive evaluation metrics and template code for model training, validation, and assessment. By offering a standardized and scalable resource, SIFBench substantially lowers the entry barrier and fosters the development and application of ML methods in damage tolerance design and predictive maintenance. Subjects: Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2506.01173 [cs.DB] (or arXiv:2506.01173v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2506.01173 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-71] Accelerated Learning with Linear Temporal Logic using Differentiable Simulation

链接: https://arxiv.org/abs/2506.01167
作者: Alper Kamil Bozkurt,Calin Belta,Ming C. Lin
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:To ensure learned controllers comply with safety and reliability requirements for reinforcement learning in real-world settings remains challenging. Traditional safety assurance approaches, such as state avoidance and constrained Markov decision processes, often inadequately capture trajectory requirements or may result in overly conservative behaviors. To address these limitations, recent studies advocate the use of formal specification languages such as linear temporal logic (LTL), enabling the derivation of correct-by-construction learning objectives from the specified requirements. However, the sparse rewards associated with LTL specifications make learning extremely difficult, whereas dense heuristic-based rewards risk compromising correctness. In this work, we propose the first method, to our knowledge, that integrates LTL with differentiable simulators, facilitating efficient gradient-based learning directly from LTL specifications by coupling with differentiable paradigms. Our approach introduces soft labeling to achieve differentiable rewards and states, effectively mitigating the sparse-reward issue intrinsic to LTL without compromising objective correctness. We validate the efficacy of our method through experiments, demonstrating significant improvements in both reward attainment and training time compared to the discrete methods.

[LG-72] Nearly-Linear Time Private Hypothesis Selection with the Optimal Approximation Factor

链接: https://arxiv.org/abs/2506.01162
作者: Maryam Aliakbarpour,Zhan Shi,Ria Stevens,Vincent X. Wang
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 33 pages

点击查看摘要

Abstract:Estimating the density of a distribution from its samples is a fundamental problem in statistics. Hypothesis selection addresses the setting where, in addition to a sample set, we are given n candidate distributions – referred to as hypotheses – and the goal is to determine which one best describes the underlying data distribution. This problem is known to be solvable very efficiently, requiring roughly O(\log n) samples and running in \tildeO(n) time. The quality of the output is measured via the total variation distance to the unknown distribution, and the approximation factor of the algorithm determines how large this distance is compared to the optimal distance achieved by the best candidate hypothesis. It is known that \alpha = 3 is the optimal approximation factor for this problem. We study hypothesis selection under the constraint of differential privacy. We propose a differentially private algorithm in the central model that runs in nearly-linear time with respect to the number of hypotheses, achieves the optimal approximation factor, and incurs only a modest increase in sample complexity, which remains polylogarithmic in n . This resolves an open question posed by [Bun, Kamath, Steinke, Wu, NeurIPS 2019]. Prior to our work, existing upper bounds required quadratic time.

[LG-73] Weight-Space Linear Recurrent Neural Networks

链接: https://arxiv.org/abs/2506.01153
作者: Roussel Desmond Nzoyem,Nawid Keshtmand,Idriss Tsayem,David A.W. Barton,Tom Deakin
类目: Machine Learning (cs.LG)
*备注: 33 pages, 21 figures, 11 tables

点击查看摘要

Abstract:We introduce WARP (Weight-space Adaptive Recurrent Prediction), a simple yet powerful framework that unifies weight-space learning with linear recurrence to redefine sequence modeling. Unlike conventional recurrent neural networks (RNNs) which collapse temporal dynamics into fixed-dimensional hidden states, WARP explicitly parametrizes the hidden state as the weights of a distinct root neural network. This formulation promotes higher-resolution memory, gradient-free adaptation at test-time, and seamless integration of domain-specific physical priors. Empirical validation shows that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks, spanning synthetic benchmarks to real-world datasets. Furthermore, extensive experiments across sequential image completion, dynamical system reconstruction, and multivariate time series forecasting demonstrate its expressiveness and generalization capabilities. Critically, WARP’s weight trajectories offer valuable insights into the model’s inner workings. Ablation studies confirm the architectural necessity of key components, solidifying weight-space linear RNNs as a transformative paradigm for adaptive machine intelligence.

[LG-74] Slow Feature Analysis on Markov Chains from Goal-Directed Behavior

链接: https://arxiv.org/abs/2506.01145
作者: Merlin Schüler,Eddie Seabrook,Laurenz Wiskott
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Slow Feature Analysis is a unsupervised representation learning method that extracts slowly varying features from temporal data and can be used as a basis for subsequent reinforcement learning. Often, the behavior that generates the data on which the representation is learned is assumed to be a uniform random walk. Less research has focused on using samples generated by goal-directed behavior, as commonly the case in a reinforcement learning setting, to learn a representation. In a spatial setting, goal-directed behavior typically leads to significant differences in state occupancy between states that are close to a reward location and far from a reward location. Through the perspective of optimal slow features on ergodic Markov chains, this work investigates the effects of these differences on value-function approximation in an idealized setting. Furthermore, three correction routes, which can potentially alleviate detrimental scaling effects, are evaluated and discussed. In addition, the special case of goal-averse behavior is considered. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2506.01145 [cs.LG] (or arXiv:2506.01145v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.01145 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-75] Learning DNF through Generalized Fourier Representations

链接: https://arxiv.org/abs/2506.01075
作者: Mohsen Heidari,Roni Khardon
类目: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 54 pages

点击查看摘要

Abstract:The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, learnability of Disjunctive Normal Form (DNF) under uniform as well as product distributions has been established through such representations. This paper makes five main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution D through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation, that use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the L_1 spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by difference bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most \alpha0.5 . Lower bounds are presented to show that such constraints are necessary. The fourth contribution uses these results to show the learnability of DNF with membership queries under difference bounded tree BN. The final contribution is to develop an algorithm for learning difference-bounded tree BN distributions, thus extending the DNF learnability result to cases where the distribution is not known in advance.

[LG-76] No Soundness in the Real World: On the Challenges of the Verification of Deployed Neural Networks ICML2025

链接: https://arxiv.org/abs/2506.01054
作者: Attila Szász,Balázs Bánhelyi,Márk Jelasity
类目: Machine Learning (cs.LG)
*备注: accepted at ICML 2025. For the implementation, see this https URL

点击查看摘要

Abstract:The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.

[LG-77] A Finite-Time Analysis of TD Learning with Linear Function Approximation without Projections nor Strong Convexity

链接: https://arxiv.org/abs/2506.01052
作者: Wei-Cheng Lee,Francesco Orabona
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone algorithm in reinforcement learning. While prior work has established convergence guarantees, these results typically rely on the assumption that each iterate is projected onto a bounded set or that the learning rate is set according to the unknown strong convexity constant – conditions that are both artificial and do not match the current practice. In this paper, we challenge the necessity of such assumptions and present a refined analysis of TD learning. We show that the simple projection-free variant converges with a rate of \tilde\mathcalO(\frac||\theta^*||^2_2\sqrtT) , even in the presence of Markovian noise. Our analysis reveals a novel self-bounding property of the TD updates and exploits it to guarantee bounded iterates. Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML) Cite as: arXiv:2506.01052 [cs.LG] (or arXiv:2506.01052v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.01052 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-78] Optimistic critics can empower small actors

链接: https://arxiv.org/abs/2506.01016
作者: Olya Mastikhina,Dhruv Sreenivas,Pablo Samuel Castro
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: RLC 2025

点击查看摘要

Abstract:Actor-critic methods have been central to many of the recent advances in deep reinforcement learning. The most common approach is to use symmetric architectures, whereby both actor and critic have the same network topology and number of parameters. However, recent works have argued for the advantages of asymmetric setups, specifically with the use of smaller actors. We perform broad empirical investigations and analyses to better understand the implications of this and find that, in general, smaller actors result in performance degradation and overfit critics. Our analyses suggest poor data collection, due to value underestimation, as one of the main causes for this behavior, and further highlight the crucial role the critic can play in alleviating this pathology. We explore techniques to mitigate the observed value underestimation, which enables further research in asymmetric actor-critic methods.

[LG-79] LoRA-BAM: Input Filtering for Fine-tuned LLM s via Boxed Abstraction Monitors over LoRA Layers

链接: https://arxiv.org/abs/2506.00998
作者: Changshun Wu,Tianyi Duan,Saddek Bensalem,Chih-Hong Cheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) improves performance on domain-specific tasks but can lead to overfitting, making them unreliable on out-of-distribution (OoD) queries. We propose LoRA-BAM - a method that adds OoD detection monitors to the LoRA layer using boxed abstraction to filter questions beyond the model’s competence. Feature vectors from the fine-tuning data are extracted via the LLM and clustered. Clusters are enclosed in boxes; a question is flagged as OoD if its feature vector falls outside all boxes. To improve interpretability and robustness, we introduce a regularization loss during fine-tuning that encourages paraphrased questions to stay close in the feature space, and the enlargement of the decision boundary is based on the feature variance within a cluster. Our method complements existing defenses by providing lightweight and interpretable OoD detection.

[LG-80] Quantization-based Bounds on the Wasserstein Metric

链接: https://arxiv.org/abs/2506.00976
作者: Jonathan Bobrutsky,Amit Moscovich
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 8 figures, 7 tables

点击查看摘要

Abstract:The Wasserstein metric has become increasingly important in many machine learning applications such as generative modeling, image retrieval and domain adaptation. Despite its appeal, it is often too costly to compute. This has motivated approximation methods like entropy-regularized optimal transport, downsampling, and subsampling, which trade accuracy for computational efficiency. In this paper, we consider the challenge of computing efficient approximations to the Wasserstein metric that also serve as strict upper or lower bounds. Focusing on discrete measures on regular grids, our approach involves formulating and exactly solving a Kantorovich problem on a coarse grid using a quantized measure and specially designed cost matrix, followed by an upscaling and correction stage. This is done either in the primal or dual space to obtain valid upper and lower bounds on the Wasserstein metric of the full-resolution inputs. We evaluate our methods on the DOTmark optimal transport images benchmark, demonstrating a 10x-100x speedup compared to entropy-regularized OT while keeping the approximation error below 2%.

[LG-81] Pilot Contamination-Aware Graph Attention Network for Power Control in CFmMIMO

链接: https://arxiv.org/abs/2506.00967
作者: Tingting Zhang,Sergiy A. Vorobyov,David J. Love,Taejoon Kim,Kai Dong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimization-based power control algorithms are predominantly iterative with high computational complexity, making them impractical for real-time applications in cell-free massive multiple-input multiple-output (CFmMIMO) systems. Learning-based methods have emerged as a promising alternative, and among them, graph neural networks (GNNs) have demonstrated their excellent performance in solving power control problems. However, all existing GNN-based approaches assume ideal orthogonality among pilot sequences for user equipments (UEs), which is unrealistic given that the number of UEs exceeds the available orthogonal pilot sequences in CFmMIMO schemes. Moreover, most learning-based methods assume a fixed number of UEs, whereas the number of active UEs varies over time in practice. Additionally, supervised training necessitates costly computational resources for computing the target power control solutions for a large volume of training samples. To address these issues, we propose a graph attention network for downlink power control in CFmMIMO systems that operates in a self-supervised manner while effectively handling pilot contamination and adapting to a dynamic number of UEs. Experimental results show its effectiveness, even in comparison to the optimal accelerated projected gradient method as a baseline.

[LG-82] Reinforcement Learning with Random Time Horizons

链接: https://arxiv.org/abs/2506.00962
作者: Enric Ribera Borrell,Lorenz Richter,Christof Schütte
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite and deterministic or infinite runtimes of trajectories, we argue that multiple real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since those stopping times typically depend on the policy, their randomness has an effect on policy gradient formulas, which we (mostly for the first time) derive rigorously in this work both for stochastic and deterministic policies. We present two complementary perspectives, trajectory or state-space based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.

[LG-83] Enhancing Parallelism in Decentralized Stochastic Convex Optimization ICML2025

链接: https://arxiv.org/abs/2506.00961
作者: Ofri Eisen,Ron Dorfman,Kfir Y. Levy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025

点击查看摘要

Abstract:Decentralized learning has emerged as a powerful approach for handling large datasets across multiple machines in a communication-efficient manner. However, such methods often face scalability limitations, as increasing the number of machines beyond a certain point negatively impacts convergence rates. In this work, we propose Decentralized Anytime SGD, a novel decentralized learning algorithm that significantly extends the critical parallelism threshold, enabling the effective use of more machines without compromising performance. Within the stochastic convex optimization (SCO) framework, we establish a theoretical upper bound on parallelism that surpasses the current state-of-the-art, allowing larger networks to achieve favorable statistical guarantees and closing the gap with centralized learning in highly connected topologies.

[LG-84] Hidden Representation Clustering with Multi-Task Representation Learning towards Robust Online Budget Allocation

链接: https://arxiv.org/abs/2506.00959
作者: Xiaohan Wang,Yu Zhang,Guibin Jiang,Bing Cheng,Wei Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Marketing optimization, commonly formulated as an online budget allocation problem, has emerged as a pivotal factor in driving user growth. Most existing research addresses this problem by following the principle of ‘first predict then optimize’ for each individual, which presents challenges related to large-scale counterfactual prediction and solving complexity trade-offs. Note that the practical data quality is uncontrollable, and the solving scale tends to be tens of millions. Therefore, the existing approaches make the robust budget allocation non-trivial, especially in industrial scenarios with considerable data noise. To this end, this paper proposes a novel approach that solves the problem from the cluster perspective. Specifically, we propose a multi-task representation network to learn the inherent attributes of individuals and project the original features into high-dimension hidden representations through the first two layers of the trained network. Then, we divide these hidden representations into K groups through partitioning-based clustering, thus reformulating the problem as an integer stochastic programming problem under different total budgets. Finally, we distill the representation module and clustering model into a multi-category model to facilitate online deployment. Offline experiments validate the effectiveness and superiority of our approach compared to six state-of-the-art marketing optimization algorithms. Online A/B tests on the Meituan platform indicate that the approach outperforms the online algorithm by 0.53% and 0.65%, considering order volume (OV) and gross merchandise volume (GMV), respectively.

[LG-85] Addressing the Collaboration Dilemma in Low-Data Federated Learning via Transient Sparsity

链接: https://arxiv.org/abs/2506.00932
作者: Qiao Xiao,Boqian Wu,Andrey Poddubnyy,Elena Mocanu,Phuong H. Nguyen,Mykola Pechenizkiy,Decebal Constantin Mocanu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training across decentralized clients while preserving data privacy, leveraging aggregated updates to build robust global models. However, this training paradigm faces significant challenges due to data heterogeneity and limited local datasets, which often impede effective collaboration. In such scenarios, we identify the Layer-wise Inertia Phenomenon in FL, wherein the middle layers of global model undergo minimal updates after early communication rounds, ultimately limiting the effectiveness of global aggregation. We demonstrate the presence of this phenomenon across a wide range of federated settings, spanning diverse datasets and architectures. To address this issue, we propose LIPS (Layer-wise Inertia Phenomenon with Sparsity), a simple yet effective method that periodically introduces transient sparsity to stimulate meaningful updates and empower global aggregation. Experiments demonstrate that LIPS effectively mitigates layer-wise inertia, enhances aggregation effectiveness, and improves overall performance in various FL scenarios. This work not only deepens the understanding of layer-wise learning dynamics in FL but also paves the way for more effective collaboration strategies in resource-constrained environments. Our code is publicly available at: this https URL.

[LG-86] Q-learning with Posterior Sampling

链接: https://arxiv.org/abs/2506.00917
作者: Priyank Agrawal,Shipra Agrawal,Azmat Azati
类目: Machine Learning (cs.LG)
*备注: 39 Pages

点击查看摘要

Abstract:Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of \tilde O(H^2\sqrtSAT) , closely matching the known lower bound of \Omega(H\sqrtSAT) . Here, S, A denote the number of states and actions in the underlying Markov Decision Process (MDP), and T=KH with K being the number of episodes and H being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.

[LG-87] FourierFlow: Frequency-aware Flow Matching for Generative Turbulence Modeling

链接: https://arxiv.org/abs/2506.00862
作者: Haixin Wang,Jiashu Pan,Hao Wu,Fan Zhang,Tailin Wu
类目: Machine Learning (cs.LG)
*备注: 27 pages, 14 figures

点击查看摘要

Abstract:Modeling complex fluid systems, especially turbulence governed by partial differential equations (PDEs), remains a fundamental challenge in science and engineering. Recently, diffusion-based generative models have gained attention as a powerful approach for these tasks, owing to their capacity to capture long-range dependencies and recover hierarchical structures. However, we present both empirical and theoretical evidence showing that generative models struggle with significant spectral bias and common-mode noise when generating high-fidelity turbulent flows. Here we propose FourierFlow, a novel generative turbulence modeling framework that enhances the frequency-aware learning by both implicitly and explicitly mitigating spectral bias and common-mode noise. FourierFlow comprises three key innovations. Firstly, we adopt a dual-branch backbone architecture, consisting of a salient flow attention branch with local-global awareness to focus on sensitive turbulence areas. Secondly, we introduce a frequency-guided Fourier mixing branch, which is integrated via an adaptive fusion strategy to explicitly mitigate spectral bias in the generative model. Thirdly, we leverage the high-frequency modeling capabilities of the masked auto-encoder pre-training and implicitly align the features of the generative model toward high-frequency components. We validate the effectiveness of FourierFlow on three canonical turbulent flow scenarios, demonstrating superior performance compared to state-of-the-art methods. Furthermore, we show that our model exhibits strong generalization capabilities in challenging settings such as out-of-distribution domains, long-term temporal extrapolation, and robustness to noisy inputs. The code can be found at this https URL.

[LG-88] Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

链接: https://arxiv.org/abs/2506.00846
作者: Mana Sakai,Ryo Karakida,Masaaki Imaizumi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard 1/\sqrtn -scaling with n dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. This limiting distribution exhibits non-Gaussianity from a hierarchical structure, being Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and accurate description of finite-head attentions. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime.

[LG-89] LLM Cannot Discover Causality and Should Be Restricted to Non-Decisional Support in Causal Discovery

链接: https://arxiv.org/abs/2506.00844
作者: Xingyu Wu,Kui Yu,Jibin Wu,Kay Chen Tan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper critically re-evaluates LLMs’ role in causal discovery and argues against their direct involvement in determining causal relationships. We demonstrate that LLMs’ autoregressive, correlation-driven modeling inherently lacks the theoretical grounding for causal reasoning and introduces unreliability when used as priors in causal discovery algorithms. Through empirical studies, we expose the limitations of existing LLM-based methods and reveal that deliberate prompt engineering (e.g., injecting ground-truth knowledge) could overstate their performance, helping to explain the consistently favorable results reported in much of the current literature. Based on these findings, we strictly confined LLMs’ role to a non-decisional auxiliary capacity: LLMs should not participate in determining the existence or directionality of causal relationships, but can assist the search process for causal graphs (e.g., LLM-based heuristic search). Experiments across various settings confirm that, by strictly isolating LLMs from causal decision-making, LLM-guided heuristic search can accelerate the convergence and outperform both traditional and LLM-based methods in causal structure learning. We conclude with a call for the community to shift focus from naively applying LLMs to developing specialized models and training method that respect the core principles of causal discovery.

[LG-90] Breaker: Removing Shortcut Cues with User Clustering for Single-slot Recommendation System

链接: https://arxiv.org/abs/2506.00828
作者: Chao Wang,Yue Zheng,Yujing Zhang,Yan Feng,Zhe Wang,Xiaowei Shi,An You,Yu Chen
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a single-slot recommendation system, users are only exposed to one item at a time, and the system cannot collect user feedback on multiple items simultaneously. Therefore, only pointwise modeling solutions can be adopted, focusing solely on modeling the likelihood of clicks or conversions for items by users to learn user-item preferences, without the ability to capture the ranking information among different items directly. However, since user-side information is often much more abundant than item-side information, the model can quickly learn the differences in user intrinsic tendencies, which are independent of the items they are exposed to. This can cause these intrinsic tendencies to become a shortcut bias for the model, leading to insufficient mining of the most concerned user-item preferences. To solve this challenge, we introduce the Breaker model. Breaker integrates an auxiliary task of user representation clustering with a multi-tower structure for cluster-specific preference modeling. By clustering user representations, we ensure that users within each cluster exhibit similar characteristics, which increases the complexity of the pointwise recommendation task on the user side. This forces the multi-tower structure with cluster-driven parameter learning to better model user-item preferences, ultimately eliminating shortcut biases related to user intrinsic tendencies. In terms of training, we propose a delayed parameter update mechanism to enhance training stability and convergence, enabling end-to-end joint training of the auxiliary clustering and classification tasks. Both offline and online experiments demonstrate that our method surpasses the baselines. It has already been deployed and is actively serving tens of millions of users daily on Meituan, one of the most popular e-commerce platforms for services.

[LG-91] Uni-LoRA: One Vector is All You Need

链接: https://arxiv.org/abs/2506.00799
作者: Kaiyang Li,Shaobo Han,Qing Su,Wei Li,Zhipeng Cai,Shihao Ji
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space R^D , can be reconstructed through a projection from a subspace R^d, with d \ll D . We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, P \in R^D \times d .Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a “one-vector-only” solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.

[LG-92] A Dynamic Stiefel Graph Neural Network for Efficient Spatio-Temporal Time Series Forecasting IJCAI2025

链接: https://arxiv.org/abs/2506.00798
作者: Jiankai Zheng,Liang Xie
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCAI 2025

点击查看摘要

Abstract:Spatio-temporal time series (STTS) have been widely used in many applications. However, accurately forecasting STTS is challenging due to complex dynamic correlations in both time and space dimensions. Existing graph neural networks struggle to balance effectiveness and efficiency in modeling dynamic spatio-temporal relations. To address this problem, we propose the Dynamic Spatio-Temporal Stiefel Graph Neural Network (DST-SGNN) to efficiently process STTS. For DST-SGNN, we first introduce the novel Stiefel Graph Spectral Convolution (SGSC) and Stiefel Graph Fourier Transform (SGFT). The SGFT matrix in SGSC is constrained to lie on the Stiefel manifold, and SGSC can be regarded as a filtered graph spectral convolution. We also propose the Linear Dynamic Graph Optimization on Stiefel Manifold (LDGOSM), which can efficiently learn the SGFT matrix from the dynamic graph and significantly reduce the computational complexity. Finally, we propose a multi-layer SGSC (MSGSC) that efficiently captures complex spatio-temporal correlations. Extensive experiments on seven spatio-temporal datasets show that DST-SGNN outperforms state-of-the-art methods while maintaining relatively low computational costs.

[LG-93] Bridging Supervised and Temporal Difference Learning with Q-Conditioned Maximization

链接: https://arxiv.org/abs/2506.00795
作者: Xing Lei,Zifeng Zhuang,Shentao Yang,Sheng Xu,Yunhao Luo,Fei Shen,Xuetao Zhang,Donglin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, supervised learning (SL) methodology has emerged as an effective approach for offline reinforcement learning (RL) due to their simplicity, stability, and efficiency. However, recent studies show that SL methods lack the trajectory stitching capability, typically associated with temporal difference (TD)-based approaches. A question naturally surfaces: How can we endow SL methods with stitching capability and bridge its performance gap with TD learning? To answer this question, we introduce Q -conditioned maximization supervised learning for offline goal-conditioned RL, which enhances SL with the stitching capability through Q -conditioned policy and Q -conditioned maximization. Concretely, we propose Goal-Conditioned Reinforced Supervised Learning (GCReinSL), which consists of (1) estimating the Q -function by CVAE from the offline dataset and (2) finding the maximum Q -value within the data support by integrating Q -function maximization with Expectile Regression. In inference time, our policy chooses optimal actions based on such a maximum Q -value. Experimental results from stitching evaluations on offline RL datasets demonstrate that our method outperforms prior SL approaches with stitching capabilities and goal data augmentation techniques.

[LG-94] Learning Juntas under Markov Random Fields

链接: https://arxiv.org/abs/2506.00764
作者: Gautam Chandrasekaran,Adam Klivans
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We give an algorithm for learning O(\log n) juntas in polynomial-time with respect to Markov Random Fields (MRFs) in a smoothed analysis framework where only the external field has been randomly perturbed. This is a broad generalization of the work of Kalai and Teng, who gave an algorithm that succeeded with respect to smoothed product distributions (i.e., MRFs whose dependency graph has no edges). Our algorithm has two phases: (1) an unsupervised structure learning phase and (2) a greedy supervised learning algorithm. This is the first example where algorithms for learning the structure of an undirected graphical model lead to provably efficient algorithms for supervised learning.

[LG-95] Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

链接: https://arxiv.org/abs/2506.00744
作者: Kazuki Irie,Morris Yau,Samuel J. Gershman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop hybrid memory architectures for general-purpose sequence processing neural networks, that combine key-value memory using softmax attention (KV-memory) with dynamic synaptic memory through fast-weight programming (FW-memory) – the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments. Overall, we demonstrate how a well-designed hybrid can overcome the limitations of its individual components, offering new insights into the design principle of neural memory systems.

[LG-96] A condensing approach to multiple shooting neural ordinary differential equation

链接: https://arxiv.org/abs/2506.00724
作者: Siddharth Prabhu,Srinivas Rangarajan,Mayuresh Kothare
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Multiple-shooting is a parameter estimation approach for ordinary differential equations. In this approach, the trajectory is broken into small intervals, each of which can be integrated independently. Equality constraints are then applied to eliminate the shooting gap between the end of the previous trajectory and the start of the next trajectory. Unlike single-shooting, multiple-shooting is more stable, especially for highly oscillatory and long trajectories. In the context of neural ordinary differential equations, multiple-shooting is not widely used due to the challenge of incorporating general equality constraints. In this work, we propose a condensing-based approach to incorporate these shooting equality constraints while training a multiple-shooting neural ordinary differential equation (MS-NODE) using first-order optimization methods such as Adam.

[LG-97] RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models

链接: https://arxiv.org/abs/2506.00710
作者: Valter Hudovernik,Minkai Xu,Juntong Shi,Lovro Šubelj,Stefano Ermon,Erik Štrumbelj,Jure Leskovec
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a 2K+ SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, both of which are crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at this https URL.

[LG-98] Central Path Proximal Policy Optimization

链接: https://arxiv.org/abs/2506.00700
作者: Nikola Milosevic,Johannes Müller,Nico Scherf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In constrained Markov decision processes, enforcing constraints during training is often thought of as decreasing the final return. Recently, it was shown that constraints can be incorporated directly in the policy geometry, yielding an optimization trajectory close to the central path of a barrier method, which does not compromise final return. Building on this idea, we introduce Central Path Proximal Policy Optimization (C3PO), a simple modification of PPO that produces policy iterates, which stay close to the central path of the constrained optimization problem. Compared to existing on-policy methods, C3PO delivers improved performance with tighter constraint enforcement, suggesting that central path-guided updates offer a promising direction for constrained policy optimization.

[LG-99] Learning to Upsample and Upmix Audio in the Latent Domain

链接: https://arxiv.org/abs/2506.00681
作者: Dimitrios Bralios,Paris Smaragdis,Jonah Casebeer
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder’s latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo up-mixing, we demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.

[LG-100] PackHero: A Scalable Graph-based Approach for Efficient Packer Identification

链接: https://arxiv.org/abs/2506.00659
作者: Marco Di Gennaro,Mario D’Onghia,Mario Polino,Stefano Zanero,Michele Carminati
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anti-analysis techniques, particularly packing, challenge malware analysts, making packer identification fundamental. Existing packer identifiers have significant limitations: signature-based methods lack flexibility and struggle against dynamic evasion, while Machine Learning approaches require extensive training data, limiting scalability and adaptability. Consequently, achieving accurate and adaptable packer identification remains an open problem. This paper presents PackHero, a scalable and efficient methodology for identifying packers using a novel static approach. PackHero employs a Graph Matching Network and clustering to match and group Call Graphs from programs packed with known packers. We evaluate our approach on a public dataset of malware and benign samples packed with various packers, demonstrating its effectiveness and scalability across varying sample sizes. PackHero achieves a macro-average F1-score of 93.7% with just 10 samples per packer, improving to 98.3% with 100 samples. Notably, PackHero requires fewer samples to achieve stable performance compared to other Machine Learning-based tools. Overall, PackHero matches the performance of State-of-the-art signature-based tools, outperforming them in handling Virtualization-based packers such as Themida/Winlicense, with a recall of 100%.

[LG-101] Rethinking Neural-based Matrix Inversion: Why cant and Where can

链接: https://arxiv.org/abs/2506.00642
作者: Yuliang Ji,Jian Wu,Yuanzhe Xi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks have achieved substantial success across various scientific computing tasks. A pivotal challenge within this domain is the rapid and parallel approximation of matrix inverses, critical for numerous applications. Despite significant progress, there currently exists no universal neural-based method for approximating matrix inversion. This paper presents a theoretical analysis demonstrating the fundamental limitations of neural networks in developing a general matrix inversion model. We expand the class of Lipschitz functions to encompass a wider array of neural network models, thereby refining our theoretical approach. Moreover, we delineate specific conditions under which neural networks can effectively approximate matrix inverses. Our theoretical results are supported by experimental results from diverse matrix datasets, exploring the efficacy of neural networks in addressing the matrix inversion challenge.

[LG-102] Probabilistic Forecasting for Building Energy Systems using Time-Series Foundation Models NEURIPS

链接: https://arxiv.org/abs/2506.00630
作者: Young Jin Park,Francois Germain,Jing Liu,Ye Wang,Toshiaki Koike-Akino,Gordon Wichern,Navid Azizan,Christopher R. Laughman,Ankush Chakrabarty
类目: Machine Learning (cs.LG)
*备注: Preliminary version appeared in NeurIPS TSALM Workshop: this https URL

点击查看摘要

Abstract:Decision-making in building energy systems critically depends on the predictive accuracy of relevant time-series models. In scenarios lacking extensive data from a target building, foundation models (FMs) represent a promising technology that can leverage prior knowledge from vast and diverse pre-training datasets to construct accurate probabilistic predictors for use in decision-making tools. This paper investigates the applicability and fine-tuning strategies of time-series foundation models (TSFMs) in building energy forecasting. We analyze both full fine-tuning and parameter-efficient fine-tuning approaches, particularly low-rank adaptation (LoRA), by using real-world data from a commercial net-zero energy building to capture signals such as room occupancy, carbon emissions, plug loads, and HVAC energy consumption. Our analysis reveals that the zero-shot predictive performance of TSFMs is generally suboptimal. To address this shortcoming, we demonstrate that employing either full fine-tuning or parameter-efficient fine-tuning significantly enhances forecasting accuracy, even with limited historical data. Notably, fine-tuning with low-rank adaptation (LoRA) substantially reduces computational costs without sacrificing accuracy. Furthermore, fine-tuned TSFMs consistently outperform state-of-the-art deep forecasting models (e.g., temporal fusion transformers) in accuracy, robustness, and generalization across varying building zones and seasonal conditions. These results underline the efficacy of TSFMs for practical, data-constrained building energy management systems, enabling improved decision-making in pursuit of energy efficiency and sustainability.

[LG-103] Model Reprogramming Demystified: A Neural Tangent Kernel Perspective

链接: https://arxiv.org/abs/2506.00620
作者: Ming-Yu Chung,Jiashuo Fan,Hancheng Ye,Qinsi Wang,Wei-Chen Shen,Chia-Mu Yu,Pin-Yu Chen,Sy-Yen Kuo
类目: Machine Learning (cs.LG)
*备注: 24 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Model Reprogramming (MR) is a resource-efficient framework that adapts large pre-trained models to new tasks with minimal additional parameters and data, offering a promising solution to the challenges of training large models for diverse tasks. Despite its empirical success across various domains such as computer vision and time-series forecasting, the theoretical foundations of MR remain underexplored. In this paper, we present a comprehensive theoretical analysis of MR through the lens of the Neural Tangent Kernel (NTK) framework. We demonstrate that the success of MR is governed by the eigenvalue spectrum of the NTK matrix on the target dataset and establish the critical role of the source model’s effectiveness in determining reprogramming outcomes. Our contributions include a novel theoretical framework for MR, insights into the relationship between source and target models, and extensive experiments validating our findings.

[LG-104] Constrained Stein Variational Gradient Descent for Robot Perception Planning and Identification

链接: https://arxiv.org/abs/2506.00589
作者: Griffin Tabor,Tucker Hermans
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many core problems in robotics can be framed as constrained optimization problems. Often on these problems, the robotic system has uncertainty, or it would be advantageous to identify multiple high quality feasible solutions. To enable this, we present two novel frameworks for applying principles of constrained optimization to the new variational inference algorithm Stein variational gradient descent. Our general framework supports multiple types of constrained optimizers and can handle arbitrary constraints. We demonstrate on a variety of problems that we are able to learn to approximate distributions without violating constraints. Specifically, we show that we can build distributions of: robot motion plans that exactly avoid collisions, robot arm joint angles on the SE(3) manifold with exact table placement constraints, and object poses from point clouds with table placement constraints.

[LG-105] Decoding the Stressed Brain with Geometric Machine Learning ALT

链接: https://arxiv.org/abs/2506.00587
作者: Sonia Koszut,Sam Nallaperuma-Herzberg,Pietro Lio
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures. This version has been accepted as a full paper at the 2025 AI in Healthcare (AIiH) Conference

点击查看摘要

Abstract:Stress significantly contributes to both mental and physical disorders, yet traditional self-reported questionnaires are inherently subjective. In this study, we introduce a novel framework that employs geometric machine learning to detect stress from raw EEG recordings. Our approach constructs graphs by integrating structural connectivity (derived from electrode spatial arrangement) with functional connectivity from pairwise signal correlations. A spatio-temporal graph convolutional network (ST-GCN) processes these graphs to capture spatial and temporal dynamics. Experiments on the SAM-40 dataset show that the ST-GCN outperforms standard machine learning models on all key classification metrics and enhances interpretability, explored through ablation analyses of key channels and brain regions. These results pave the way for more objective and accurate stress detection methods.

[LG-106] Slow Feature Analysis as Variational Inference Objective

链接: https://arxiv.org/abs/2506.00580
作者: Merlin Schüler,Laurenz Wiskott
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work presents a novel probabilistic interpretation of Slow Feature Analysis (SFA) through the lens of variational inference. Unlike prior formulations that recover linear SFA from Gaussian state-space models with linear emissions, this approach relaxes the key constraint of linearity. While it does not lead to full equivalence to non-linear SFA, it recasts the classical slowness objective in a variational framework. Specifically, it allows the slowness objective to be interpreted as a regularizer to a reconstruction loss. Furthermore, we provide arguments, why – from the perspective of slowness optimization – the reconstruction loss takes on the role of the constraints that ensure informativeness in SFA. We conclude with a discussion of potential new research directions.

[LG-107] Neural Estimation for Scaling Entropic Multimarginal Optimal Transport

链接: https://arxiv.org/abs/2506.00573
作者: Dor Tsur,Ziv Goldfeld,Kristjan Greenewald,Haim Permuter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multimarginal optimal transport (MOT) is a powerful framework for modeling interactions between multiple distributions, yet its applicability is bottlenecked by a high computational overhead. Entropic regularization provides computational speedups via the multimarginal Sinkhorn algorithm, whose time complexity, for a dataset size n and k marginals, generally scales as O(n^k) . However, this dependence on the dataset size n is computationally prohibitive for many machine learning problems. In this work, we propose a new computational framework for entropic MOT, dubbed Neural Entropic MOT (NEMOT), that enjoys significantly improved scalability. NEMOT employs neural networks trained using mini-batches, which transfers the computational complexity from the dataset size to the size of the mini-batch, leading to substantial gains. We provide formal guarantees on the accuracy of NEMOT via non-asymptotic error bounds. We supplement these with numerical results that demonstrate the performance gains of NEMOT over Sinkhorn’s algorithm, as well as extensions to neural computation of multimarginal entropic Gromov-Wasserstein alignment. In particular, orders-of-magnitude speedups are observed relative to the state-of-the-art, with a notable increase in the feasible number of samples and marginals. NEMOT seamlessly integrates as a module in large-scale machine learning pipelines, and can serve to expand the practical applicability of entropic MOT for tasks involving multimarginal data.

[LG-108] AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLM s ACL2025

链接: https://arxiv.org/abs/2506.00569
作者: Nicholas E. Corrado,Julian Katz-Samuels,Adithya Devraj,Hyokun Yun,Chao Zhang,Yi Xu,Yi Pan,Bing Yin,Trishul Chilimbi
类目: Machine Learning (cs.LG)
*备注: ACL 2025, Main Conference

点击查看摘要

Abstract:When aligning large language models (LLMs), their performance on various tasks (such as being helpful, harmless, and honest) depends heavily on the composition of their training data. However, selecting a data mixture that achieves strong performance across all tasks is challenging. Existing approaches rely on large ablation studies, heuristics, or human intuition, but these can be prohibitively expensive and suboptimal. We study this problem in the setting of preference optimization via DPO and introduce AutoMixAlign (AMA), a theoretically-grounded algorithm that adaptively mixes datasets during training to balance performance across tasks. AMA first trains \textitspecialist models for each task to determine losses that correspond to strong task performance. Then, it trains a generalist model using a novel minimax optimization that prioritizes tasks for which generalist model losses deviate most from specialist model losses. To optimize this problem, we propose two algorithms: (1) AMA-R, which adaptively reweights the objective to prioritize tasks, and (2) AMA-S, which adaptively adjusts how much data is sampled from each task to prioritize tasks. Both algorithms achieve a convergence rate of O(1/\sqrtT) in the convex case. AMA-R’s convergence result follows from Sagawa et al. (2019), and we provide a convergence proof for AMA-S using online learning techniques such as EXP3. We evaluate AMA on several multitask alignment setups and find that AMA outperforms the standard alignment approach – which simply optimizes the total loss across all tasks – and also outperforms model merging methods.

[LG-109] RsGCN: Rescaling Enhances Generalization of GCNs for Solving Scalable Traveling Salesman Problems

链接: https://arxiv.org/abs/2506.00533
作者: Junquan Huang,Zong-Gan Chen,Yuncheng Jiang,Zhi-Hui Zhan
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Neural traveling salesman problem (TSP) solvers face two critical challenges: poor generalization for scalable TSPs and high training costs. To address these challenges, we propose a new Rescaling Graph Convolutional Network (RsGCN). Focusing on the scale-dependent features (i.e., features varied with problem scales) related to nodes and edges that influence the sensitivity of GCNs to the problem scales, a Rescaling Mechanism in RsGCN enhances the generalization capability by (1) rescaling adjacent nodes to construct a subgraph with a uniform number of adjacent nodes for each node across various scales of TSPs, which stabilizes the graph message aggregation; (2) rescaling subgraph edges to adjust the lengths of subgraph edges to the same magnitude, which maintains numerical consistency. In addition, an efficient training strategy with a mixed-scale dataset and bidirectional loss is used in RsGCN. To fully exploit the heatmaps generated by RsGCN, we design an efficient post-search algorithm termed Re2Opt, in which a reconstruction process based on adaptive weight is incorporated to help avoid local optima. Based on a combined architecture of RsGCN and Re2Opt, our solver achieves remarkable generalization and low training cost: with only 3 epochs of training on the mixed-scale dataset containing instances with up to 100 nodes, it can be generalized successfully to 10K-node instances without any fine-tuning. Extensive experiments demonstrate our state-of-the-art performance across uniform distribution instances of 9 different scales from 20 to 10K nodes and 78 real-world instances from TSPLIB, while requiring the fewest learnable parameters and training epochs among neural competitors.

[LG-110] Ultra-Quantisation: Efficient Embedding Search via 1.58-bit Encodings

链接: https://arxiv.org/abs/2506.00528
作者: Richard Connor,Alan Dearle,Ben Claydon
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Submitted to SISAP25 International Conference on Similarity Search and Applications

点击查看摘要

Abstract:Many modern search domains comprise high-dimensional vectors of floating point numbers derived from neural networks, in the form of embeddings. Typical embeddings range in size from hundreds to thousands of dimensions, making the size of the embeddings, and the speed of comparison, a significant issue. Quantisation is a class of mechanism which replaces the floating point values with a smaller representation, for example a short integer. This gives an approximation of the embedding space in return for a smaller data representation and a faster comparison function. Here we take this idea almost to its extreme: we show how vectors of arbitrary-precision floating point values can be replaced by vectors whose elements are drawn from the set -1,0,1. This yields very significant savings in space and metric evaluation cost, while maintaining a strong correlation for similarity measurements. This is achieved by way of a class of convex polytopes which exist in the high-dimensional space. In this article we give an outline description of these objects, and show how they can be used for the basis of such radical quantisation while maintaining a surprising degree of accuracy. Comments: Submitted to SISAP25 International Conference on Similarity Search and Applications Subjects: Machine Learning (cs.LG); Databases (cs.DB) Cite as: arXiv:2506.00528 [cs.LG] (or arXiv:2506.00528v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.00528 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Richard Connor [view email] [v1] Sat, 31 May 2025 12:22:24 UTC (828 KB)

[LG-111] From Rules to Rewards: Reinforcement Learning for Interest Rate Adjustment in DeFi Lending

链接: https://arxiv.org/abs/2506.00505
作者: Hanxiao Qu,Krzysztof Gogol,Florian Groetschla,Claudio Tessone
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized Finance (DeFi) lending enables permissionless borrowing via smart contracts. However, it faces challenges in optimizing interest rates, mitigating bad debt, and improving capital efficiency. Rule-based interest-rate models struggle to adapt to dynamic market conditions, leading to inefficiencies. This work applies Offline Reinforcement Learning (RL) to optimize interest rate adjustments in DeFi lending protocols. Using historical data from Aave protocol, we evaluate three RL approaches: Conservative Q-Learning (CQL), Behavior Cloning (BC), and TD3 with Behavior Cloning (TD3-BC). TD3-BC demonstrates superior performance in balancing utilization, capital stability, and risk, outperforming existing models. It adapts effectively to historical stress events like the May 2021 crash and the March 2023 USDC depeg, showcasing potential for automated, real-time governance.

[LG-112] Federated learning framework for collaborative remaining useful life prognostics: an aircraft engine case study

链接: https://arxiv.org/abs/2506.00499
作者: Diogo Landau,Ingeborg de Pater,Mihaela Mitici,Nishant Saurabh
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Complex systems such as aircraft engines are continuously monitored by sensors. In predictive aircraft maintenance, the collected sensor measurements are used to estimate the health condition and the Remaining Useful Life (RUL) of such systems. However, a major challenge when developing prognostics is the limited number of run-to-failure data samples. This challenge could be overcome if multiple airlines would share their run-to-failure data samples such that sufficient learning can be achieved. Due to privacy concerns, however, airlines are reluctant to share their data in a centralized setting. In this paper, a collaborative federated learning framework is therefore developed instead. Here, several airlines cooperate to train a collective RUL prognostic machine learning model, without the need to centrally share their data. For this, a decentralized validation procedure is proposed to validate the prognostics model without sharing any data. Moreover, sensor data is often noisy and of low quality. This paper therefore proposes four novel methods to aggregate the parameters of the global prognostic model. These methods enhance the robustness of the FL framework against noisy data. The proposed framework is illustrated for training a collaborative RUL prognostic model for aircraft engines, using the N-CMAPSS dataset. Here, six airlines are considered, that collaborate in the FL framework to train a collective RUL prognostic model for their aircraft’s engines. When comparing the proposed FL framework with the case where each airline independently develops their own prognostic model, the results show that FL leads to more accurate RUL prognostics for five out of the six airlines. Moreover, the novel robust aggregation methods render the FL framework robust to noisy data samples.

[LG-113] owards Graph-Based Privacy-Preserving Federated Learning: ModelNet - A ResNet-based Model Classification Dataset

链接: https://arxiv.org/abs/2506.00476
作者: Abhisek Ray,Lukas Esterle
类目: Machine Learning (cs.LG)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a powerful paradigm for training machine learning models across distributed data sources while preserving data locality. However, the privacy of local data is always a pivotal concern and has received a lot of attention in recent research on the FL regime. Moreover, the lack of domain heterogeneity and client-specific segregation in the benchmarks remains a critical bottleneck for rigorous evaluation. In this paper, we introduce ModelNet, a novel image classification dataset constructed from the embeddings extracted from a pre-trained ResNet50 model. First, we modify the CIFAR100 dataset into three client-specific variants, considering three domain heterogeneities (homogeneous, heterogeneous, and random). Subsequently, we train each client-specific subset of all three variants on the pre-trained ResNet50 model to save model parameters. In addition to multi-domain image data, we propose a new hypothesis to define the FL algorithm that can access the anonymized model parameters to preserve the local privacy in a more effective manner compared to existing ones. ModelNet is designed to simulate realistic FL settings by incorporating non-IID data distributions and client diversity design principles in the mainframe for both conventional and futuristic graph-driven FL algorithms. The three variants are ModelNet-S, ModelNet-D, and ModelNet-R, which are based on homogeneous, heterogeneous, and random data settings, respectively. To the best of our knowledge, we are the first to propose a cross-environment client-specific FL dataset along with the graph-based variant. Extensive experiments based on domain shifts and aggregation strategies show the effectiveness of the above variants, making it a practical benchmark for classical and graph-based FL research. The dataset and related code are available online.

[LG-114] Revisiting LLM s as Zero-Shot Time-Series Forecasters: Small Noise Can Break Large Models ACL

链接: https://arxiv.org/abs/2506.00457
作者: Junwoo Park,Hyuck Lee,Dohyun Lee,Daehoon Gwak,Jaegul Choo
类目: Machine Learning (cs.LG)
*备注: Annual Meeting of the Association for Computational Linguistics (ACL), 2025, Accepted as Short Paper

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable performance across diverse tasks without domain-specific training, fueling interest in their potential for time-series forecasting. While LLMs have shown potential in zero-shot forecasting through prompting alone, recent studies suggest that LLMs lack inherent effectiveness in forecasting. Given these conflicting findings, a rigorous validation is essential for drawing reliable conclusions. In this paper, we evaluate the effectiveness of LLMs as zero-shot forecasters compared to state-of-the-art domain-specific models. Our experiments show that LLM-based zero-shot forecasters often struggle to achieve high accuracy due to their sensitivity to noise, underperforming even simple domain-specific models. We have explored solutions to reduce LLMs’ sensitivity to noise in the zero-shot setting, but improving their robustness remains a significant challenge. Our findings suggest that rather than emphasizing zero-shot forecasting, a more promising direction would be to focus on fine-tuning LLMs to better process numerical sequences. Our experimental code is available at this https URL.

[LG-115] DV365: Extremely Long User History Modeling at Instagram KDD2025

链接: https://arxiv.org/abs/2506.00450
作者: Wenhan Lyu,Devashish Tyagi,Yihang Yang,Ziwei Li,Ajay Somani,Karthikeyan Shanmugasundaram,Nikola Andrejevic,Ferdi Adeputra,Curtis Zeng,Arun K. Singh,Maxime Ransan,Sagar Jain
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: SIGKDD 2025 accepted

点击查看摘要

Abstract:Long user history is highly valuable signal for recommendation systems, but effectively incorporating it often comes with high cost in terms of data center power consumption and GPU. In this work, we chose offline embedding over end-to-end sequence length optimization methods to enable extremely long user sequence modeling as a cost-effective solution, and propose a new user embedding learning strategy, multi-slicing and summarization, that generates highly generalizable user representation of user’s long-term stable interest. History length we encoded in this embedding is up to 70,000 and on average 40,000. This embedding, named as DV365, is proven highly incremental on top of advanced attentive user sequence models deployed in Instagram. Produced by a single upstream foundational model, it is launched in 15 different models across Instagram and Threads with significant impact, and has been production battle-proven for 1 year since our first launch.

[LG-116] PSI-PFL: Population Stability Index for Client Selection in non-IID Personalized Federated Learning

链接: https://arxiv.org/abs/2506.00440
作者: Daniel-M. Jimenez-Gutierrez,David Solans,Mohammed Elbamby,Nicolas Kourtellis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables decentralized machine learning (ML) model training while preserving data privacy by keeping data localized across clients. However, non-independent and identically distributed (non-IID) data across clients poses a significant challenge, leading to skewed model updates and performance degradation. Addressing this, we propose PSI-PFL, a novel client selection framework for Personalized Federated Learning (PFL) that leverages the Population Stability Index (PSI) to quantify and mitigate data heterogeneity (so-called non-IIDness). Our approach selects more homogeneous clients based on PSI, reducing the impact of label skew, one of the most detrimental factors in FL performance. Experimental results over multiple data modalities (tabular, image, text) demonstrate that PSI-PFL significantly improves global model accuracy, outperforming state-of-the-art baselines by up to 10% under non-IID scenarios while ensuring fairer local performance. PSI-PFL enhances FL performance and offers practical benefits in applications where data privacy and heterogeneity are critical.

[LG-117] PointODE: Lightweight Point Cloud Learning with Neural Ordinary Differential Equations on Edge

链接: https://arxiv.org/abs/2506.00438
作者: Keisuke Sugiura,Mizuki Yasuda,Hiroki Matsutani
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Embedded edge devices are often used as a computing platform to run real-world point cloud applications, but recent deep learning-based methods may not fit on such devices due to limited resources. In this paper, we aim to fill this gap by introducing PointODE, a parameter-efficient ResNet-like architecture for point cloud feature extraction based on a stack of MLP blocks with residual connections. We leverage Neural ODE (Ordinary Differential Equation), a continuous-depth version of ResNet originally developed for modeling the dynamics of continuous-time systems, to compress PointODE by reusing the same parameters across MLP blocks. The point-wise normalization is proposed for PointODE to handle the non-uniform distribution of feature points. We introduce PointODE-Elite as a lightweight version with 0.58M trainable parameters and design its dedicated accelerator for embedded FPGAs. The accelerator consists of a four-stage pipeline to parallelize the feature extraction for multiple points and stores the entire parameters on-chip to eliminate most of the off-chip data transfers. Compared to the ARM Cortex-A53 CPU, the accelerator implemented on a Xilinx ZCU104 board speeds up the feature extraction by 4.9x, leading to 3.7x faster inference and 3.5x better energy-efficiency. Despite the simple architecture, PointODE-Elite shows competitive accuracy to the state-of-the-art models on both synthetic and real-world classification datasets, greatly improving the trade-off between accuracy and inference cost.

[LG-118] IDFormer: Exploiting Temporal and Interactive Dynamics Makes A Great Dynamic Graph Transformer KDD2025

链接: https://arxiv.org/abs/2506.00431
作者: Jie Peng,Zhewei Wei,Yuhang Ye
类目: Machine Learning (cs.LG)
*备注: KDD2025

点击查看摘要

Abstract:Due to the proficiency of self-attention mechanisms (SAMs) in capturing dependencies in sequence modeling, several existing dynamic graph neural networks (DGNNs) utilize Transformer architectures with various encoding designs to capture sequential evolutions of dynamic graphs. However, the effectiveness and efficiency of these Transformer-based DGNNs vary significantly, highlighting the importance of properly defining the SAM on dynamic graphs and comprehensively encoding temporal and interactive dynamics without extra complex modules. In this work, we propose TIDFormer, a dynamic graph TransFormer that fully exploits Temporal and Interactive Dynamics in an efficient manner. We clarify and verify the interpretability of our proposed SAM, addressing the open problem of its uninterpretable definitions on dynamic graphs in previous works. To model the temporal and interactive dynamics, respectively, we utilize the calendar-based time partitioning information and extract informative interaction embeddings for both bipartite and non-bipartite graphs using merely the sampled first-order neighbors. In addition, we jointly model temporal and interactive features by capturing potential changes in historical interaction patterns through a simple decomposition. We conduct extensive experiments on several dynamic graph datasets to verify the effectiveness and efficiency of TIDFormer. The experimental results demonstrate that TIDFormer excels, outperforming state-of-the-art models across most datasets and experimental settings. Furthermore, TIDFormer exhibits significant efficiency advantages compared to previous Transformer-based methods.

[LG-119] Blockchain-Enabled Privacy-Preserving Second-Order Federated Edge Learning in Personalized Healthcare

链接: https://arxiv.org/abs/2506.00416
作者: Anum Nawaz,Muhammad Irfan,Xianjia Yu,Zhuo Zou,Tomi Westerlund
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has attracted increasing attention to mitigate security and privacy challenges in traditional cloud-centric machine learning models specifically in healthcare ecosystems. FL methodologies enable the training of global models through localized policies, allowing independent operations at the edge clients’ level. Conventional first-order FL approaches face several challenges in personalized model training due to heterogeneous non-independent and identically distributed (non-iid) data of each edge client. Recently, second-order FL approaches maintain the stability and consistency of non-iid datasets while improving personalized model training. This study proposes and develops a verifiable and auditable optimized second-order FL framework BFEL (blockchain-enhanced federated edge learning) based on optimized FedCurv for personalized healthcare systems. FedCurv incorporates information about the importance of each parameter to each client’s task (through Fisher Information Matrix) which helps to preserve client-specific knowledge and reduce model drift during aggregation. Moreover, it minimizes communication rounds required to achieve a target precision convergence for each edge client while effectively managing personalized training on non-iid and heterogeneous data. The incorporation of Ethereum-based model aggregation ensures trust, verifiability, and auditability while public key encryption enhances privacy and security. Experimental results of federated CNNs and MLPs utilizing Mnist, Cifar-10, and PathMnist demonstrate the high efficiency and scalability of the proposed framework.

[LG-120] JojoSCL: Shrinkage Contrastive Learning for single-cell RNA sequence Clustering

链接: https://arxiv.org/abs/2506.00410
作者: Ziwen Wang
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular processes by enabling gene expression analysis at the individual cell level. Clustering allows for the identification of cell types and the further discovery of intrinsic patterns in single-cell data. However, the high dimensionality and sparsity of scRNA-seq data continue to challenge existing clustering models. In this paper, we introduce JojoSCL, a novel self-supervised contrastive learning framework for scRNA-seq clustering. By incorporating a shrinkage estimator based on hierarchical Bayesian estimation, which adjusts gene expression estimates towards more reliable cluster centroids to reduce intra-cluster dispersion, and optimized using Stein’s Unbiased Risk Estimate (SURE), JojoSCL refines both instance-level and cluster-level contrastive learning. Experiments on ten scRNA-seq datasets substantiate that JojoSCL consistently outperforms prevalent clustering methods, with further validation of its practicality through robustness analysis and ablation studies. JojoSCL’s code is available at: this https URL.

[LG-121] CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries ICML2025

链接: https://arxiv.org/abs/2506.00388
作者: Ni Mu,Hao Hu,Xiao Hu,Yiqin Yang,Bo Xu,Qing-Shan Jia
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL’s real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in both non-ideal teachers and real human feedback settings. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.

[LG-122] Deep-Learning-Driven Prefetching for Far Memory

链接: https://arxiv.org/abs/2506.00384
作者: Yutong Huang,Zhiyuan Guo,Yiying Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)
*备注:

点击查看摘要

Abstract:Modern software systems face increasing runtime performance demands, particularly in emerging architectures like far memory, where local-memory misses incur significant latency. While machine learning (ML) has proven effective in offline systems optimization, its application to high-frequency, runtime-level problems remains limited due to strict performance, generalization, and integration constraints. We present FarSight, a Linux-based far-memory system that leverages deep learning (DL) to efficiently perform accurate data prefetching. FarSight separates application semantics from runtime memory layout, allowing offline-trained DL models to predict access patterns using a compact vocabulary of ordinal possibilities, resolved at runtime through lightweight mapping structures. By combining asynchronous inference, lookahead prediction, and a cache-resident DL model, FarSight achieves high prediction accuracy with low runtime overhead. Our evaluation of FarSight on four data-intensive workloads shows that it outperforms the state-of-the-art far-memory system by up to 3.6 times. Overall, this work demonstrates the feasibility and advantages of applying modern ML techniques to complex, performance-critical software runtime problems.

[LG-123] FSNet: Feasibility-Seeking Neural Network for Constrained Optimization with Guarantees

链接: https://arxiv.org/abs/2506.00362
作者: Hoang T. Nguyen,Priya L. Donti
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Efficiently solving constrained optimization problems is crucial for numerous real-world applications, yet traditional solvers are often computationally prohibitive for real-time use. Machine learning-based approaches have emerged as a promising alternative to provide approximate solutions at faster speeds, but they struggle to strictly enforce constraints, leading to infeasible solutions in practice. To address this, we propose the Feasibility-Seeking-Integrated Neural Network (FSNet), which integrates a feasibility-seeking step directly into its solution procedure to ensure constraint satisfaction. This feasibility-seeking step solves an unconstrained optimization problem that minimizes constraint violations in a differentiable manner, enabling end-to-end training and providing guarantees on feasibility and convergence. Our experiments across a range of different optimization problems, including both smooth/nonsmooth and convex/nonconvex problems, demonstrate that FSNet can provide feasible solutions with solution quality comparable to (or in some cases better than) traditional solvers, at significantly faster speeds.

[LG-124] he iNaturalist Sounds Dataset

链接: https://arxiv.org/abs/2506.00343
作者: Mustafa Chasmai,Alexander Shepard,Subhransu Maji,Grant Van Horn
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We present the iNaturalist Sounds Dataset (iNatSounds), a collection of 230,000 audio files capturing sounds from over 5,500 species, contributed by more than 27,000 recordists worldwide. The dataset encompasses sounds from birds, mammals, insects, reptiles, and amphibians, with audio and species labels derived from observations submitted to iNaturalist, a global citizen science platform. Each recording in the dataset varies in length and includes a single species annotation. We benchmark multiple backbone architectures, comparing multiclass classification objectives with multilabel objectives. Despite weak labeling, we demonstrate that iNatSounds serves as a useful pretraining resource by benchmarking it on strongly labeled downstream evaluation datasets. The dataset is available as a single, freely accessible archive, promoting accessibility and research in this important domain. We envision models trained on this data powering next-generation public engagement applications, and assisting biologists, ecologists, and land use managers in processing large audio collections, thereby contributing to the understanding of species compositions in diverse soundscapes.

[LG-125] Channel-Imposed Fusion: A Simple yet Effective Method for Medical Time Series Classification

链接: https://arxiv.org/abs/2506.00337
作者: Ming Hu,Jianfu Yin,Mingyu Dou,Yuqi Wang,Ruochen Dang,Siyi Liang,Cong Hu,Yao Wang,Bingliang Hu,Quan Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The automatic classification of medical time series signals, such as electroencephalogram (EEG) and electrocardiogram (ECG), plays a pivotal role in clinical decision support and early detection of diseases. Although Transformer based models have achieved notable performance by implicitly modeling temporal dependencies through self-attention mechanisms, their inherently complex architectures and opaque reasoning processes undermine their trustworthiness in high stakes clinical settings. In response to these limitations, this study shifts focus toward a modeling paradigm that emphasizes structural transparency, aligning more closely with the intrinsic characteristics of medical data. We propose a novel method, Channel Imposed Fusion (CIF), which enhances the signal-to-noise ratio through cross-channel information fusion, effectively reduces redundancy, and improves classification performance. Furthermore, we integrate CIF with the Temporal Convolutional Network (TCN), known for its structural simplicity and controllable receptive field, to construct an efficient and explicit classification framework. Experimental results on multiple publicly available EEG and ECG datasets demonstrate that the proposed method not only outperforms existing state-of-the-art (SOTA) approaches in terms of various classification metrics, but also significantly enhances the transparency of the classification process, offering a novel perspective for medical time series classification.

[LG-126] Active Learning via Regression Beyond Realizability

链接: https://arxiv.org/abs/2506.00316
作者: Atul Ganju,Shashaank Aiyer,Ved Sriraman,Karthik Sridharan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new active learning framework for multiclass classification based on surrogate risk minimization that operates beyond the standard realizability assumption. Existing surrogate-based active learning algorithms crucially rely on realizability \unicodex2014 the assumption that the optimal surrogate predictor lies within the model class \unicodex2014 limiting their applicability in practical, misspecified settings. In this work we show that under conditions significantly weaker than realizability, as long as the class of models considered is convex, one can still obtain a label and sample complexity comparable to prior work. Despite achieving similar rates, the algorithmic approaches from prior works can be shown to fail in non-realizable settings where our assumption is satisfied. Our epoch-based active learning algorithm departs from prior methods by fitting a model from the full class to the queried data in each epoch and returning an improper classifier obtained by aggregating these models.

[LG-127] Learning Aerodynamics for the Control of Flying Humanoid Robots

链接: https://arxiv.org/abs/2506.00305
作者: Antonello Paolino,Gabriele Nava,Fabio Di Natale,Fabio Bergonti,Punith Reddy Vanteddu,Donato Grassi,Luca Riccobene,Alex Zanotti,Renato Tognaccini,Gianluca Iaccarino,Daniele Pucci
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robots with multi-modal locomotion are an active research field due to their versatility in diverse environments. In this context, additional actuation can provide humanoid robots with aerial capabilities. Flying humanoid robots face challenges in modeling and control, particularly with aerodynamic forces. This paper addresses these challenges from a technological and scientific standpoint. The technological contribution includes the mechanical design of iRonCub-Mk1, a jet-powered humanoid robot, optimized for jet engine integration, and hardware modifications for wind tunnel experiments on humanoid robots for precise aerodynamic forces and surface pressure measurements. The scientific contribution offers a comprehensive approach to model and control aerodynamic forces using classical and learning techniques. Computational Fluid Dynamics (CFD) simulations calculate aerodynamic forces, validated through wind tunnel experiments on iRonCub-Mk1. An automated CFD framework expands the aerodynamic dataset, enabling the training of a Deep Neural Network and a linear regression model. These models are integrated into a simulator for designing aerodynamic-aware controllers, validated through flight simulations and balancing experiments on the iRonCub-Mk1 physical prototype.

[LG-128] Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework ICML2025

链接: https://arxiv.org/abs/2506.00302
作者: Can Polat,Hasan Kurban,Erchin Serpedin,Mustafa Kurban
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Submitted to ICML 2025 Workshop on DataWorld

点击查看摘要

Abstract:Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at this https URL.

[LG-129] Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms

链接: https://arxiv.org/abs/2506.00299
作者: Purvish Jajal,Nick John Eliopoulos,Benjamin Shiue-Hal Chou,George K. Thiruvathukal,James C. Davis,Yung-Hsiang Lu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models are state-of-the-art generative models in various domains, yet their samples often fail to satisfy downstream objectives such as safety constraints or domain-specific validity. Existing techniques for alignment require gradients, internal model access, or large computational budgets. We introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black-boxes and search their latent space to maximize alignment objectives. Our method enables efficient inference-time alignment for both differentiable and non-differentiable alignment objectives across a range of diffusion models. On the DrawBench and Open Image Preferences benchmark, our EA methods outperform state-of-the-art gradient-based and gradient-free inference-time methods. In terms of memory consumption, we require 55% to 76% lower GPU memory than gradient-based methods. In terms of running-time, we are 72% to 80% faster than gradient-based methods. We achieve higher alignment scores over 50 optimization steps on Open Image Preferences than gradient-based and gradient-free methods.

[LG-130] Performance Analysis of Convolutional Neural Network By Applying Unconstrained Binary Quadratic Programming

链接: https://arxiv.org/abs/2506.00247
作者: Aasish Kumar Sharma,Sanjeeb Prashad Pandey,Julian M. Kunkel
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: 11 pages, 22 figures, accepted in IEEE COMPSAC 2025 Conference. Preprint before peer review

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are pivotal in computer vision and Big Data analytics but demand significant computational resources when trained on large-scale datasets. Conventional training via back-propagation (BP) with losses like Mean Squared Error or Cross-Entropy often requires extensive iterations and may converge sub-optimally. Quantum computing offers a promising alternative by leveraging superposition, tunneling, and entanglement to search complex optimization landscapes more efficiently. In this work, we propose a hybrid optimization method that combines an Unconstrained Binary Quadratic Programming (UBQP) formulation with Stochastic Gradient Descent (SGD) to accelerate CNN training. Evaluated on the MNIST dataset, our approach achieves a 10–15% accuracy improvement over a standard BP-CNN baseline while maintaining similar execution times. These results illustrate the potential of hybrid quantum-classical techniques in High-Performance Computing (HPC) environments for Big Data and Deep Learning. Fully realizing these benefits, however, requires a careful alignment of algorithmic structures with underlying quantum mechanisms.

[LG-131] DeGLIF for Label Noise Robust Node Classification using GNNs

链接: https://arxiv.org/abs/2506.00244
作者: Pintu Kumar,Nandyala Hemachandra
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Noisy labelled datasets are generally inexpensive compared to clean labelled datasets, and the same is true for graph data. In this paper, we propose a denoising technique DeGLIF: Denoising Graph Data using Leave-One-Out Influence Function. DeGLIF uses a small set of clean data and the leave-one- out influence function to make label noise robust node-level prediction on graph data. Leave-one-out influence function approximates the change in the model parameters if a training point is removed from the training dataset. Recent advances propose a way to calculate the leave-one-out influence function for Graph Neural Networks (GNNs). We extend that recent work to estimate the change in validation loss, if a training node is removed from the training dataset. We use this estimate and a new theoretically motivated relabelling function to denoise the training dataset. We propose two DeGLIF variants to identify noisy nodes. Both these variants do not require any information about the noise model or the noise level in the dataset; DeGLIF also does not estimate these quantities. For one of these variants, we prove that the noisy points detected can indeed increase risk. We carry out detailed computational experiments on different datasets to show the effectiveness of DeGLIF. It achieves better accuracy than other baseline algorithms

[LG-132] Sorrel: A simple and flexible framework for multi-agent reinforcement learning

链接: https://arxiv.org/abs/2506.00228
作者: Rebekah A. Gelpí,Yibing Ju,Ethan C. Jackson,Yikai Tang,Shon Verch,Claas Voelcker,William A. Cunningham
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Sorrel (this https URL), a simple Python interface for generating and testing new multi-agent reinforcement learning environments. This interface places a high degree of emphasis on simplicity and accessibility, and uses a more psychologically intuitive structure for the basic agent-environment loop, making it a useful tool for social scientists to investigate how learning and social interaction leads to the development and change of group dynamics. In this short paper, we outline the basic design philosophy and features of Sorrel.

[LG-133] Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models

链接: https://arxiv.org/abs/2506.00209
作者: Liwen Sun,Hao-Ren Yao,Gary Gao,Ophir Frieder,Chenyan Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieved strong efficacy (60% sensitivity) with low risk (99% specificity and Negative Predictive Value), outperforming feature-based tree models as well as general and medical large language models by large margins. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.

[LG-134] Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective ICML2025

链接: https://arxiv.org/abs/2506.00205
作者: Junze Deng,Qinhang Wu,Peizhong Ju,Sen Lin,Yingbin Liang,Ness Shroff
类目: Machine Learning (cs.LG)
*备注: accepted to ICML 2025

点击查看摘要

Abstract:Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.

[LG-135] When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPT s

链接: https://arxiv.org/abs/2506.00197
作者: Xinyue Shen,Yun Shen,Michael Backes,Yang Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge files have been widely used in large language model (LLM) agents, such as GPTs, to improve response quality. However, concerns about the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through the analysis of 651,022 GPT metadata, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built-in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. In the end, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.

[LG-136] Cluster-Aware Causal Mixer for Online Anomaly Detection in Multivariate Time Series

链接: https://arxiv.org/abs/2506.00188
作者: Md Mahmuddun Nabi Murad,Yasin Yilmaz
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Early and accurate detection of anomalies in time series data is critical, given the significant risks associated with false or missed detections. While MLP-based mixer models have shown promise in time series analysis, they lack a causality mechanism to preserve temporal dependencies inherent in the system. Moreover, real-world multivariate time series often contain numerous channels with diverse inter-channel correlations. A single embedding mechanism for all channels does not effectively capture these complex relationships. To address these challenges, we propose a novel cluster-aware causal mixer to effectively detect anomalies in multivariate time series. Our model groups channels into clusters based on their correlations, with each cluster processed through a dedicated embedding layer. In addition, we introduce a causal mixer in our model, which mixes the information while maintaining causality. Furthermore, we present an anomaly detection framework that accumulates the anomaly evidence over time to prevent false positives due to nominal outliers. Our proposed model operates in an online fashion, making it suitable for real-time time-series anomaly detection tasks. Experimental evaluations across six public benchmark datasets demonstrate that our model consistently achieves superior F1 scores.

[LG-137] On the Interaction of Noise Compression Role and Adaptivity under (L_0 L_1)-Smoothness: An SDE-based Approach

链接: https://arxiv.org/abs/2506.00181
作者: Enea Monzio Compagnoni,Rustem Islamov,Antonio Orvieto,Eduard Gorbunov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This manuscript is a work in progress: We welcome comments

点击查看摘要

Abstract:Using stochastic differential equation (SDE) approximations, we study the dynamics of Distributed SGD, Distributed Compressed SGD, and Distributed SignSGD under (L_0,L_1) -smoothness and flexible noise assumptions. Our analysis provides insights – which we validate through simulation – into the intricate interactions between batch noise, stochastic gradient compression, and adaptivity in this modern theoretical setup. For instance, we show that \textitadaptive methods such as Distributed SignSGD can successfully converge under standard assumptions on the learning rate scheduler, even under heavy-tailed noise. On the contrary, Distributed (Compressed) SGD with pre-scheduled decaying learning rate fails to achieve convergence, unless such a schedule also accounts for an inverse dependency on the gradient norm – de facto falling back into an adaptive method.

[LG-138] Empirical Validation of the Independent Chip Model

链接: https://arxiv.org/abs/2506.00180
作者: Juho Kim
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The independent chip model (ICM) forms a cornerstone of all modern poker tournament strategy. However, despite its prominence, the ICM’s performance in the real world has not been sufficiently scrutinized, especially at a large scale. In this paper, we introduce our new dataset of poker tournaments, consisting of results of over ten thousand events. Then, using this dataset, we perform two experiments as part of a large-scale empirical validation of the ICM. First, we verify that the ICM performs more accurately than a baseline we propose. Second, we obtain empirical evidence of the ICM underestimating the performances of players with larger stacks while overestimating those who are short-stacked. Our contributions may be useful to future researchers developing new algorithms for estimating a player’s value in poker tournaments.

[LG-139] Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

链接: https://arxiv.org/abs/2506.00172
作者: Kaivalya Hariharan,Uzay Girit,Atticus Wang,Jacob Andreas
类目: Machine Learning (cs.LG)
*备注: 21 pages, 14 figures

点击查看摘要

Abstract:Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models’ success rates ranging from 55% on the easiest tasks down to 0% on the hardest.

[LG-140] Randomized Dimensionality Reduction for Euclidean Maximization and Diversity Measures ICML2025

链接: https://arxiv.org/abs/2506.00165
作者: Jie Gao,Rajesh Jayaram,Benedikt Kolbe,Shay Sapir,Chris Schwiegelshohn,Sandeep Silwal,Erik Waingarten
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Randomized dimensionality reduction is a widely-used algorithmic technique for speeding up large-scale Euclidean optimization problems. In this paper, we study dimension reduction for a variety of maximization problems, including max-matching, max-spanning tree, max TSP, as well as various measures for dataset diversity. For these problems, we show that the effect of dimension reduction is intimately tied to the \emphdoubling dimension \lambda_X of the underlying dataset X – a quantity measuring intrinsic dimensionality of point sets. Specifically, we prove that a target dimension of O(\lambda_X) suffices to approximately preserve the value of any near-optimal solution,which we also show is necessary for some of these problems. This is in contrast to classical dimension reduction results, whose dependence increases with the dataset size |X| . We also provide empirical results validating the quality of solutions found in the projected space, as well as speedups due to dimensionality reduction.

[LG-141] Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

链接: https://arxiv.org/abs/2506.00158
作者: Eli Chien,Wei-Ning Chen,Pan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zeroth-order optimization has emerged as a promising approach for fine-tuning large language models on domain-specific data, particularly under differential privacy (DP) and memory constraints. While first-order methods have been extensively studied from a privacy perspective, the privacy analysis and algorithmic design for zeroth-order methods remain significantly underexplored. A critical open question concerns hidden-state DP analysis: although convergent privacy bounds are known for first-order methods, it has remained unclear whether similar guarantees can be established for zeroth-order methods. In this work, we provide an affirmative answer by proving a convergent DP bound for zeroth-order optimization. Our analysis generalizes the celebrated privacy amplification-by-iteration framework to the setting of smooth loss functions in zeroth-order optimization. Furthermore, it induces better DP zeroth-order algorithmic designs that are previously unknown to the literature.

[LG-142] Aligning Language Models with Observational Data: Opportunities and Risks from a Causal Perspective

链接: https://arxiv.org/abs/2506.00152
作者: Erfan Loghmani
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注: 10+12 pages, 8 figures

点击查看摘要

Abstract:Large language models are being widely used across industries to generate content that contributes directly to key performance metrics, such as conversion rates. Pretrained models, however, often fall short when it comes to aligning with human preferences or optimizing for business objectives. As a result, fine-tuning with good-quality labeled data is essential to guide models to generate content that achieves better results. Controlled experiments, like A/B tests, can provide such data, but they are often expensive and come with significant engineering and logistical challenges. Meanwhile, companies have access to a vast amount of historical (observational) data that remains underutilized. In this work, we study the challenges and opportunities of fine-tuning LLMs using observational data. We show that while observational outcomes can provide valuable supervision, directly fine-tuning models on such data can lead them to learn spurious correlations. We present empirical evidence of this issue using various real-world datasets and propose DeconfoundLM, a method that explicitly removes the effect of known confounders from reward signals. Using simulation experiments, we demonstrate that DeconfoundLM improves the recovery of causal relationships and mitigates failure modes found in fine-tuning methods that ignore or naively incorporate confounding variables. Our findings highlight that while observational data presents risks, with the right causal corrections, it can be a powerful source of signal for LLM alignment. Please refer to the project page for code and related resources.

[LG-143] On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

链接: https://arxiv.org/abs/2506.00136
作者: Magdalena Proszewska,Nikolay Malkin,N. Siddharth
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 tables, 15 figures

点击查看摘要

Abstract:Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models – those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent variable choice, conditioning method, etc.) in the DA framework – leading to a model we term DMZ – allow us to obtain the best of both worlds: effective representations as evaluated on downstream tasks, including domain transfer, as well as more efficient modelling and generation with fewer denoising steps compared to standard DMs.

[LG-144] radeoffs between Mistakes and ERM Oracle Calls in Online and Transductive Online Learning

链接: https://arxiv.org/abs/2506.00135
作者: Idan Attias,Steve Hanneke,Arvind Ramaswami
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study online and transductive online learning when the learner interacts with the concept class only via Empirical Risk Minimization (ERM) or weak consistency oracles on arbitrary instance subsets. This contrasts with standard online models, where the learner knows the entire class. The ERM oracle returns a hypothesis minimizing loss on a given subset, while the weak consistency oracle returns a binary signal indicating whether the subset is realizable by some concept. The learner is evaluated by the number of mistakes and oracle calls. In the standard online setting with ERM access, we prove tight lower bounds in both realizable and agnostic cases: \Omega(2^d_VC) mistakes and \Omega(\sqrtT 2^d_LD) regret, where T is the number of timesteps and d_LD is the Littlestone dimension. We further show that existing online learning results with ERM access carry over to the weak consistency setting, incurring an additional O(T) in oracle calls. We then consider the transductive online model, where the instance sequence is known but labels are revealed sequentially. For general Littlestone classes, we show that optimal realizable and agnostic mistake bounds can be achieved using O(T^d_VC+1) weak consistency oracle calls. On the negative side, we show that limiting the learner to \Omega(T) weak consistency queries is necessary for transductive online learnability, and that restricting the learner to \Omega(T) ERM queries is necessary to avoid exponential dependence on the Littlestone dimension. Finally, for certain concept classes, we reduce oracle calls via randomized algorithms while maintaining similar mistake bounds. In particular, for Thresholds on an unknown ordering, O(\log T) ERM queries suffice; for k -Intervals, O(T^3 2^2k) weak consistency queries suffice.

[LG-145] Applying Large Language Models to Issue Classification: Revisiting with Extended Data and New Models

链接: https://arxiv.org/abs/2506.00128
作者: Gabriel Aracena,Kyle Luster,Fabio Santos,Igor Steinmacher,Marco A. Gerosa
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 35 pages, 2 figures, 9 tables, Pre-print for Science of Computer Programming

点击查看摘要

Abstract:Effective prioritization of issue reports in software engineering helps to optimize resource allocation and information recovery. However, manual issue classification is laborious and lacks scalability. As an alternative, many open source software (OSS) projects employ automated processes for this task, yet this method often relies on large datasets for adequate training. Traditionally, machine learning techniques have been used for issue classification. More recently, large language models (LLMs) have emerged as powerful tools for addressing a range of software engineering challenges, including code and test generation, mapping new requirements to legacy software endpoints, and conducting code reviews. The following research investigates an automated approach to issue classification based on LLMs. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports, mitigating the necessity for extensive training data while also maintaining reliability in classification. In our research, we developed an LLM-based approach for accurately labeling issues by selecting two of the most prominent large language models. We then compared their performance across multiple datasets. Our findings show that GPT-4o achieved the best results in classifying issues from the NLBSE 2024 competition. Moreover, GPT-4o outperformed DeepSeek R1, achieving an F1 score 20% higher when both models were trained on the same dataset from the NLBSE 2023 competition, which was ten times larger than the NLBSE 2024 dataset. The fine-tuned GPT-4o model attained an average F1 score of 80.7%, while the fine-tuned DeepSeek R1 model achieved 59.33%. Increasing the dataset size did not improve the F1 score, reducing the dependence on massive datasets for building an efficient solution to issue classification.

[LG-146] Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives – A Survey

链接: https://arxiv.org/abs/2506.00098
作者: Edgar Welte,Rania Rayyes
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 21 pages, 3 figures

点击查看摘要

Abstract:Dexterous manipulation is a crucial yet highly complex challenge in humanoid robotics, demanding precise, adaptable, and sample-efficient learning methods. As humanoid robots are usually designed to operate in human-centric environments and interact with everyday objects, mastering dexterous manipulation is critical for real-world deployment. Traditional approaches, such as reinforcement learning and imitation learning, have made significant strides, but they often struggle due to the unique challenges of real-world dexterous manipulation, including high-dimensional control, limited training data, and covariate shift. This survey provides a comprehensive overview of these challenges and reviews existing learning-based methods for dexterous manipulation, spanning imitation learning, reinforcement learning, and hybrid approaches. A promising yet underexplored direction is interactive imitation learning, where human feedback actively refines a robot’s behavior during training. While interactive imitation learning has shown success in various robotic tasks, its application to dexterous manipulation remains limited. To address this gap, we examine current interactive imitation learning techniques applied to other robotic tasks and discuss how these methods can be adapted to enhance dexterous manipulation. By synthesizing state-of-the-art research, this paper highlights key challenges, identifies gaps in current methodologies, and outlines potential directions for leveraging interactive imitation learning to improve dexterous robotic skills.

[LG-147] Hierarchical Bayesian Knowledge Tracing in Undergraduate Engineering Education

链接: https://arxiv.org/abs/2506.00057
作者: Yiwei Sun
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 6 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Educators teaching entry-level university engineering modules face the challenge of identifying which topics students find most difficult and how to support diverse student needs effectively. This study demonstrates a rigorous yet interpretable statistical approach – hierarchical Bayesian modeling – that leverages detailed student response data to quantify both skill difficulty and individual student abilities. Using a large-scale dataset from an undergraduate Statics course, we identified clear patterns of skill mastery and uncovered distinct student subgroups based on their learning trajectories. Our analysis reveals that certain concepts consistently present challenges, requiring targeted instructional support, while others are readily mastered and may benefit from enrichment activities. Importantly, the hierarchical Bayesian method provides educators with intuitive, reliable metrics without sacrificing predictive accuracy. This approach allows for data-informed decisions, enabling personalized teaching strategies to improve student engagement and success. By combining robust statistical methods with clear interpretability, this study equips educators with actionable insights to better support diverse learner populations.

[LG-148] Graph Contrastive Learning for Optimizing Sparse Data in Recommender Systems with LightGCL

链接: https://arxiv.org/abs/2506.00048
作者: Aravinda Jatavallabha,Prabhanjan Bharadwaj,Ashish Chander
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Term Paper, Machine Learning with Graphs, North Carolina State University

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are powerful tools for recommendation systems, but they often struggle under data sparsity and noise. To address these issues, we implemented LightGCL, a graph contrastive learning model that uses Singular Value Decomposition (SVD) for robust graph augmentation, preserving semantic integrity without relying on stochastic or heuristic perturbations. LightGCL enables structural refinement and captures global collaborative signals, achieving significant gains over state-of-the-art models across benchmark datasets. Our experiments also demonstrate improved fairness and resilience to popularity bias, making it well-suited for real-world recommender systems.

[LG-149] Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval

链接: https://arxiv.org/abs/2506.00041
作者: Seongwan Park,Taeklim Kim,Youngjoong Ko
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.

[LG-150] AbsoluteNet: A Deep Learning Neural Network to Classify Cerebral Hemodynamic Responses of Auditory Processing

链接: https://arxiv.org/abs/2506.00039
作者: Behtom Adeli,John Mclinden,Pankaj Pandey,Ming Shao,Yalda Shahriari
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In recent years, deep learning (DL) approaches have demonstrated promising results in decoding hemodynamic responses captured by functional near-infrared spectroscopy (fNIRS), particularly in the context of brain-computer interface (BCI) applications. This work introduces AbsoluteNet, a novel deep learning architecture designed to classify auditory event-related responses recorded using fNIRS. The proposed network is built upon principles of spatio-temporal convolution and customized activation functions. Our model was compared against several models, namely fNIRSNET, MDNN, DeepConvNet, and ShallowConvNet. The results showed that AbsoluteNet outperforms existing models, reaching 87.0% accuracy, 84.8% sensitivity, and 89.2% specificity in binary classification, surpassing fNIRSNET, the second-best model, by 3.8% in accuracy. These findings underscore the effectiveness of our proposed deep learning model in decoding hemodynamic responses related to auditory processing and highlight the importance of spatio-temporal feature aggregation and customized activation functions to better fit fNIRS dynamics.

[LG-151] Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models

链接: https://arxiv.org/abs/2506.00037
作者: Dipam Goswami,Liying Wang,Bartłomiej Twardowski,Joost van de Weijer
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at CoLLAs 2025

点击查看摘要

Abstract:Text embedding models enable semantic search, powering several NLP applications like Retrieval Augmented Generation by efficient information retrieval (IR). However, text embedding models are commonly studied in scenarios where the training data is static, thus limiting its applications to dynamic scenarios where new training data emerges over time. IR methods generally encode a huge corpus of documents to low-dimensional embeddings and store them in a database index. During retrieval, a semantic search over the corpus is performed and the document whose embedding is most similar to the query embedding is returned. When updating an embedding model with new training data, using the already indexed corpus is suboptimal due to the non-compatibility issue, since the model which was used to obtain the embeddings of the corpus has changed. While re-indexing of old corpus documents using the updated model enables compatibility, it requires much higher computation and time. Thus, it is critical to study how the already indexed corpus can still be effectively used without the need of re-indexing. In this work, we establish a continual learning benchmark with large-scale datasets and continually train dense retrieval embedding models on query-document pairs from new datasets in each task and observe forgetting on old tasks due to significant drift of embeddings. We employ embedding distillation on both query and document embeddings to maintain stability and propose a novel query drift compensation method during retrieval to project new model query embeddings to the old embedding space. This enables compatibility with previously indexed corpus embeddings extracted using the old model and thus reduces the forgetting. We show that the proposed method significantly improves performance without any re-indexing. Code is available at this https URL.

[LG-152] Modality Equilibrium Matters: Minor-Modality-Aware Adaptive Alternating for Cross-Modal Memory Enhancement

链接: https://arxiv.org/abs/2506.00030
作者: Xiang Shi,Rui Zhang,Jiawei Liu,Yinpeng Liu,Qikai Cheng,Wei Lu
类目: Machine Learning (cs.LG)
*备注: work in progress

点击查看摘要

Abstract:Multimodal fusion is susceptible to modality imbalance, where dominant modalities overshadow weak ones, easily leading to biased learning and suboptimal fusion, especially for incomplete modality conditions. To address this problem, we propose a Shapley-guided alternating training framework that adaptively prioritizes minor modalities to balance and thus enhance the fusion. Our method leverages Shapley Value-based scheduling to improve the training sequence adaptively, ensuring that under-optimized modalities receive sufficient learning. Additionally, we introduce the memory module to refine and inherit modality-specific representations with a cross-modal mapping mechanism to align features at both the feature and sample levels. To further validate the adaptability of the proposed approach, the encoder module empirically adopts both conventional and LLM-based backbones. With building up a novel multimodal equilibrium metric, namely, equilibrium deviation metric (EDM), we evaluate the performance in both balance and accuracy across four multimodal benchmark datasets, where our method achieves state-of-the-art (SOTA) results. Meanwhile, robustness analysis under missing modalities highlights its strong generalization capabilities. Accordingly, our findings reveal the untapped potential of alternating training, demonstrating that strategic modality prioritization fundamentally balances and promotes multimodal learning, offering a new paradigm for optimizing multimodal training dynamics.

[LG-153] AI Accelerators for Large Language Model In-ference: Architecture Analysis and Scaling Strategies

链接: https://arxiv.org/abs/2506.00008
作者: Amit Sharma
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of large-language models (LLMs) is driving a new wave of specialized hardware for inference. This paper presents the first workload-centric, cross-architectural performance study of commercial AI accelerators, spanning GPU-based chips, hybrid packages, and wafer-scale engines. We compare memory hierarchies, compute fabrics, and on-chip interconnects, and observe up to 3.7x performance variation across architectures as batch size and sequence length change. Four scaling techniques for trillion-parameter models are examined; expert parallelism offers an 8.4x parameter-to-compute advantage but incurs 2.1x higher latency variance than tensor parallelism. These findings provide quantitative guidance for matching workloads to accelerators and reveal architectural gaps that next-generation designs must address.

[LG-154] Emerging ML-AI Techniques for Analog and RF EDA

链接: https://arxiv.org/abs/2506.00007
作者: Zhengfeng Wu,Ziyi Chen,Nnaemeka Achebe,Vaibhav V. Rao,Pratik Shrestha,Ioannis Savidis
类目: Hardware Architecture (cs.AR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:This survey explores the integration of machine learning (ML) into EDA workflows for analog and RF circuits, addressing challenges unique to analog design, which include complex constraints, nonlinear design spaces, and high computational costs. State-of-the-art learning and optimization techniques are reviewed for circuit tasks such as constraint formulation, topology generation, device modeling, sizing, placement, and routing. The survey highlights the capability of ML to enhance automation, improve design quality, and reduce time-to-market while meeting the target specifications of an analog or RF circuit. Emerging trends and cross-cutting challenges, including robustness to variations and considerations of interconnect parasitics, are also discussed.

[LG-155] Stock Market Telepathy: Graph Neural Networks Predicting the Secret Conversations between MINT and G7 Countries

链接: https://arxiv.org/abs/2506.01945
作者: Nurbanu Bursa
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Emerging economies, particularly the MINT countries (Mexico, Indonesia, Nigeria, and Türkiye), are gaining influence in global stock markets, although they remain susceptible to the economic conditions of developed countries like the G7 (Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States). This interconnectedness and sensitivity of financial markets make understanding these relationships crucial for investors and policymakers to predict stock price movements accurately. To this end, we examined the main stock market indices of G7 and MINT countries from 2012 to 2024, using a recent graph neural network (GNN) algorithm called multivariate time series forecasting with graph neural network (MTGNN). This method allows for considering complex spatio-temporal connections in multivariate time series. In the implementations, MTGNN revealed that the US and Canada are the most influential G7 countries regarding stock indices in the forecasting process, and Indonesia and Türkiye are the most influential MINT countries. Additionally, our results showed that MTGNN outperformed traditional methods in forecasting the prices of stock market indices for MINT and G7 countries. Consequently, the study offers valuable insights into economic blocks’ markets and presents a compelling empirical approach to analyzing global stock market dynamics using MTGNN.

[LG-156] Machine-Learned Sampling of Conditioned Path Measures

链接: https://arxiv.org/abs/2506.01904
作者: Qijia Jiang,Reuben Cohn-Gordon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We propose algorithms for sampling from posterior path measures P(C([0, T], \mathbbR^d)) under a general prior process. This leverages ideas from (1) controlled equilibrium dynamics, which gradually transport between two path measures, and (2) optimization in \infty -dimensional probability space endowed with a Wasserstein metric, which can be used to evolve a density curve under the specified likelihood. The resulting algorithms are theoretically grounded and can be integrated seamlessly with neural networks for learning the target trajectory ensembles, without access to data.

[LG-157] Probing Quantum Spin Systems with Kolmogorov-Arnold Neural Network Quantum States

链接: https://arxiv.org/abs/2506.01891
作者: Mahmud Ashraf Shamim,Eric Reinhardt,Talal Ahmed Chowdhury,Sergei Gleyzer,Paulo T Araujo
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 16 pages, 13 figures

点击查看摘要

Abstract:Neural Quantum States (NQS) are a class of variational wave functions parametrized by neural networks (NNs) to study quantum many-body systems. In this work, we propose SineKAN, the NQS ansatz based on Kolmogorov-Arnold Networks (KANs), to represent quantum mechanical wave functions as nested univariate functions. We show that \sk wavefunction with learnable sinusoidal activation functions can capture the ground state energies, fidelities and various correlation functions of the 1D Transverse-Field Ising model, Anisotropic Heisenberg model, and Antiferromagnetic J_1-J_2 model with different chain lengths. In our study of the J_1-J_2 model with L=100 sites, we find that the SineKAN model outperforms several previously explored neural quantum state ansätze, including Restricted Boltzmann Machines (RBMs), Long Short-Term Memory models (LSTMs), and Feed-Forward Neural Networks (FFCN), when compared to the results obtained from the Density Matrix Renormalization Group (DMRG) algorithm.

[LG-158] Learning thermodynamic master equations for open quantum systems

链接: https://arxiv.org/abs/2506.01882
作者: Peter Sentz,Stanley Nicholson,Yujin Cho,Sohail Reddy,Brendan Keith,Stefanie Günther
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:The characterization of Hamiltonians and other components of open quantum dynamical systems plays a crucial role in quantum computing and other applications. Scientific machine learning techniques have been applied to this problem in a variety of ways, including by modeling with deep neural networks. However, the majority of mathematical models describing open quantum systems are linear, and the natural nonlinearities in learnable models have not been incorporated using physical principles. We present a data-driven model for open quantum systems that includes learnable, thermodynamically consistent terms. The trained model is interpretable, as it directly estimates the system Hamiltonian and linear components of coupling to the environment. We validate the model on synthetic two and three-level data, as well as experimental two-level data collected from a quantum device at Lawrence Livermore National Laboratory.

[LG-159] On-device Streaming Discrete Speech Units INTERSPEECH2025

链接: https://arxiv.org/abs/2506.01845
作者: Kwanghee Choi,Masao Someki,Emma Strubell,Shinji Watanabe
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to Interspeech 2025, source code at this https URL

点击查看摘要

Abstract:Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.

[LG-160] An adaptive data sampling strategy for stabilizing dynamical systems via controller inference

链接: https://arxiv.org/abs/2506.01816
作者: Steffen W. R. Werner,Benjamin Peherstorfer
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注: 27 pages, 9 figures

点击查看摘要

Abstract:Learning stabilizing controllers from data is an important task in engineering applications; however, collecting informative data is challenging because unstable systems often lead to rapidly growing or erratic trajectories. In this work, we propose an adaptive sampling scheme that generates data while simultaneously stabilizing the system to avoid instabilities during the data collection. Under mild assumptions, the approach provably generates data sets that are informative for stabilization and have minimal size. The numerical experiments demonstrate that controller inference with the novel adaptive sampling approach learns controllers with up to one order of magnitude fewer data samples than unguided data generation. The results show that the proposed approach opens the door to stabilizing systems in edge cases and limit states where instabilities often occur and data collection is inherently difficult.

[LG-161] Signature Maximum Mean Discrepancy Two-Sample Statistical Tests

链接: https://arxiv.org/abs/2506.01718
作者: Andrew Alden,Blanka Horvath,Zacharia Issa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 45 pages, 19 figures

点击查看摘要

Abstract:Maximum Mean Discrepancy (MMD) is a widely used concept in machine learning research which has gained popularity in recent years as a highly effective tool for comparing (finite-dimensional) distributions. Since it is designed as a kernel-based method, the MMD can be extended to path space valued distributions using the signature kernel. The resulting signature MMD (sig-MMD) can be used to define a metric between distributions on path space. Similarly to the original use case of the MMD as a test statistic within a two-sample testing framework, the sig-MMD can be applied to determine if two sets of paths are drawn from the same stochastic process. This work is dedicated to understanding the possibilities and challenges associated with applying the sig-MMD as a statistical tool in practice. We introduce and explain the sig-MMD, and provide easily accessible and verifiable examples for its practical use. We present examples that can lead to Type 2 errors in the hypothesis test, falsely indicating that samples have been drawn from the same underlying process (which generally occurs in a limited data setting). We then present techniques to mitigate the occurrence of this type of error.

[LG-162] From Turbulence to Tranquility: AI-Driven Low-Altitude Network

链接: https://arxiv.org/abs/2506.01378
作者: Kürşat Tekbıyık,Amir Hossein Fahim Raouf,İsmail Güvenç,Mingzhe Chen,Güneş Karabulut Kurt,Antoine Lesage-Landry
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low Altitude Economy (LAE) networks own transformative potential in urban mobility, emergency response, and aerial logistics. However, these networks face significant challenges in spectrum management, interference mitigation, and real-time coordination across dynamic and resource-constrained environments. After addressing these challenges, this study explores three core elements for enabling intelligent LAE networks as follows machine learning-based spectrum sensing and coexistence, artificial intelligence (AI)-optimized resource allocation and trajectory planning, and testbed-driven validation and standardization. We highlight how federated and reinforcement learning techniques support decentralized, adaptive decision-making under mobility and energy constraints. In addition, we discuss the role of real-world platforms such as AERPAW in bridging the gap between simulation and deployment and enabling iterative system refinement under realistic conditions. This study aims to provide a forward-looking roadmap toward developing efficient and interoperable AI-driven LAE ecosystems.

[LG-163] Near-Optimal Clustering in Mixture of Markov Chains

链接: https://arxiv.org/abs/2506.01324
作者: Junghyun Lee,Yassir Jedra,Alexandre Proutière,Se-Young Yun
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)
*备注: 36 pages

点击查看摘要

Abstract:We study the problem of clustering T trajectories of length H , each generated by one of K unknown ergodic Markov chains over a finite state space of size S . The goal is to accurately group trajectories according to their underlying generative model. We begin by deriving an instance-dependent, high-probability lower bound on the clustering error rate, governed by the weighted KL divergence between the transition kernels of the chains. We then present a novel two-stage clustering algorithm. In Stage~I, we apply spectral clustering using a new injective Euclidean embedding for ergodic Markov chains – a contribution of independent interest that enables sharp concentration results. Stage~II refines the initial clusters via a single step of likelihood-based reassignment. Our method achieves a near-optimal clustering error with high probability, under the conditions H = \tilde\Omega(\gamma_\mathrmps^-1 (S^2 \vee \pi_\min^-1)) and TH = \tilde\Omega(\gamma_\mathrmps^-1 S^2 ) , where \pi_\min is the minimum stationary probability of a state across the K chains and \gamma_\mathrmps is the minimum pseudo-spectral gap. These requirements provide significant improvements, if not at least comparable, to the state-of-the-art guarantee (Kausik et al., 2023), and moreover, our algorithm offers a key practical advantage: unlike existing approach, it requires no prior knowledge of model-specific quantities (e.g., separation between kernels or visitation probabilities). We conclude by discussing the inherent gap between our upper and lower bounds, providing insights into the unique structure of this clustering problem.

[LG-164] Adversarial learning for nonparametric regression: Minimax rate and adaptive estimation

链接: https://arxiv.org/abs/2506.01267
作者: Jingfu Peng,Yuhong Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Despite tremendous advancements of machine learning models and algorithms in various application domains, they are known to be vulnerable to subtle, natural or intentionally crafted perturbations in future input data, known as adversarial attacks. While numerous adversarial learning methods have been proposed, fundamental questions about their statistical optimality in robust loss remain largely unanswered. In particular, the minimax rate of convergence and the construction of rate-optimal estimators under future X -attacks are yet to be worked out. In this paper, we address this issue in the context of nonparametric regression, under suitable assumptions on the smoothness of the regression function and the geometric structure of the input perturbation set. We first establish the minimax rate of convergence under adversarial L_q -risks with 1 \leq q \leq \infty and propose a piecewise local polynomial estimator that achieves the minimax optimality. The established minimax rate elucidates how the smoothness level and perturbation magnitude affect the fundamental limit of adversarial learning under future X -attacks. Furthermore, we construct a data-driven adaptive estimator that is shown to achieve, within a logarithmic factor, the optimal rate across a broad scale of nonparametric and adversarial classes. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME) Cite as: arXiv:2506.01267 [stat.ML] (or arXiv:2506.01267v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2506.01267 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-165] Flexible Mixed Precision Quantization for Learned Image Compression

链接: https://arxiv.org/abs/2506.01221
作者: Md Adnan Faisal Hossain,Zhihao Duan,Fengqing Zhu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite its improvements in coding performance compared to traditional codecs, Learned Image Compression (LIC) suffers from large computational costs for storage and deployment. Model quantization offers an effective solution to reduce the computational complexity of LIC models. However, most existing works perform fixed-precision quantization which suffers from sub-optimal utilization of resources due to the varying sensitivity to quantization of different layers of a neural network. In this paper, we propose a Flexible Mixed Precision Quantization (FMPQ) method that assigns different bit-widths to different layers of the quantized network using the fractional change in rate-distortion loss as the bit-assignment criterion. We also introduce an adaptive search algorithm which reduces the time-complexity of searching for the desired distribution of quantization bit-widths given a fixed model size. Evaluation of our method shows improved BD-Rate performance under similar model size constraints compared to other works on quantization of LIC models. We have made the source code available at this http URL.

[LG-166] Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit ell1-regularization

链接: https://arxiv.org/abs/2506.01143
作者: Hannes Matt,Dominik Stöger
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Modern machine learning models are often trained in a setting where the number of parameters exceeds the number of training samples. To understand the implicit bias of gradient descent in such overparameterized models, prior work has studied diagonal linear neural networks in the regression setting. These studies have shown that, when initialized with small weights, gradient descent tends to favor solutions with minimal \ell^1 -norm - an effect known as implicit regularization. In this paper, we investigate implicit regularization in diagonal linear neural networks of depth D\ge 2 for overparameterized linear regression problems. We focus on analyzing the approximation error between the limit point of gradient flow trajectories and the solution to the \ell^1 -minimization problem. By deriving tight upper and lower bounds on the approximation error, we precisely characterize how the approximation error depends on the scale of initialization \alpha . Our results reveal a qualitative difference between depths: for D \ge 3 , the error decreases linearly with \alpha , whereas for D=2 , it decreases at rate \alpha^1-\varrho , where the parameter \varrho \in [0,1) can be explicitly characterized. Interestingly, this parameter is closely linked to so-called null space property constants studied in the sparse recovery literature. We demonstrate the asymptotic tightness of our bounds through explicit examples. Numerical experiments corroborate our theoretical findings and suggest that deeper networks, i.e., D \ge 3 , may lead to better generalization, particularly for realistic initialization scales.

[LG-167] Generative diffusion posterior sampling for informative likelihoods

链接: https://arxiv.org/abs/2506.01083
作者: Zheng Zhao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Commemorative issue

点击查看摘要

Abstract:Sequential Monte Carlo (SMC) methods have recently shown successful results for conditional sampling of generative diffusion models. In this paper we propose a new diffusion posterior SMC sampler achieving improved statistical efficiencies, particularly under outlier conditions or highly informative likelihoods. The key idea is to construct an observation path that correlates with the diffusion model and to design the sampler to leverage this correlation for more efficient sampling. Empirical results conclude the efficiency.

[LG-168] Reconstruction and Prediction of Volterra Integral Equations Driven by Gaussian Noise

链接: https://arxiv.org/abs/2506.00933
作者: Zhihao Xu,Saisai Ding,Zhikun Zhang,Xiangjun Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integral equations are widely used in fields such as applied modeling, medical imaging, and system identification, providing a powerful framework for solving deterministic problems. While parameter identification for differential equations has been extensively studied, the focus on integral equations, particularly stochastic Volterra integral equations, remains limited. This research addresses the parameter identification problem, also known as the equation reconstruction problem, in Volterra integral equations driven by Gaussian noise. We propose an improved deep neural networks framework for estimating unknown parameters in the drift term of these equations. The network represents the primary variables and their integrals, enhancing parameter estimation accuracy by incorporating inter-output relationships into the loss function. Additionally, the framework extends beyond parameter identification to predict the system’s behavior outside the integration interval. Prediction accuracy is validated by comparing predicted and true trajectories using a 95% confidence interval. Numerical experiments demonstrate the effectiveness of the proposed deep neural networks framework in both parameter identification and prediction tasks, showing robust performance under varying noise levels and providing accurate solutions for modeling stochastic systems.

[LG-169] Projection Pursuit Density Ratio Estimation

链接: https://arxiv.org/abs/2506.00866
作者: Meilin Wang,Wei Huang,Mingming Gong,Zheng Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Density ratio estimation (DRE) is a paramount task in machine learning, for its broad applications across multiple domains, such as covariate shift adaptation, causal inference, independence tests and beyond. Parametric methods for estimating the density ratio possibly lead to biased results if models are misspecified, while conventional non-parametric methods suffer from the curse of dimensionality when the dimension of data is large. To address these challenges, in this paper, we propose a novel approach for DRE based on the projection pursuit (PP) approximation. The proposed method leverages PP to mitigate the impact of high dimensionality while retaining the model flexibility needed for the accuracy of DRE. We establish the consistency and the convergence rate for the proposed estimator. Experimental results demonstrate that our proposed method outperforms existing alternatives in various applications.

[LG-170] Generalized Linear Markov Decision Process

链接: https://arxiv.org/abs/2506.00818
作者: Sinian Zhang,Kaicheng Zhang,Ziping Xu,Tianxi Cai,Doudou Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 pages, 9 figures

点击查看摘要

Abstract:The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption-that both transition dynamics and reward functions are linear in the same feature space-limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structures. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework-an extension of the linear MDP framework-that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards and develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI) that utilizes both labeled and unlabeled trajectories. Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited.

[LG-171] CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer INTERSPEECH2025

链接: https://arxiv.org/abs/2506.00800
作者: Daiki Takeuchi,Binh Thien Nguyen,Masahiro Yasuda,Yasunori Ohishi,Daisuke Niizumi,Noboru Harada
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to Interspeech2025

点击查看摘要

Abstract:Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an AAC method, EnCLAP, employed discrete tokens from EnCodec as an effective input for fine-tuning a language model BART. However, EnCodec is designed to reconstruct waveforms rather than capture the semantic contexts of general sounds, which AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes ``semantic-rich and discrete’’ tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations through vector quantization. We experimentally confirmed that CLAP-ART outperforms baseline EnCLAP on two AAC benchmarks, indicating that semantic-rich discrete tokens derived from semantically rich AR are beneficial for AAC.

[LG-172] A Foundation Model for Non-Destructive Defect Identification from Vibrational Spectra

链接: https://arxiv.org/abs/2506.00725
作者: Mouyang Cheng,Chu-Liang Fu,Bowen Yu,Eunbi Rha,Abhijatmedhi Chotrattanapituk,Douglas L Abernathy,Yongqiang Cheng,Mingda Li
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Defects are ubiquitous in solids and strongly influence materials’ mechanical and functional properties. However, non-destructive characterization and quantification of defects, especially when multiple types coexist, remain a long-standing challenge. Here we introduce DefectNet, a foundation machine learning model that predicts the chemical identity and concentration of substitutional point defects with multiple coexisting elements directly from vibrational spectra, specifically phonon density-of-states (PDoS). Trained on over 16,000 simulated spectra from 2,000 semiconductors, DefectNet employs a tailored attention mechanism to identify up to six distinct defect elements at concentrations ranging from 0.2% to 25%. The model generalizes well to unseen crystals across 56 elements and can be fine-tuned on experimental data. Validation using inelastic scattering measurements of SiGe alloys and MgB _2 superconductor demonstrates its accuracy and transferability. Our work establishes vibrational spectroscopy as a viable, non-destructive probe for point defect quantification in bulk materials, and highlights the promise of foundation models in data-driven defect engineering.

[LG-173] Uncertainty-Aware Genomic Classification of Alzheimers Disease: A Transformer-Based Ensemble Approach with Monte Carlo Dropout

链接: https://arxiv.org/abs/2506.00662
作者: Taeho Jo,Eun Hye Lee,Alzheimer’s Disease Sequencing Project
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:INTRODUCTION: Alzheimer’s disease (AD) is genetically complex, complicating robust classification from genomic data. METHODS: We developed a transformer-based ensemble model (TrUE-Net) using Monte Carlo Dropout for uncertainty estimation in AD classification from whole-genome sequencing (WGS). We combined a transformer that preserves single-nucleotide polymorphism (SNP) sequence structure with a concurrent random forest using flattened genotypes. An uncertainty threshold separated samples into an uncertain (high-variance) group and a more certain (low-variance) group. RESULTS: We analyzed 1050 individuals, holding out half for testing. Overall accuracy and area under the receiver operating characteristic (ROC) curve (AUC) were 0.6514 and 0.6636, respectively. Excluding the uncertain group improved accuracy from 0.6263 to 0.7287 (10.24% increase) and F1 from 0.5843 to 0.8205 (23.62% increase). DISCUSSION: Monte Carlo Dropout-driven uncertainty helps identify ambiguous cases that may require further clinical evaluation, thus improving reliability in AD genomic classification.

[LG-174] Score Matching With Missing Data ICML2025

链接: https://arxiv.org/abs/2506.00557
作者: Josh Givens,Song Liu,Henry W J Reeve
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted for ICML 2025 Conference Proceedings (Spotlight)

点击查看摘要

Abstract:Score matching is a vital tool for learning the distribution of data with applications across many areas including diffusion processes, energy based modelling, and graphical model estimation. Despite all these applications, little work explores its use when data is incomplete. We address this by adapting score matching (and its major extensions) to work with missing data in a flexible setting where data can be partially missing over any subset of the coordinates. We provide two separate score matching variations for general use, an importance weighting (IW) approach, and a variational approach. We provide finite sample bounds for our IW approach in finite domain settings and show it to have especially strong performance in small sample lower dimensional cases. Complementing this, we show our variational approach to be strongest in more complex high-dimensional settings which we demonstrate on graphical model estimation tasks on both real and simulated data.

[LG-175] DiffPINN: Generative diffusion-initialized physics-informed neural networks for accelerating seismic wavefield representation

链接: https://arxiv.org/abs/2506.00471
作者: Shijun Cheng,Tariq Alkhalifah
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) offer a powerful framework for seismic wavefield modeling, yet they typically require time-consuming retraining when applied to different velocity models. Moreover, their training can suffer from slow convergence due to the complexity of of the wavefield solution. To address these challenges, we introduce a latent diffusion-based strategy for rapid and effective PINN initialization. First, we train multiple PINNs to represent frequency-domain scattered wavefields for various velocity models, then flatten each trained network’s parameters into a one-dimensional vector, creating a comprehensive parameter dataset. Next, we employ an autoencoder to learn latent representations of these parameter vectors, capturing essential patterns across diverse PINN’s parameters. We then train a conditional diffusion model to store the distribution of these latent vectors, with the corresponding velocity models serving as conditions. Once trained, this diffusion model can generate latent vectors corresponding to new velocity models, which are subsequently decoded by the autoencoder into complete PINN parameters. Experimental results indicate that our method significantly accelerates training and maintains high accuracy across in-distribution and out-of-distribution velocity scenarios.

[LG-176] Off-Policy Evaluation of Ranking Policies via Embedding-Space User Behavior Modeling

链接: https://arxiv.org/abs/2506.00446
作者: Tatsuki Takahashi,Chihiro Maru,Hiroko Shoji
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Off-policy evaluation (OPE) in ranking settings with large ranking action spaces, which stems from an increase in both the number of unique actions and length of the ranking, is essential for assessing new recommender policies using only logged bandit data from previous versions. To address the high variance issues associated with existing estimators, we introduce two new assumptions: no direct effect on rankings and user behavior model on ranking embedding spaces. We then propose the generalized marginalized inverse propensity score (GMIPS) estimator with statistically desirable properties compared to existing ones. Finally, we demonstrate that the GMIPS achieves the lowest MSE. Notably, among GMIPS variants, the marginalized reward interaction IPS (MRIPS) incorporates a doubly marginalized importance weight based on a cascade behavior assumption on ranking embeddings. MRIPS effectively balances the trade-off between bias and variance, even as the ranking action spaces increase and the above assumptions may not hold, as evidenced by our experiments.

[LG-177] Label-shift robust federated feature screening for high-dimensional classification

链接: https://arxiv.org/abs/2506.00379
作者: Qi Qin,Erbo Li,Xingxiang Li,Yifan Sun,Wu Wang,Chen Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 57 pages,9 tables,8 figures

点击查看摘要

Abstract:Distributed and federated learning are important tools for high-dimensional classification of large datasets. To reduce computational costs and overcome the curse of dimensionality, feature screening plays a pivotal role in eliminating irrelevant features during data preprocessing. However, data heterogeneity, particularly label shifting across different clients, presents significant challenges for feature screening. This paper introduces a general framework that unifies existing screening methods and proposes a novel utility, label-shift robust federated feature screening (LR-FFS), along with its federated estimation procedure. The framework facilitates a uniform analysis of methods and systematically characterizes their behaviors under label shift conditions. Building upon this framework, LR-FFS leverages conditional distribution functions and expectations to address label shift without adding computational burdens and remains robust against model misspecification and outliers. Additionally, the federated procedure ensures computational efficiency and privacy protection while maintaining screening effectiveness comparable to centralized processing. We also provide a false discovery rate (FDR) control method for federated feature screening. Experimental results and theoretical analyses demonstrate LR-FFS’s superior performance across diverse client environments, including those with varying class distributions, sample sizes, and missing categorical data.

[LG-178] Power-of-Two (PoT) Weights in Large Language Models (LLM s)

链接: https://arxiv.org/abs/2506.00315
作者: Mahmoud Elgenedy
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Complexity of Neural Networks is increasing rapidly due to the massive increase in model parameters. Specifically, in Large Language Models (LLMs), the number of model parameters has grown exponentially in the past few years, for example, from 1.5 billion parameters in GPT2 to 175 billion in GPT3. This raises a significant challenge for implementation, especially for Edge devices where memory and processing power are very limited. In this work, we investigate reducing LLM complexity with special type of quantization, power of two (PoT), for linear layers weights and transformer tables. PoT not only provides memory reduction but more importantly provides significant computational reduction through converting multiplication to bit shifting. We obtained preliminary results of PoT quantization on Nano-GPT implementation using Shakespeare dataset. We then extended results to 124-M GPT-2 model. The PoT quantization results are shown to be very promising with cross entropy loss degradation \approx [1.3-0.88] with number of bits range [4-6] to represent power levels.

[LG-179] SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction

链接: https://arxiv.org/abs/2506.00273
作者: Tuochao Chen,D Shin,Hakan Erdogan,Sinan Hersek
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:This paper introduces SoundSculpt, a neural network designed to extract target sound fields from ambisonic recordings. SoundSculpt employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information (e.g., target direction obtained by pointing at an immersive video) and semantic embeddings (e.g., derived from image segmentation and captioning). Trained and evaluated on synthetic and real ambisonic mixtures, SoundSculpt demonstrates superior performance compared to various signal processing baselines. Our results further reveal that while spatial conditioning alone can be effective, the combination of spatial and semantic information is beneficial in scenarios where there are secondary sound sources spatially close to the target. Additionally, we compare two different semantic embeddings derived from a text description of the target sound using text encoders.

[LG-180] Bayesian Data Sketching for Varying Coefficient Regression Models

链接: https://arxiv.org/abs/2506.00270
作者: Rajarshi Guhaniyogi,Laura Baracaldo,Sudipto Banerjee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Varying coefficient models are popular for estimating nonlinear regression functions in functional data models. Their Bayesian variants have received limited attention in large data applications, primarily due to prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We introduce Bayesian data sketching for varying coefficient models to obviate computational challenges presented by large sample sizes. To address the challenges of analyzing large data, we compress the functional response vector and predictor matrix by a random linear transformation to achieve dimension reduction and conduct inference on the compressed data. Our approach distinguishes itself from several existing methods for analyzing large functional data in that it requires neither the development of new models or algorithms, nor any specialized computational hardware while delivering fully model-based Bayesian inference. Well-established methods and algorithms for varying coefficient regression models can be applied to the compressed data.

[LG-181] How hard is learning to cut? Trade-offs and sample complexity

链接: https://arxiv.org/abs/2506.00252
作者: Sammy Khalife,Andrea Lodi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the recent years, branch-and-cut algorithms have been the target of data-driven approaches designed to enhance the decision making in different phases of the algorithm such as branching, or the choice of cutting planes (cuts). In particular, for cutting plane selection two score functions have been proposed in the literature to evaluate the quality of a cut: branch-and-cut tree size and gap closed. In this paper, we present new sample complexity lower bounds, valid for both scores. We show that for a wide family of classes \mathcalF that maps an instance to a cut, learning over an unknown distribution of the instances to minimize those scores requires at least (up to multiplicative constants) as many samples as learning from the same class function \mathcalF any generic target function (using square loss). Our results also extend to the case of learning from a restricted set of cuts, namely those from the Simplex tableau. To the best of our knowledge, these constitute the first lower bounds for the learning-to-cut framework. We compare our bounds to known upper bounds in the case of neural networks and show they are nearly tight. We illustrate our results with a graph neural network selection evaluated on set covering and facility location integer programming models and we empirically show that the gap closed score is an effective proxy to minimize the branch-and-cut tree size. Although the gap closed score has been extensively used in the integer programming literature, this is the first principled analysis discussing both scores at the same time both theoretically and computationally.

[LG-182] Riemannian Principal Component Analysis

链接: https://arxiv.org/abs/2506.00226
作者: Oldemar Rodríguez
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:This paper proposes an innovative extension of Principal Component Analysis (PCA) that transcends the traditional assumption of data lying in Euclidean space, enabling its application to data on Riemannian manifolds. The primary challenge addressed is the lack of vector space operations on such manifolds. Fletcher et al., in their work \em Principal Geodesic Analysis for the Study of Nonlinear Statistics of Shape, proposed Principal Geodesic Analysis (PGA) as a geometric approach to analyze data on Riemannian manifolds, particularly effective for structured datasets like medical images, where the manifold’s intrinsic structure is apparent. However, PGA’s applicability is limited when dealing with general datasets that lack an implicit local distance notion. In this work, we introduce a generalized framework, termed \em Riemannian Principal Component Analysis (R-PCA), to extend PGA for any data endowed with a local distance structure. Specifically, we adapt the PCA methodology to Riemannian manifolds by equipping data tables with local metrics, enabling the incorporation of manifold geometry. This framework provides a unified approach for dimensionality reduction and statistical analysis directly on manifolds, opening new possibilities for datasets with region-specific or part-specific distance notions, ensuring respect for their intrinsic geometric properties.

[LG-183] Enhancing Drug Discovery: Autoencoder-Based Latent Space Augmentation for Improved Molecular Solubility Prediction using LatMixSol

链接: https://arxiv.org/abs/2506.00223
作者: Mohammad Saleh Hasankhani
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of molecular solubility is a cornerstone of early-stage drug discovery, yet conventional machine learning models face significant challenges due to limited labeled data and the high-dimensional nature of molecular descriptors. To address these issues, we propose LatMixSol, a novel latent space augmentation framework that combines autoencoder-based feature compression with guided interpolation to enrich training data. Our approach first encodes molecular descriptors into a low-dimensional latent space using a two-layer autoencoder. Spectral clustering is then applied to group chemically similar molecules, enabling targeted MixUp-style interpolation within clusters. Synthetic samples are generated by blending latent vectors of cluster members and decoding them back to the original feature space. Evaluated on the Huuskonen solubility benchmark, LatMixSol demonstrates consistent improvements across three of four gradient-boosted regressors (CatBoost, LightGBM, HistGradientBoosting), achieving RMSE reductions of 3.2-7.6% and R-squared increases of 0.5-1.5%. Notably, HistGradientBoosting shows the most significant enhancement with a 7.6% RMSE improvement. Our analysis confirms that cluster-guided latent space augmentation preserves chemical validity while expanding dataset diversity, offering a computationally efficient strategy to enhance predictive models in resource-constrained drug discovery pipelines.

[LG-184] Overfitting has a limitation: a model-independent generalization error bound based on Rényi entropy

链接: https://arxiv.org/abs/2506.00182
作者: Atsushi Suzuki
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization error, which is the impact of overfitting. Understanding generalization error behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization error applicable to algorithms whose outputs are determined solely by the data’s histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the Rényi entropy of the data-generating distribution, suggesting that a small generalization error can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, where the performance degrade is attributed to the consequent increase in the data distribution’s Rényi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the Rényi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.

[LG-185] Minimax Rates for the Estimation of Eigenpairs of Weighted Laplace-Beltrami Operators on Manifolds

链接: https://arxiv.org/abs/2506.00171
作者: Nicolás García Trillos,Chenghui Li,Raghavendra Venkatraman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:We study the problem of estimating eigenpairs of elliptic differential operators from samples of a distribution \rho supported on a manifold M . The operators discussed in the paper are relevant in unsupervised learning and in particular are obtained by taking suitable scaling limits of widely used graph Laplacians over data clouds. We study the minimax risk for this eigenpair estimation problem and explore the rates of approximation that can be achieved by commonly used graph Laplacians built from random data. More concretely, assuming that \rho belongs to a certain family of distributions with controlled second derivatives, and assuming that the d -dimensional manifold M where \rho is supported has bounded geometry, we prove that the statistical minimax rate for approximating eigenvalues and eigenvectors in the H^1(M) -sense is n^-2/(d+4) , a rate that matches the minimax rate for a closely related density estimation problem. We then revisit the literature studying Laplacians over proximity graphs in the large data limit and prove that, under slightly stronger regularity assumptions on the data generating model, eigenpairs of graph Laplacians induce manifold agnostic estimators with an error of approximation that, up to logarithmic corrections, matches our lower bounds. Our analysis allows us to expand the existing literature on graph-based learning in at least two significant ways: 1) we consider stronger norms to measure the error of approximation than the ones that had been analyzed in the past; 2) our rates of convergence are uniform over a family of smooth distributions and do not just apply to densities with special symmetries, and, as a consequence of our lower bounds, are essentially sharp when the connectivity of the graph is sufficiently high.

[LG-186] Generator Based Inference (GBI)

链接: https://arxiv.org/abs/2506.00119
作者: Chi Lung Cheng,Ranit Das,Runze Li,Radha Mastandrea,Vinicius Mikuni,Benjamin Nachman,David Shih,Gup Singh
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 9 pages, 9 figures

点击查看摘要

Abstract:Statistical inference in physics is often based on samples from a generator (sometimes referred to as a ``forward model") that emulate experimental data and depend on parameters of the underlying theory. Modern machine learning has supercharged this workflow to enable high-dimensional and unbinned analyses to utilize much more information than ever before. We propose a general framework for describing the integration of machine learning with generators called Generator Based Inference (GBI). A well-studied special case of this setup is Simulation Based Inference (SBI) where the generator is a physics-based simulator. In this work, we examine other methods within the GBI toolkit that use data-driven methods to build the generator. In particular, we focus on resonant anomaly detection, where the generator describing the background is learned from sidebands. We show how to perform machine learning-based parameter estimation in this context with data-derived generators. This transforms the statistical outputs of anomaly detection to be directly interpretable and the performance on the LHCO community benchmark dataset establishes a new state-of-the-art for anomaly detection sensitivity.

[LG-187] nsor Network for Anomaly Detection in the Latent Space of Proton Collision Events at the LHC

链接: https://arxiv.org/abs/2506.00102
作者: Ema Puljak,Maurizio Pierini,Artur Garcia-Saez
类目: High Energy Physics - Phenomenology (hep-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Quantum Physics (quant-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The pursuit of discovering new phenomena at the Large Hadron Collider (LHC) demands constant innovation in algorithms and technologies. Tensor networks are mathematical models on the intersection of classical and quantum machine learning, which present a promising and efficient alternative for tackling these challenges. In this work, we propose a tensor network-based strategy for anomaly detection at the LHC and demonstrate its superior performance in identifying new phenomena compared to established quantum methods. Our model is a parametrized Matrix Product State with an isometric feature map, processing a latent representation of simulated LHC data generated by an autoencoder. Our results highlight the potential of tensor networks to enhance new-physics discovery.

[LG-188] Probabilistic intraday electricity price forecasting using generative machine learning

链接: https://arxiv.org/abs/2506.00044
作者: Jieyu Chen,Sebastian Lerch,Melanie Schienle,Tomasz Serafin,Rafał Weron
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The growing importance of intraday electricity trading in Europe calls for improved price forecasting and tailored decision-support tools. In this paper, we propose a novel generative neural network model to generate probabilistic path forecasts for intraday electricity prices and use them to construct effective trading strategies for Germany’s continuous-time intraday market. Our method demonstrates competitive performance in terms of statistical evaluation metrics compared to two state-of-the-art statistical benchmark approaches. To further assess its economic value, we consider a realistic fixed-volume trading scenario and propose various strategies for placing market sell orders based on the path forecasts. Among the different trading strategies, the price paths generated by our generative model lead to higher profit gains than the benchmark methods. Our findings highlight the potential of generative machine learning tools in electricity price forecasting and underscore the importance of economic evaluation.

[LG-189] Probabilistic Spatial Interpolation of Sparse Data using Diffusion Models

链接: https://arxiv.org/abs/2506.00033
作者: Valerie Tsao,Nathaniel W. Chaney,Manolis Veveakis
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 41 pages, 14 figures, submitted to AMS Artificial Intelligence for the Earth Systems

点击查看摘要

Abstract:The large underlying assumption of climate models today relies on the basis of a “confident” initial condition, a reasonably plausible snapshot of the Earth for which all future predictions depend on. However, given the inherently chaotic nature of our system, this assumption is complicated by sensitive dependence, where small uncertainties in initial conditions can lead to exponentially diverging outcomes over time. This challenge is particularly salient at global spatial scales and over centennial timescales, where data gaps are not just common but expected. The source of uncertainty is two-fold: (1) sparse, noisy observations from satellites and ground stations, and (2) internal variability stemming from the simplifying approximations within the models themselves. In practice, data assimilation methods are used to reconcile this missing information by conditioning model states on partial observations. Our work builds on this idea but operates at the extreme end of sparsity. We propose a conditional data imputation framework that reconstructs full temperature fields from as little as 1% observational coverage. The method leverages a diffusion model guided by a prekriged mask, effectively inferring the full-state fields from minimal data points. We validate our framework over the Southern Great Plains, focusing on afternoon (12:00-6:00 PM) temperature fields during the summer months of 2018-2020. Across varying observational densities–from swath data to isolated in-situ sensors–our model achieves strong reconstruction accuracy, highlighting its potential to fill in critical data gaps in both historical reanalysis and real-time forecasting pipelines. Comments: 41 pages, 14 figures, submitted to AMS Artificial Intelligence for the Earth Systems Subjects: Applications (stat.AP); Machine Learning (cs.LG) Cite as: arXiv:2506.00033 [stat.AP] (or arXiv:2506.00033v1 [stat.AP] for this version) https://doi.org/10.48550/arXiv.2506.00033 Focus to learn more arXiv-issued DOI via DataCite

[LG-190] Quantum Neural Networks in Practice: A Comparative Study with Classical Models from Standard Data Sets to Industrial Images

链接: https://arxiv.org/abs/2411.19276
作者: Daniel Basilewitsch,João F. Bravo,Christian Tutschku,Frederick Struckmeier
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:In this study, we compare the performance of randomized classical and quantum neural networks (NNs) as well as classical and quantum-classical hybrid convolutional neural networks (CNNs) for the task of binary image classification. We use two distinct methodologies: using randomized NNs on dimensionality-reduced data, and applying CNNs to full image data. We evaluate these approaches on three data sets of increasing complexity: an artificial hypercube dataset, MNIST handwritten digits and real-world industrial images. We analyze correlations between classification accuracy and quantum model hyperparameters, including the number of trainable parameters, feature encoding methods, circuit layers, entangling gate type and structure, gate entangling power, and measurement operators. For random quantum NNs, we compare their performance against literature models. Classical and quantum/hybrid models achieved statistically equivalent classification accuracies across most datasets, with no approach demonstrating consistent superiority. We observe that quantum models show lower variance with respect to initial training parameters, suggesting better training stability. Among the hyperparameters analyzed, only the number of trainable parameters showed a positive correlation with the model performance. Around 94% of the best-performing quantum NNs had entangling gates, although for hybrid CNNs, models without entanglement performed equally well but took longer to converge. Cross-dataset performance analysis revealed limited transferability of quantum models between different classification tasks. Our study provides an industry perspective on quantum machine learning for practical image classification tasks, highlighting both current limitations and potential avenues for further research in quantum circuit design, entanglement utilization, and model transferability across varied applications.

信息检索

[IR-0] GLoSS: Generative Language Models with Semantic Search for Sequential Recommendation KR

链接: https://arxiv.org/abs/2506.01910
作者: Krishna Acharya,Aleksandr V. Petrov,Juba Ziani
类目: Information Retrieval (cs.IR)
*备注: Our code and model checkpoints are publicly available at: this https URL

点击查看摘要

Abstract:We propose Generative Low-rank language model with Semantic Search (GLoSS), a generative recommendation framework that combines large language models with dense retrieval for sequential recommendation. Unlike prior methods such as GPT4Rec, which rely on lexical matching via BM25, GLoSS uses semantic search to retrieve relevant items beyond lexical matching. For query generation, we employ 4-bit quantized LlaMA-3 models fine-tuned with low-rank adaptation (LoRA), enabling efficient training and inference on modest hardware. We evaluate GLoSS on three real-world Amazon review datasets: Beauty, Toys, and Sports, and find that it achieves state-of-the-art performance. Compared to traditional ID-based baselines, GLoSS improves Recall@5 by 33.3%, 52.8%, and 15.2%, and NDCG@5 by 30.0%, 42.6%, and 16.1%, respectively. It also outperforms LLM-based recommenders such as P5, GPT4Rec, LlamaRec and E4SRec with Recall@5 gains of 4.3%, 22.8%, and 29.5%. Additionally, user segment evaluations show that GLoSS performs particularly well for cold-start users in the Amazon Toys and Sports datasets, and benefits from longer user histories in Amazon Beauty dataset, demonstrating robustness across different levels of interaction lengths.

[IR-1] SPOT-Trip: Dual-Preference Driven Out-of-Town Trip Recommendation

链接: https://arxiv.org/abs/2506.01705
作者: Yinghui Liu,Hao Miao,Guojiang Shen,Yan Zhao,Xiangjie Kong,Ivan Lee
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Out-of-town trip recommendation aims to generate a sequence of Points of Interest (POIs) for users traveling from their hometowns to previously unvisited regions based on personalized itineraries, e.g., origin, destination, and trip duration. Modeling the complex user preferences–which often exhibit a two-fold nature of static and dynamic interests–is critical for effective recommendations. However, the sparsity of out-of-town check-in data presents significant challenges in capturing such user preferences. Meanwhile, existing methods often conflate the static and dynamic preferences, resulting in suboptimal performance. In this paper, we for the first time systematically study the problem of out-of-town trip recommendation. A novel framework SPOT-Trip is proposed to explicitly learns the dual static-dynamic user preferences. Specifically, to handle scarce data, we construct a POI attribute knowledge graph to enrich the semantic modeling of users’ hometown and out-of-town check-ins, enabling the static preference modeling through attribute relation-aware aggregation. Then, we employ neural ordinary differential equations (ODEs) to capture the continuous evolution of latent dynamic user preferences and innovatively combine a temporal point process to describe the instantaneous probability of each preference behavior. Further, a static-dynamic fusion module is proposed to merge the learned static and dynamic user preferences. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions, showing that SPOT-Trip achieves performance improvement by up to 17.01%.

[IR-2] Small Stickers Big Meanings: A Multilingual Sticker Semantic Understanding Dataset with a Gamified Approach

链接: https://arxiv.org/abs/2506.01668
作者: Heng Er Metilda Chee,Jiayin Wang,Zhiqiang Guo,Weizhi Ma,Min Zhang
类目: Multimedia (cs.MM); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Stickers, though small, are a highly condensed form of visual expression, ubiquitous across messaging platforms and embraced by diverse cultures, genders, and age groups. Despite their popularity, sticker retrieval remains an underexplored task due to the significant human effort and subjectivity involved in constructing high-quality sticker query datasets. Although large language models (LLMs) excel at general NLP tasks, they falter when confronted with the nuanced, intangible, and highly specific nature of sticker query generation. To address this challenge, we propose a threefold solution. First, we introduce Sticktionary, a gamified annotation framework designed to gather diverse, high-quality, and contextually resonant sticker queries. Second, we present StickerQueries, a multilingual sticker query dataset containing 1,115 English and 615 Chinese queries, annotated by over 60 contributors across 60+ hours. Lastly, Through extensive quantitative and qualitative evaluation, we demonstrate that our approach significantly enhances query generation quality, retrieval accuracy, and semantic understanding in the sticker domain. To support future research, we publicly release our multilingual dataset along with two fine-tuned query generation models. Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR) Cite as: arXiv:2506.01668 [cs.MM] (or arXiv:2506.01668v1 [cs.MM] for this version) https://doi.org/10.48550/arXiv.2506.01668 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-3] Generative Next POI Recommendation with Semantic ID KDD2025

链接: https://arxiv.org/abs/2506.01375
作者: Dongsheng Wang,Yuxi Huang,Shen Gao,Yifan Wang,Chengrui Huang,Shuo Shang
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 4 figures, the paper has been accepted by KDD 2025

点击查看摘要

Abstract:Point-of-interest (POI) recommendation systems aim to predict the next destinations of user based on their preferences and historical check-ins. Existing generative POI recommendation methods usually employ random numeric IDs for POIs, limiting the ability to model semantic relationships between similar locations. In this paper, we propose Generative Next POI Recommendation with Semantic ID (GNPR-SID), an LLM-based POI recommendation model with a novel semantic POI ID (SID) representation method that enhances the semantic understanding of POI modeling. There are two key components in our GNPR-SID: (1) a Semantic ID Construction module that generates semantically rich POI IDs based on semantic and collaborative features, and (2) a Generative POI Recommendation module that fine-tunes LLMs to predict the next POI using these semantic IDs. By incorporating user interaction patterns and POI semantic features into the semantic ID generation, our method improves the recommendation accuracy and generalization of the model. To construct semantically related SIDs, we propose a POI quantization method based on residual quantized variational autoencoder, which maps POIs into a discrete semantic space. We also propose a diversity loss to ensure that SIDs are uniformly distributed across the semantic space. Extensive experiments on three benchmark datasets demonstrate that GNPR-SID substantially outperforms state-of-the-art methods, achieving up to 16% improvement in recommendation accuracy.

[IR-4] AI4Contracts: LLM RAG -Powered Encoding of Financial Derivative Contracts

链接: https://arxiv.org/abs/2506.01063
作者: Maruf Ahmed Mridul,Ian Sloyan,Aparna Gupta,Oshani Seneviratne
类目: Information Retrieval (cs.IR)
*备注: 8 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) are reshaping how AI systems extract and organize information from unstructured text. A key challenge is designing AI methods that can incrementally extract, structure, and validate information while preserving hierarchical and contextual relationships. We introduce CDMizer, a template-driven, LLM, and RAG-based framework for structured text transformation. By leveraging depth-based retrieval and hierarchical generation, CDMizer ensures a controlled, modular process that aligns generated outputs with predefined schema. Its template-driven approach guarantees syntactic correctness, schema adherence, and improved scalability, addressing key limitations of direct generation methods. Additionally, we propose an LLM-powered evaluation framework to assess the completeness and accuracy of structured representations. Demonstrated in the transformation of Over-the-Counter (OTC) financial derivative contracts into the Common Domain Model (CDM), CDMizer establishes a scalable foundation for AI-driven document understanding, structured synthesis, and automated validation in broader contexts.

[IR-5] AliBoost: Ecological Boosting Framework in Alibaba Platform KDD2025

链接: https://arxiv.org/abs/2506.00954
作者: Qijie Shen,Yuanchen Bei,Zihong Huang,Jialin Zhu,Keqin Xu,Boya Du,Jiawei Tang,Yuning Jiang,Feiran Huang,Xiao Huang,Hao Chen
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 5 figures, accepted by KDD2025

点击查看摘要

Abstract:Maintaining a healthy ecosystem in billion-scale online platforms is challenging, as users naturally gravitate toward popular items, leaving cold and less-explored items behind. This ‘‘rich-get-richer’’ phenomenon hinders the growth of potentially valuable cold items and harms the platform’s ecosystem. Existing cold-start models primarily focus on improving initial recommendation performance for cold items but fail to address users’ natural preference for popular content. In this paper, we introduce AliBoost, Alibaba’s ecological boosting framework, designed to complement user-oriented natural recommendations and foster a healthier ecosystem. AliBoost incorporates a tiered boosting structure and boosting principles to ensure high-potential items quickly gain exposure while minimizing disruption to low-potential items. To achieve this, we propose the Stacking Fine-Tuning Cold Predictor to enhance the foundation CTR model’s performance on cold items for accurate CTR and potential prediction. AliBoost then employs an Item-oriented Bidding Boosting mechanism to deliver cold items to the most suitable users while balancing boosting speed with user-personalized preferences. Over the past six months, AliBoost has been deployed across Alibaba’s mainstream platforms, successfully cold-starting over a billion new items and increasing both clicks and GMV of cold items by over 60% within 180 days. Extensive online analysis and A/B testing demonstrate the effectiveness of AliBoost in addressing ecological challenges, offering new insights into the design of billion-scale recommender systems.

[IR-6] Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering

链接: https://arxiv.org/abs/2506.00491
作者: Linhao Ye,Lang Yu,Zhikai Lei,Qin Chen,Jie Zhou,Liang He
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is usually integrated into large language models (LLMs) to mitigate hallucinations and knowledge obsolescence. Whereas,conventional one-step retrieve-and-read methods are insufficient for multi-hop question answering, facing challenges of retrieval semantic mismatching and the high cost in handling interdependent subquestions. In this paper, we propose Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering (Q-DREAM). Q-DREAM consists of three key modules: (1) the Question Decomposition Module (QDM), which decomposes multi-hop questions into fine-grained subquestions; (2) the Subquestion Dependency Optimizer Module (SDOM), which models the interdependent relations of subquestions for better understanding; and (3) the Dynamic Passage Retrieval Module (DPRM), which aligns subquestions with relevant passages by optimizing the semantic embeddings. Experimental results across various benchmarks demonstrate that Q-DREAM significantly outperforms existing RAG methods, achieving state-of-the-art performance in both in-domain and out-of-domain settings. Notably, Q-DREAM also improves retrieval efficiency while maintaining high accuracy compared with recent baselines.

[IR-7] K-order Ranking Preference Optimization for Large Language Models

链接: https://arxiv.org/abs/2506.00441
作者: Shihao Cai,Chongming Gao,Yang Zhang,Wentao Shi,Jizhi Zhang,Keqin Bao,Qifan Wang,Fuli Feng
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:To adapt large language models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities. However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending the DPO’s Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise. The code is available at this https URL.

[IR-8] FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems

链接: https://arxiv.org/abs/2506.00314
作者: Hideaki Joko,Faegheh Hasibi
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:A systematic, reliable, and low-cost evaluation of Conversational Recommender Systems (CRSs) remains an open challenge. Existing automatic CRS evaluation methods are proven insufficient for evaluating the dynamic nature of recommendation conversations. This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method that provides evaluation scores for diverse turn and dialogue level qualities of recommendation conversations. FACE is reference-free and shows strong correlation with human judgments, achieving system correlation of 0.9 and turn/dialogue-level of 0.5, outperforming state-of-the-art CRS evaluation methods by a large margin. Additionally, unlike existing LLM-based methods that provide single uninterpretable scores, FACE provides insights into the system performance and enables identifying and locating problems within conversations.

[IR-9] Curate Connect Inquire: A System for Findable Accessible Interoperable and Reusable (FAIR) Human-Robot Centered Datasets ICRA2025

链接: https://arxiv.org/abs/2506.00220
作者: Xingru Zhou,Sadanand Modak,Yao-Cheng Chan,Zhiyun Deng,Luis Sentis,Maria Esteva
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: 7 pages (excluding references), 8 pages (including references); 5 figures; accepted to the ICRA 2025 Workshop on Human-Centered Robot Learning in the Era of Big Data and Large Models

点击查看摘要

Abstract:The rapid growth of AI in robotics has amplified the need for high-quality, reusable datasets, particularly in human-robot interaction (HRI) and AI-embedded robotics. While more robotics datasets are being created, the landscape of open data in the field is uneven. This is due to a lack of curation standards and consistent publication practices, which makes it difficult to discover, access, and reuse robotics data. To address these challenges, this paper presents a curation and access system with two main contributions: (1) a structured methodology to curate, publish, and integrate FAIR (Findable, Accessible, Interoperable, Reusable) human-centered robotics datasets; and (2) a ChatGPT-powered conversational interface trained with the curated datasets metadata and documentation to enable exploration, comparison robotics datasets and data retrieval using natural language. Developed based on practical experience curating datasets from robotics labs within Texas Robotics at the University of Texas at Austin, the system demonstrates the value of standardized curation and persistent publication of robotics data. The system’s evaluation suggests that access and understandability of human-robotics data are significantly improved. This work directly aligns with the goals of the HCRL @ ICRA 2025 workshop and represents a step towards more human-centered access to data for embodied AI.

[IR-10] Getting almost all the bits from a quantum random access code

链接: https://arxiv.org/abs/2506.01903
作者: Han-Hsuan Lin(National Tsing Hua University, Taiwan),Ronald de Wolf(QuSoft, CWI and University of Amsterdam)
类目: Quantum Physics (quant-ph); Information Retrieval (cs.IR)
*备注: 14 pages LaTeX

点击查看摘要

Abstract:A quantum random access code (QRAC) is a map x\mapsto\rho_x that encodes n -bit strings x into m -qubit quantum states \rho_x , in a way that allows us to recover any one bit of x with success probability \geq p . The measurement on \rho_x that is used to recover, say, x_1 may destroy all the information about the other bits; this is in fact what happens in the well-known QRAC that encodes n=2 bits into m=1 qubits. Does this generalize to large n , i.e., could there exist QRACs that are so “obfuscated” that one cannot get much more than one bit out of them? Here we show that this is not the case: for every QRAC there exists a measurement that (with high probability) recovers the full n -bit string x up to small Hamming distance, even for the worst-case x .

附件下载

点击下载今日全部论文列表