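This post lists the latest papers retrieved from arXiv.org on 2025-06-05. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each morning.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-06-05)
618 papers were updated today, including:
- Natural Language Processing: 134 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 160 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 165 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 184 papers (Machine Learning (cs.LG))
Natural Language Processing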
[NLP-0] Efficient Knowledge Editing via Minimal Precomputation ACL2025
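Quick read: This paper addresses the high computational cost of the precomputation step in knowledge editing methods such as MEMIT, which requires precomputing a large number of hidden vectors for every edited layer. The key contribution is showing that this cost is unnecessary: through theoretical analysis and experiments, the authors show that effective knowledge editing needs only a very small fraction of those hidden vectors. Specifically, the precomputation step can be cut to less than 0.3% of the originally stipulated number of vectors, which sharply reduces precomputation time.
Link: https://arxiv.org/abs/2506.04226
Authors: Akshat Gupta, Maochuan Lu, Thomas Hartvigsen, Gopala Anumanchipalli
Affiliations: UC Berkeley; University of Virginia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Main Conference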
Abstract:Knowledge editing methods like MEMIT are able to make data and compute efficient updates of factual knowledge by using a single sentence to update facts and their consequences. However, what is often overlooked is a “precomputation step”, which requires a one-time but significant computational cost. The authors of MEMIT originally precompute approximately 44 million hidden vectors per edited layer, which requires a forward pass over 44 million tokens. For GPT-J (6B), this precomputation step takes 36 hours on a single GPU, while it takes approximately 40 hours for Llama2-7B. Additionally, this precomputation time grows with model size. In this paper, we show that this excessive computational cost is unnecessary. Knowledge editing using MEMIT and related methods, such as ROME and EMMET, can be performed by pre-computing a very small portion of the 44 million hidden vectors. We first present the theoretical minimum number of hidden vector precomputation required for solutions of these editing methods to exist. We then empirically show that knowledge editing using these methods can be done by pre-computing significantly fewer hidden vectors. Specifically, we show that the precomputation step can be done with less than 0.3% of the originally stipulated number of hidden vectors. This saves a significant amount of precomputation time and allows users to begin editing new models within a few minutes.
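As a rough sense of scale, the figures quoted in the abstract can be worked through directly. The sketch below is illustrative only: the constants come from the abstract, while the assumption that precomputation time scales linearly with the number of hidden vectors is ours and is not a claim made by the paper. The function names are hypothetical.

```python
# Figures quoted in the abstract.
ORIGINAL_VECTORS = 44_000_000   # hidden vectors precomputed per edited layer
GPTJ_HOURS = 36.0               # reported precomputation time for GPT-J (6B)
FRACTION = 0.003                # upper bound: "less than 0.3%"


def reduced_vector_budget(original: int, fraction: float) -> int:
    """Upper bound on the number of hidden vectors at the given fraction."""
    return round(original * fraction)


def estimated_minutes(full_hours: float, fraction: float) -> float:
    """Estimated precomputation time in minutes.

    Assumption (ours, not the paper's): time scales linearly with vector count.
    """
    return full_hours * fraction * 60.0


budget = reduced_vector_budget(ORIGINAL_VECTORS, FRACTION)
minutes = estimated_minutes(GPTJ_HOURS, FRACTION)
print(f"at most {budget:,} vectors, roughly {minutes:.1f} minutes for GPT-J")
```

Under that linear-scaling assumption the estimate lands in the range of minutes rather than hours, which is consistent with the abstract's statement that users can begin editing new models within a few minutes.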
[NLP-1] Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
Quick read: This paper asks whether extending the thinking process at test time (test-time scaling) really improves reasoning models. The study finds that extra thinking yields an initial gain followed by a decline caused by "overthinking": additional thinking increases output variance, which creates an illusion of improved reasoning. The key proposal is an alternative, parallel thinking, inspired by Best-of-N sampling. It generates multiple independent reasoning paths within the same inference budget and selects the most consistent answer by majority vote, achieving up to 20% higher accuracy than extended thinking.
Link: https://arxiv.org/abs/2506.04210
Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, Amrit Singh Bedi
Affiliations: University of Maryland; University of Michigan; Princeton University; Amazon AGI; University of Central Florida
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like “Wait” or “Let me rethink” can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to “overthinking”. To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance, creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from “more thinking” are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
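The majority-vote step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `parallel_thinking` and the stand-in sampler are hypothetical names, and how the inference budget is divided among the paths is not modeled here.

```python
from collections import Counter
from typing import Callable


def parallel_thinking(sample_answer: Callable[[], str], n_paths: int) -> str:
    """Sample n_paths independent answers and return the most frequent one."""
    if n_paths < 1:
        raise ValueError("n_paths must be at least 1")
    answers = [sample_answer() for _ in range(n_paths)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Usage with a stand-in sampler; a real one would query a reasoning model
# (hypothetical here) and extract its final answer.
_canned = iter(["42", "41", "42", "7", "42"])
result = parallel_thinking(lambda: next(_canned), n_paths=5)
print(result)
```

Voting over final answers only works when answers can be compared for equality, so in practice some normalization of the extracted answer would likely be needed first.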
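[NLP-2] Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Quick read: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on complex reasoning tasks, in particular the difficulty of activating complex reasoning. By analyzing current training pipelines, the authors identify three key phenomena. First, effective cold-start initialization is critical for MLLM reasoning: initializing with carefully selected text data alone can outperform many recent multimodal reasoning models, even before multimodal reinforcement learning (RL). Second, standard GRPO applied to multimodal RL suffers from gradient stagnation, which hurts training stability and performance. Third, text-only RL training after the multimodal RL phase further improves multimodal reasoning. Building on these findings, the paper introduces ReVisual-R1, which addresses the multimodal RL issues and uses a staged training strategy to reach state-of-the-art results on several challenging benchmarks.
Link: https://arxiv.org/abs/2506.04207
Authors: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
Affiliations: Zhejiang University; Fudan University; Soochow University; Shanghai AI Laboratory; The Chinese University of Hong Kong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 6 figures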
Abstract:Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.
[NLP-3] R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning
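Quick read: This paper addresses the limited reasoning ability of Large Language Models (LLMs) in tasks that require deep interaction with search: models often fail to identify optimal reasoning-search interaction trajectories, which leads to suboptimal responses. The key contribution is R-Search, a reinforcement learning (RL) framework for reasoning-search integration. It uses multi-reward signals to guide LLMs to carry out multi-step reasoning with deep search interaction autonomously, optimizing the reasoning-search trajectory and improving response quality on complex logic- and knowledge-intensive tasks.
Link: https://arxiv.org/abs/2506.04185
Authors: Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Beijing Normal University
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 3 figures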
Abstract:Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at this https URL.
[NLP-4] Long or short CoT? Investigating Instance-level Switch of Large Reasoning Models
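Quick read: This paper addresses the fact that long Chain-of-Thought (CoT) strategies perform well on complex tasks but consume far more tokens, and asks how to balance reasoning accuracy against computational efficiency. The key contribution is SwitchCoT, an automatic framework that dynamically chooses between long and short CoT based on task context and resource availability, trading off reasoning accuracy and computational cost.
Link: https://arxiv.org/abs/2506.04182
Authors: Ruiqi Zhang, Changyi Xiao, Yixin Cao
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: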
Abstract:With the rapid advancement of large reasoning models, long Chain-of-Thought (CoT) prompting has demonstrated strong performance on complex tasks. However, this often comes with a significant increase in token usage. In this paper, we conduct a comprehensive empirical analysis comparing long and short CoT strategies. Our findings reveal that while long CoT can lead to performance improvements, its benefits are often marginal relative to its significantly higher token consumption. Specifically, long CoT tends to outperform when ample generation budgets are available, whereas short CoT is more effective under tighter budget constraints. These insights underscore the need for a dynamic approach that selects the proper CoT strategy based on task context and resource availability. To address this, we propose SwitchCoT, an automatic framework that adaptively chooses between long and short CoT strategies to balance reasoning accuracy and computational efficiency. Moreover, SwitchCoT is designed to be budget-aware, making it broadly applicable across scenarios with varying resource constraints. Experimental results demonstrate that SwitchCoT can reduce inference costs by up to 50% while maintaining high accuracy. Notably, under limited token budgets, it achieves performance comparable to, or even exceeding, that of using either long or short CoT alone.
[NLP-5] SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
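Quick read: This paper addresses common problems in long-form text generation: weak coherence, poor logical consistency, and declining text quality as sequence length grows. The key contribution is the SuperWriter-Agent framework, which adds explicit structured thinking, in the form of planning and refinement stages, to the generation pipeline. This steers generation toward a more deliberate, cognitively grounded process similar to that of a professional writer, improving the quality and consistency of long-form text.
Link: https://arxiv.org/abs/2506.04180
Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: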
Abstract:Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking-through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.
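[NLP-6] SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling
Quick read: This paper addresses the high inference cost of Large Language Models (LLMs) caused by their deep, multi-layered architectures. Conventional static layer pruning overlooks two dynamics of LLM inference: horizontal dynamics (token-level heterogeneity calls for context-aware pruning decisions) and vertical dynamics (the different functional roles of MLP and self-attention layers call for component-specific pruning policies). The key contribution is SkipGPT, a dynamic layer pruning framework built on two core ideas: global token-aware routing that prioritizes critical tokens, and decoupled pruning policies for MLP and self-attention components. To mitigate training instability, the authors use a two-stage optimization paradigm consisting of a disentangled training phase followed by parameter-efficient LoRA fine-tuning.
Link: https://arxiv.org/abs/2506.04179
Authors: Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, Xiaoyu Shen
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: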
Abstract:Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: this https URL.
[NLP-7] A Dataset for Addressing Patients Information Needs related to Clinical Course of Hospitalization
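Quick read: This paper addresses the difficulty existing AI systems have in meeting hospitalized patients' information needs accurately and relevantly. The core obstacle is the lack of a high-quality dataset for evaluating the factuality and relevance of AI-generated answers. The key contribution is ArchEHR-QA, an expert-annotated dataset based on real patient cases from intensive care unit and emergency department settings. It includes patient questions, clinical note excerpts, and clinician-authored answers. The authors also benchmark large language models (LLMs) under three prompting strategies to assess factuality and relevance.
Link: https://arxiv.org/abs/2506.04156
Authors: Sarvesh Soni, Dina Demner-Fushman
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: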
Abstract:Patients have distinct information needs about their hospitalization that can be addressed using clinical evidence from electronic health records (EHRs). While artificial intelligence (AI) systems show promise in meeting these needs, robust datasets are needed to evaluate the factual accuracy and relevance of AI-generated responses. To our knowledge, no existing dataset captures patient information needs in the context of their EHRs. We introduce ArchEHR-QA, an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. The cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. To establish benchmarks for grounded EHR question answering (QA), we evaluated three open-weight large language models (LLMs)–Llama 4, Llama 3, and Mixtral–across three prompting strategies: generating (1) answers with citations to clinical note sentences, (2) answers before citations, and (3) answers from filtered citations. We assessed performance on two dimensions: Factuality (overlap between cited note sentences and ground truth) and Relevance (textual and semantic similarity between system and reference answers). The final dataset contains 134 patient cases. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores. Manual error analysis supported these findings and revealed common issues such as omitted key clinical evidence and contradictory or hallucinated content. Overall, ArchEHR-QA provides a strong benchmark for developing and evaluating patient-centered EHR QA systems, underscoring the need for further progress toward generating factual and relevant responses in clinical contexts.
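The abstract defines Factuality as the overlap between cited note sentences and the ground truth. One common way to score such an overlap is precision, recall, and F1 over sets of sentence identifiers. The sketch below assumes that formulation for illustration; the exact metric used in the paper is not stated in the abstract, and `citation_prf` is a hypothetical name.

```python
from typing import Iterable, Tuple


def citation_prf(cited: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 between cited and ground-truth sentence IDs."""
    cited_set, gold_set = set(cited), set(gold)
    overlap = len(cited_set & gold_set)
    precision = overlap / len(cited_set) if cited_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    # F1 is defined as 0 when there is no overlap, which avoids dividing by zero.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Usage: the system cites sentences 1, 3, 4; annotators marked 1, 2, 3 as relevant.
p, r, f1 = citation_prf(cited=[1, 3, 4], gold=[1, 2, 3])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Treating citations as sets ignores duplicate citations and citation order, which seems reasonable for sentence-level relevance but is a design choice of this sketch.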
[NLP-8] Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis ACL2025
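Quick read: This paper addresses the fairness problem in evaluating Large Language Models (LLMs) that arises from data contamination in public benchmarks. Existing evaluations rely on public benchmarks that are prone to contamination, and earlier work responded by building dynamic benchmarks, which is costly and has to be repeated continually. The key idea here is to analyze the mechanisms of contaminated models themselves: the authors find that overestimated scores likely stem from parameters acquiring shortcut solutions during training. They propose a method that identifies shortcut neurons through comparative and causal analysis, and introduce an evaluation method called shortcut neuron patching that suppresses these neurons and thereby mitigates contamination.
Link: https://arxiv.org/abs/2506.04142
Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main Conference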
Abstract:The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous researches have focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (ρ) exceeding 0.95. This high correlation indicates that our method closely reveals true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: this https URL
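For readers unfamiliar with the statistic, the Spearman coefficient is the Pearson correlation computed on ranks, so it measures how well two score lists agree in ordering rather than in absolute value. The sketch below is a self-contained illustration of that definition; the function names are hypothetical, it assumes neither input is constant, and it is not the evaluation code from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Usage with made-up scores: the ordering matches although the relation is nonlinear.
rho = spearman([1, 2, 3, 4], [1, 4, 9, 100])
print(rho)
```

Because only ranks matter, two evaluation methods can reach a high ρ while assigning quite different absolute scores to the same models.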
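[NLP-9] MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
Quick read: This paper addresses the limited ability of Multimodal Large Language Models (MLLMs) to locate multi-frame evidence in videos and reason over it. Existing video benchmarks focus mainly on understanding tasks, which only require a model to match the "question frame" mentioned in the question and perceive a few adjacent frames, so they do not really test long-range multi-frame reasoning or deep reasoning. To fill this gap, the paper proposes MMR-V, a benchmark for multimodal deep reasoning in videos. Its tasks feature long-range multi-frame reasoning, reasoning over hidden information beyond direct perception, reliable manual annotation, and carefully designed distractors, giving a fuller evaluation of multimodal reasoning.
Link: https://arxiv.org/abs/2506.04141
Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project Page: this https URL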
Abstract:The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as “question frame”) and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded for multi-modal reasoning differs from it in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.
[NLP-10] Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish?
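Quick read: This paper asks how to capture emotional valence in everyday language accurately, specifically in spontaneous, real-world narratives in Flemish. Tools such as LIWC and Pattern are widely used in computational linguistics and emotion research, but they are limited when faced with the complexity and context dependence of natural language. The study evaluates three Dutch-specific large language models (LLMs) on predicting valence ratings and compares them with the traditional tools. The results show that, despite architectural advances, these models still fail to capture valence accurately in real-world narratives. The authors therefore stress the need for culturally and linguistically tailored models and tools to improve automated valence analysis.
Link: https://arxiv.org/abs/2506.04139
Authors: Ratna Kandala, Katie Hoemann
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: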
Abstract:Understanding the nuances in everyday language is pivotal for advancements in computational linguistics emotions research. Traditional lexicon-based tools such as LIWC and Pattern have long served as foundational instruments in this domain. LIWC is the most extensively validated word count based text analysis tool in the social sciences and Pattern is an open source Python library offering functionalities for NLP. However, everyday language is inherently spontaneous, richly expressive, deeply context dependent. To explore the capabilities of LLMs in capturing the valences of daily narratives in Flemish, we first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants. Each participant provided narratives prompted by the question, “What is happening right now and how do you feel about it?”, accompanied by self-assessed valence ratings on a continuous scale from -50 to +50. We then assessed the performance of three Dutch-specific LLMs in predicting these valence scores, and compared their outputs to those generated by LIWC and Pattern. Our findings indicate that, despite advancements in LLM architectures, these Dutch tuned models currently fall short in accurately capturing the emotional valence present in spontaneous, real-world narratives. This study underscores the imperative for developing culturally and linguistically tailored models/tools that can adeptly handle the complexities of natural language use. Enhancing automated valence analysis is not only pivotal for advancing computational methodologies but also holds significant promise for psychological research with ecologically valid insights into human daily experiences. We advocate for increased efforts in creating comprehensive datasets finetuning LLMs for low-resource languages like Flemish, aiming to bridge the gap between computational linguistics emotion research.
[NLP-11] CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues ACL
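Quick read: This paper addresses the detection and analysis of manipulation carried out through legal jargon, in particular manipulation detection, identification of the primary manipulator, and classification of manipulative techniques in courtroom conversations. The key contributions are the LegalCon dataset and the CLAIM framework. LegalCon contains 1,063 annotated courtroom conversations with a focus on long conversations. CLAIM is a two-stage, intent-driven multi-agent framework designed to improve manipulation analysis through context-aware and informed decision-making.
Link: https://arxiv.org/abs/2506.04131
Authors: Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Affiliations: Manipal University Jaipur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to SICon 2025 ACL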
Abstract:Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at this https URL.
[NLP-12] Rectified Sparse Attention
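Quick read: This paper addresses the efficiency of long-sequence generation in large language models, specifically the KV cache misalignment that arises in sparse decoding methods, where approximation errors accumulate and degrade generation quality. The key contribution is Rectified Sparse Attention (ReSA), which combines block-sparse attention with periodic dense rectification: a dense forward pass refreshes the KV cache at fixed intervals, which bounds error accumulation and preserves alignment with the pretraining distribution.
Link: https://arxiv.org/abs/2506.04108
Authors: Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
Affiliations: Microsoft Research; Tsinghua University; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: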
Abstract:Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at this https URL.
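The idea of refreshing at fixed intervals can be illustrated with a toy schedule. The sketch below models only the timing of dense refreshes, not attention or the KV cache itself. The function names and the use of "number of consecutive sparse steps" as a stand-in for accumulated error are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def decode_schedule(num_steps: int, rectify_every: int) -> List[str]:
    """Label each decoding step 'sparse' or 'dense' (a rectification pass)."""
    if rectify_every < 1:
        raise ValueError("rectify_every must be at least 1")
    return [
        "dense" if (step + 1) % rectify_every == 0 else "sparse"
        for step in range(num_steps)
    ]


def max_stale_steps(schedule: List[str]) -> int:
    """Longest run of sparse steps between refreshes (toy proxy for error)."""
    longest = current = 0
    for kind in schedule:
        current = 0 if kind == "dense" else current + 1
        longest = max(longest, current)
    return longest


# Usage: with a refresh every 4 steps, no run of sparse steps exceeds 3,
# whereas without any refresh the run grows with the sequence length.
schedule = decode_schedule(num_steps=10, rectify_every=4)
print(schedule)
print(max_stale_steps(schedule))
```

In this toy model the staleness bound depends only on the interval, not on the total length, which mirrors the abstract's claim that periodic rectification bounds error accumulation.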
[NLP-13] TextAtari: 100K Frames Game Playing with Language Agents
Quick read: This paper addresses how to evaluate language agents on long-horizon decision-making, in particular complex planning tasks spanning up to 100,000 steps. The key contribution is the TextAtari benchmark, which converts the visual state representations of classic Atari games into rich textual descriptions, creating a challenging test bed that combines sequential decision-making with natural language processing. The text is produced with an unsupervised representation learning framework (AtariARI), and several agent frameworks and scenarios are used to assess how different forms of prior knowledge affect long-horizon planning.
Link: https://arxiv.org/abs/2506.04098
Authors: Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Affiliations: Tongji University; East China Normal University; Huawei Cloud Huawei Technologies Co., Ltd.; Shanghai Jiao Tong University; Bank of Communications; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 51 pages, 39 figures
Abstract:We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.
[NLP-14] AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment ACL2025
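Quick read: This paper addresses the challenge that ambiguous user instructions in real-world environments still pose for Large Language Models (LLMs). Existing methods for task ambiguity detection are hard to compare because they are tested on different datasets and no unified benchmark exists. The paper proposes AmbiK, a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the help of LLMs and validated by humans. It contains 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), along with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. The aim is to give researchers a unified benchmark for comparing ambiguity detection methods.
Link: https://arxiv.org/abs/2506.04089
Authors: Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Affiliations: LMU (University of Munich); MIPT (Moscow Institute of Physics and Technology); AIRI (Artificial Intelligence Research Institute)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: ACL 2025 (Main Conference)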
Abstract:As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), the fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at this https URL.
[NLP-15] Multimodal Tabular Reasoning with Privileged Structured Information
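Quick read: This paper addresses tabular reasoning from table images. In real-world settings tables usually appear as images without a high-quality textual representation, so reasoning methods built on Large Language Models (LLMs) cannot be applied directly. The key contribution is a framework called Turbo, which uses structured information available during training to enhance Multimodal Large Language Models (MLLMs). Turbo relies on a structure-aware reasoning trace generator based on DeepSeek-R1 to produce high-quality modality-bridged data, and it repeatedly generates and selects advantageous reasoning paths to strengthen the model's tabular reasoning ability.
Link: https://arxiv.org/abs/2506.04088
Authors: Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Affiliations: Alibaba Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: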
Abstract:Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation (Turbo), a new framework for multimodal tabular reasoning with privileged structured tables. Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, Turbo repeatedly generates and selects the advantageous reasoning paths, further enhancing the model’s tabular reasoning ability. Experimental results demonstrate that, with limited (9k) data, Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
[NLP-16] EuroLLM-9B: Technical Report
【Quick Read】: This paper addresses the underrepresentation and underserving of European languages in existing open large language models. The key to the solution is EuroLLM-9B, a large language model trained from scratch that supports all 24 official European Union languages plus 11 additional languages. Components such as EuroFilter (an AI-based multilingual filter) and EuroBlocks-Synthetic (a synthetic post-training dataset) improve coverage of European languages and model performance.
Link: https://arxiv.org/abs/2506.04079
Authors: Pedro Henrique Martins,João Alves,Patrick Fernandes,Nuno M. Guerreiro,Ricardo Rei,Amin Farajian,Mateusz Klimaszewski,Duarte M. Alves,José Pombal,Manuel Faysse,Pierre Colombo,François Yvon,Barry Haddow,José G. C. de Souza,Alexandra Birch,André F. T. Martins
Affiliations: Unbabel; Instituto de Telecomunicações & Instituto Superior Técnico, Universidade de Lisboa; Carnegie Mellon University; MICS, CentraleSupélec, Université Paris-Saclay; Illuin Technology; University of Edinburgh; Equall; Aveni; Sorbonne Université, CNRS, ISIR
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 56 pages
Abstract:This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B’s competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.
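[NLP-17] LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
【Quick Read】: This paper addresses shortcomings in how medical large language models (LLMs) are currently evaluated: question design is narrow (mostly multiple-choice), data sources rarely come from real clinical scenarios, and evaluation methods assess complex reasoning poorly. The key to the solution is the LLMEval-Med benchmark, which covers five core medical areas with 2,996 questions built from real-world electronic health records and expert-designed clinical scenarios. It is paired with an automated evaluation pipeline that incorporates expert-developed checklists into an LLM-as-Judge framework, validates machine scoring through human-machine agreement analysis, and dynamically refines checklists and prompts to keep the evaluation accurate and stable.
Link: https://arxiv.org/abs/2506.04078
Authors: Ming Zhang,Yujiong Shen,Zelin Li,Huayu Sha,Binze Hu,Yuhui Wang,Chenhao Huang,Shichun Liu,Jingqi Tong,Changhao Jiang,Mingxu Chai,Zhiheng Xi,Shihan Dou,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Fudan University; Northwestern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: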
Abstract:Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in this https URL.
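[NLP-18] A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions ISCA
【Quick Read】: This paper addresses the scarcity of labeled speech recordings in automated speaking assessment (ASA) on opinion expressions, which limits prompt diversity and undermines scoring reliability. The key to the solution is a novel training paradigm that uses a large language model (LLM) to generate diverse responses at a given proficiency level, converts them into synthesized speech via speaker-aware text-to-speech synthesis, and applies a dynamic importance loss that adaptively reweights training instances according to feature distribution differences between synthesized and real speech. A multimodal LLM then fuses aligned textual features with speech signals to predict scores directly.
Link: https://arxiv.org/abs/2506.04077
Authors: Chung-Chun Wang,Jhen-Ke Lin,Hao-Chien Lu,Hong-Yun Lin,Berlin Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to the ISCA SLaTE-2025 Workshop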
Abstract:Automated speaking assessment (ASA) on opinion expressions is often hampered by the scarcity of labeled recordings, which restricts prompt diversity and undermines scoring reliability. To address this challenge, we propose a novel training paradigm that leverages a large language model (LLM) to generate diverse responses of a given proficiency level, converts responses into synthesized speech via speaker-aware text-to-speech synthesis, and employs a dynamic importance loss to adaptively reweight training instances based on feature distribution differences between synthesized and real speech. Subsequently, a multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly. Experiments conducted on the LTTC dataset show that our approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.
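[NLP-19] Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems ISCA
【Quick Read】: This paper addresses the accurate capture of disfluencies in automatic speaking assessment, in particular recognizing hesitations and filled pauses when transcribing second-language (L2) speech, which matters for downstream error analysis and feedback. Conventional automatic speech recognition (ASR) systems often discard or generalize hesitations and lose important acoustic detail. The key to the solution is fine-tuning Whisper models with an acoustically precise filler annotation scheme (Extra) inferred by Gemini 2.0 Flash. Compared with the scheme that removes hesitations (Pure), it improves transcription accuracy, reaching a 5.5% word error rate (WER), an 11.3% relative improvement.
Link: https://arxiv.org/abs/2506.04076
Authors: Jhen-Ke Lin,Hao-Chien Lu,Chung-Chun Wang,Hong-Yun Lin,Berlin Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to the ISCA SLaTE-2025 Workshop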
Abstract:Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the “Extra” scheme yielded a 5.5% WER, an 11.3% relative improvement over the “Pure” scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
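The relative improvement quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `relative_improvement` is an assumption introduced here, and the two WER values are the post-challenge figures stated in the abstract.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer, in percent."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Post-challenge WER values quoted in the abstract, in percent.
pure_wer = 6.2   # hesitations removed ("Pure")
extra_wer = 5.5  # acoustically precise fillers ("Extra")

improvement = relative_improvement(pure_wer, extra_wer)
print(f"{improvement:.1f}% relative improvement")
```

The absolute gap is 0.7 WER points; dividing by the 6.2% baseline gives the relative figure reported by the authors.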
[NLP-20] Controlling Difficulty of Generated Text for AI-Assisted Language Learning EMNLP2025
【Quick Read】: This paper addresses the fact that large language models (LLMs) generate text that is too complex for beginner language learners (CEFR: A1-A2). The key to the solution is controllable generation, specifically modular methods that require no model fine-tuning: future discriminators are used to adjust the comprehensibility of the output so that it better supports absolute beginners.
Link: https://arxiv.org/abs/2506.04072
Authors: Meiqing Jin,Liam Dugan,Chris Callison-Burch
Affiliations: University of Pennsylvania
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Submitted to EMNLP 2025
Abstract:Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques – specifically modular methods that do not require model fine-tuning – can adapt LLM outputs to better support absolute beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails to control output difficulty, the use of future discriminators (Yang and Klein, 2021) significantly improves output comprehensibility (from 40.4% to 84.3%). We further introduce a novel token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
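The abstract describes Token Miss Rate (TMR) as the proportion of incomprehensible tokens per utterance. A minimal sketch of that ratio follows, under a loud assumption: a token counts as incomprehensible when it is missing from a learner's known-vocabulary set. The function name, the vocabulary set, and the pre-tokenized input are all hypothetical; the paper's actual tokenization and comprehensibility criterion may differ.

```python
def token_miss_rate(tokens: list[str], known_vocab: set[str]) -> float:
    """Fraction of tokens in one utterance that are not in known_vocab.

    Assumption: "incomprehensible" is approximated by vocabulary lookup.
    """
    if not tokens:
        return 0.0
    missed = sum(1 for token in tokens if token not in known_vocab)
    return missed / len(tokens)


# Hypothetical beginner vocabulary and a pre-tokenized Japanese utterance.
known = {"私", "は", "学生", "です"}
utterance = ["私", "は", "大学院生", "です"]

tmr = token_miss_rate(utterance, known)
print(tmr)
```

Here one of the four tokens is outside the assumed vocabulary, so a quarter of the utterance is counted as missed.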
[NLP-21] LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
【Quick Read】: This paper addresses the generation of precise, in-situ, practically usable navigation instructions for visually impaired (VI) individuals, a relatively underexplored area. The key to the solution is LaF-GRPO (LLM-as-Follower GRPO), in which a large language model (LLM) simulates a VI user's responses to produce reward signals that guide the post-training of a vision-language model (VLM), improving instruction usability while reducing reliance on real-world data.
Link: https://arxiv.org/abs/2506.04070
Authors: Yi Zhao,Siqi Wang,Jing Li
Affiliations: The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:
Abstract:Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study, hence, focuses on producing precise, in-situ, step-by-step navigation instructions that are practically usable by VI users. Concretely, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to generate rewards guiding the Vision-Language Model (VLM) post-training. This enhances instruction usability while reducing costly real-world data needs. To facilitate training and testing, we introduce NIG4VI, a 27k-sample open-sourced benchmark. It provides diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended in-situ instruction generation. Experiments on NIG4VI show the effectiveness of LaF-GRPO by quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU +14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o’s 0.323) and yields more intuitive, safer instructions. Code and benchmark are available at this https URL.
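[NLP-22] Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning
【Quick Read】: This paper addresses inefficient sample utilization and inflexible handling of samples of varying difficulty during the post-training of large language models (LLMs). The key to the solution is a new framework, Customized Curriculum Learning (CCL), with two core innovations: a model-adaptive difficulty definition that customizes the curriculum dataset to each model's own capabilities, and Guided Prompting, which dynamically lowers sample difficulty through strategic hints so that hard samples that would otherwise degrade performance can be used effectively.
Link: https://arxiv.org/abs/2506.04065
Authors: Muling Wu,Qi Qian,Wenhao Liu,Xiaohua Wang,Zisu Huang,Di Liang,LI Miao,Shihan Dou,Changze Lv,Zhenghua Wang,Zhibo Xu,Lina Chen,Tianlong Li,Xiaoqing Zheng,Xuanjing Huang
Affiliations: Fudan University; ByteDance Inc.
Categories: Computation and Language (cs.CL)
Comments: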
Abstract:Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible difficulty samples processing. To address these limitations, we propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce model-adaptive difficulty definition that customizes curriculum datasets based on each model’s individual capabilities rather than using predefined difficulty metrics. Second, we develop “Guided Prompting,” which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance. Comprehensive experiments on supervised fine-tuning and reinforcement learning demonstrate that CCL significantly outperforms uniform training approaches across five mathematical reasoning benchmarks, confirming its effectiveness across both paradigms in enhancing sample utilization and model performance.
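[NLP-23] High Accuracy Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning
【Quick Read】: This paper addresses hallucination, where large language models (LLMs) give incorrect answers when they lack the necessary knowledge or capability. The key to the solution is post-training the model to generate content only when it is confident the content is correct and otherwise to partially abstain. The method, HALT, splits the pretrained model's responses into factual fragments and uses ground-truth information to identify the incorrect ones, producing finetuning data aligned with the model's capabilities. Removing or replacing incorrect fragments trades response completeness against correctness, raising the mean correctness of response fragments by 15% on average across the evaluated domains.
Link: https://arxiv.org/abs/2506.04051
Authors: Tim Franzmeyer,Archie Sravankumar,Lijuan Liu,Yuning Mao,Rui Hou,Sinong Wang,Jakob N. Foerster,Luke Zettlemoyer,Madian Khabsa
Affiliations: Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: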
Abstract:Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability – a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with “Unsure from Here” – according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response’s fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
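The abstract mentions two ways of turning a fragment-labeled response into a finetuning target: removing incorrect fragments, or replacing them with "Unsure from Here". The sketch below illustrates both under stated assumptions. The `Fragment` class, the `build_target` function, the mode names, and the choice to stop at the first incorrect fragment are hypothetical; the abstract does not spell out the exact rule or how the tunable threshold selects between the two, so the threshold is left out here.

```python
from dataclasses import dataclass

UNSURE_MARKER = "Unsure from Here"  # marker string quoted in the abstract


@dataclass
class Fragment:
    text: str
    correct: bool  # label assumed to come from a ground-truth check


def build_target(fragments: list[Fragment], mode: str) -> list[str]:
    """Build a finetuning target from labeled fragments (illustrative only)."""
    if mode == "remove":
        # Keep only the fragments judged correct.
        return [f.text for f in fragments if f.correct]
    if mode == "truncate":
        # Assumption: stop at the first incorrect fragment and abstain from there.
        kept: list[str] = []
        for f in fragments:
            if not f.correct:
                kept.append(UNSURE_MARKER)
                break
            kept.append(f.text)
        return kept
    raise ValueError(f"unknown mode: {mode}")


response = [
    Fragment("Step 1: expand the product.", True),
    Fragment("Step 2: cancel the wrong term.", False),
    Fragment("Step 3: simplify.", True),
]
removed = build_target(response, "remove")
truncated = build_target(response, "truncate")
```

In this toy case the "remove" mode keeps steps 1 and 3, while the "truncate" mode keeps step 1 and then abstains, which shows the completeness versus correctness trade-off the abstract refers to.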
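[NLP-24] Explainability-Based Token Replacement on LLM-Generated Text
【Quick Read】: This paper studies how explainable AI (XAI) methods can reduce the detectability of AI-generated text (AIGT). The key to the approach is using SHAP and LIME to identify the tokens that most influence a classifier's predictions and then modifying them with explanation-based token replacement strategies, which weakens a single classifier's detection ability. The study also shows that an ensemble classifier maintains strong performance across multiple languages and domains, indicating that a multi-model approach can mitigate token-level manipulation.
Link: https://arxiv.org/abs/2506.04050
Authors: Hadi Mohammadi,Anastasia Giachanou,Daniel L. Oberski,Ayoub Bagheri
Affiliations: Utrecht University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: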
Abstract:Generative models, especially large language models (LLMs), have shown remarkable progress in producing text that appears human-like. However, they often exhibit patterns that make their output easier to detect than text written by humans. In this paper, we investigate how explainable AI (XAI) methods can be used to reduce the detectability of AI-generated text (AIGT) while also introducing a robust ensemble-based detection approach. We begin by training an ensemble classifier to distinguish AIGT from human-written text, then apply SHAP and LIME to identify tokens that most strongly influence its predictions. We propose four explainability-based token replacement strategies to modify these influential tokens. Our findings show that these token replacement approaches can significantly diminish a single classifier’s ability to detect AIGT. However, our ensemble classifier maintains strong performance across multiple languages and domains, showing that a multi-model approach can mitigate the impact of token-level manipulations. These results show that XAI methods can make AIGT harder to detect by focusing on the most influential tokens. At the same time, they highlight the need for robust, ensemble-based detection strategies that can adapt to evolving approaches for hiding AIGT.
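[NLP-25] On Support Samples of Next Word Prediction ACL2025
【Quick Read】: This paper addresses the interpretability of language model decisions, focusing on data-centric interpretability for the next-word prediction task. The key to the approach is using the representer theorem to identify two types of support samples, those that promote and those that deter specific predictions, and showing that being a support sample is an intrinsic property that can be predicted before training begins. The study also highlights the role of non-support samples in preventing overfitting and in shaping generalization and representation learning, a role that grows in deeper layers.
Link: https://arxiv.org/abs/2506.04047
Authors: Yuqian Li,Yupei Du,Yufang Liu,Feifei Feng,Mou Xiao Feng,Yuanbin Wu
Affiliations: East China Normal University; Utrecht University; Midea Group
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 (Main Conference)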
Abstract:Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates data-centric interpretability in language models, focusing on the next-word prediction task. Using representer theorem, we identify two types of support samples, those that either promote or deter specific predictions. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation this http URL insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.
[NLP-26] Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs SEMEVAL-2025 ACL2025
【Quick Read】: This paper addresses the removal of specific knowledge from a large language model, that is, unlearning sensitive content without retraining from scratch and without compromising overall utility (SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models). The key to the solution is combining influence functions, which remove the influence of the data on the model, with second-order optimization, which stabilizes overall utility.
Link: https://arxiv.org/abs/2506.04044
Authors: Aleksey Kudelya,Alexander Shirnin
Affiliations: HSE University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to SemEval-2025, an ACL 2025 workshop
Abstract:This paper describes LIBU (LoRA enhanced influence-based unlearning), an algorithm to solve the task of unlearning - removing specific knowledge from a large language model without retraining from scratch and compromising its overall utility (SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models). The algorithm combines classical influence functions to remove the influence of the data from the model and second-order optimization to stabilize the overall utility. Our experiments show that this lightweight approach is well applicable for unlearning LLMs in different kinds of task.
[NLP-27] Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate WOAH2025 ACL
【Quick Read】: This paper addresses how to evaluate the effectiveness of counter-narratives (CN) generated by large language models (LLMs) for countering online hate speech. The key to the solution is an evaluation framework covering four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. The framework is used to analyze systematically how LLM-generated CNs perform in practice and where they fall short.
Link: https://arxiv.org/abs/2506.04043
Authors: Mikel K. Ngueajio,Flor Miriam Plaza-del-Arco,Yi-Ling Chung,Danda B. Rawat,Amanda Cercas Curry
Affiliations: Howard University; LIACS, Leiden University; Genaios; CENTAI Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted at ACL WOAH 2025
Abstract:Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere’s CommandR-7B, and Meta’s LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.
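[NLP-28] Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness
【Quick Read】: This paper addresses uncontrollable knowledge editing in large language models: existing locate-then-edit methods tend to modify relations that are connected to the edited subject but unrelated to the target edit, which causes side effects. The key contribution is revealing a shortcut learning issue in the optimization process, where the model over-learns the subject feature while neglecting the relation feature. To fix this, the authors propose a new two-stage optimization process that balances the learning of the subject feature and the relation feature, enabling controllable knowledge editing.
Link: https://arxiv.org/abs/2506.04042
Authors: Xiyu Liu,Zhengxiao Liu,Naibin Gu,Zheng Lin,Ji Xiang,Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Categories: Computation and Language (cs.CL)
Comments: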
Abstract:Knowledge editing aims to alter the target knowledge predicted by large language models while ensuring the least side effects on unrelated knowledge. An effective way to achieve knowledge editing is to identify pivotal parameters for predicting factual associations and modify them with an optimization process to update the predictions. However, these locate-then-edit methods are uncontrollable since they tend to modify most unrelated relations connected to the subject of target editing. We unveil that this failure of controllable editing is due to a shortcut learning issue during the optimization process. Specifically, we discover two crucial features that are the subject feature and the relation feature for models to learn during optimization, but the current optimization process tends to over-learn the subject feature while neglecting the relation feature. To eliminate this shortcut learning of the subject feature, we propose a novel two-stage optimization process that balances the learning of the subject feature and the relation feature. Experimental results demonstrate that our approach successfully prevents knowledge editing from shortcut learning and achieves the optimal overall performance, contributing to controllable knowledge editing.
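[NLP-29] LexTime: A Benchmark for Temporal Ordering of Legal Events
【Quick Read】: This paper addresses temporal reasoning in legal texts, in particular how well large language models (LLMs) understand event ordering. Existing datasets lack expert language evaluation, which leaves a gap in understanding how models handle event ordering in legal contexts. The key to the solution is LexTime, the first dataset designed to evaluate LLMs' event ordering capabilities in legal language. It contains 512 instances from U.S. Federal Complaints, each annotated with event pairs and their temporal relations, and serves as a benchmark for studying model performance.
Link: https://arxiv.org/abs/2506.04041
Authors: Claire Barale,Leslie Barrett,Vikram Sunil Bajaj,Michael Rovatsos
Affiliations: School of Informatics, University of Edinburgh; Bloomberg
Categories: Computation and Language (cs.CL)
Comments: Preprint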
Abstract:Temporal reasoning in legal texts is important for applications like case law analysis and compliance monitoring. However, existing datasets lack expert language evaluation, leaving a gap in understanding how LLMs manage event ordering in legal contexts. We introduce LexTime, the first dataset designed to evaluate LLMs’ event ordering capabilities in legal language, consisting of 512 instances from U.S. Federal Complaints with annotated event pairs and their temporal relations. Our findings show that (1) LLMs are more accurate on legal event ordering than on narrative (up to +10.5%); (2) longer input contexts and implicit events boost accuracy, reaching 80.8% for implicit-explicit event pairs; (3) legal linguistic complexities and nested clauses remain a challenge. We investigate how context length, explicit vs implicit event pairs, and legal language features affect model performance, demonstrating the need for specific modeling strategies to enhance temporal event reasoning.
[NLP-30] Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
【Quick Read】: This paper addresses hallucination in Large Visual Language Models (LVLMs), which stems mainly from modality misalignment and from the inherent hallucinations of the underlying Large Language Models (LLMs). The key to the solution is Entity-centric Multimodal Preference Optimization (EMPO), which achieves better modality alignment than existing human preference alignment methods. To overcome the scarcity of high-quality multimodal preference data, the authors use open-source instruction datasets to automatically construct preference data across three aspects (image, instruction, and response), improving the model's reliability and accuracy.
Link: https://arxiv.org/abs/2506.04039
Authors: Jiulong Wu,Zhengliang Shi,Shuaiqiang Wang,Jizhou Huang,Dawei Yin,Lingyong Yan,Min Cao,Min Zhang
Affiliations: Soochow University; Baidu Inc.; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to the modality misalignment and the inherent hallucinations of their underlying Large Language Model (LLM) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves better modality alignment than existing human preference alignment methods. Besides, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.
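[NLP-31] The mutual exclusivity bias of bilingual visually grounded speech models INTERSPEECH2025
【Quick Read】: This paper studies word learning in multilingual settings, specifically whether bilingual learners show a weaker mutual exclusivity (ME) bias when associating words with objects. It trains bilingual visually grounded speech (VGS) models on combinations of English, French, and Dutch and finds that bilingual models generally exhibit a weaker ME bias than monolingual models. The key to the analysis is the combined visual embeddings of the bilingual models: their smaller variance for familiar data partly explains the increased confusion between novel and familiar concepts, and sheds light on why the ME bias arises and how it changes in bilingual settings.
Link: https://arxiv.org/abs/2506.04037
Authors: Dan Oneata,Leanne Nortje,Yevgen Matusevych,Herman Kamper
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2025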
Abstract:Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: this https URL
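[NLP-32] AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data
【Quick Read】: This paper addresses how to provide realistic and diverse test environments for healthcare agentic models in order to support their training and evaluation. The key to the solution is a Patient Simulator built from real electronic health record (EHR) data, which plays the patient in a multi-turn conversation with a symptom-checking agent. Patient vignettes derived from real encounters are used to create synthetic test subjects; clinicians scored the simulator as consistent with the vignettes in 97.7% of cases, which suggests a way to train and test multi-turn conversational AI agents at scale.
Link: https://arxiv.org/abs/2506.04032
Authors: Sina Rashidian,Nan Li,Jonathan Amar,Jong Ha Lee,Sam Pugh,Eric Yang,Geoff Masterson,Myoung Cha,Yugang Jia,Akhil Vaid
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: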
Abstract:Background: We present a Patient Simulator that leverages real world patient encounters which cover a broad range of conditions and symptoms to provide synthetic test subjects for development and testing of healthcare agentic models. The simulator provides a realistic approach to patient presentation and multi-turn conversation with a symptom-checking agent. Objectives: (1) To construct and instantiate a Patient Simulator to train and test an AI health agent, based on patient vignettes derived from real EHR data. (2) To test the validity and alignment of the simulated encounters provided by the Patient Simulator to expert human clinical providers. (3) To illustrate the evaluation framework of such an LLM system on the generated realistic, data-driven simulations – yielding a preliminary assessment of our proposed system. Methods: We first constructed realistic clinical scenarios by deriving patient vignettes from real-world EHR encounters. These vignettes cover a variety of presenting symptoms and underlying conditions. We then evaluate the performance of the Patient Simulator as a simulacrum of a real patient encounter across over 500 different patient vignettes. We leveraged a separate AI agent to provide multi-turn questions to obtain a history of present illness. The resulting multiturn conversations were evaluated by two expert clinicians. Results: Clinicians scored the Patient Simulator as consistent with the patient vignettes in those same 97.7% of cases. The extracted case summary based on the conversation history was 99% relevant. Conclusions: We developed a methodology to incorporate vignettes derived from real healthcare patient data to build a simulation of patient responses to symptom checking agents. The performance and alignment of this Patient Simulator could be used to train and test a multi-turn conversational AI agent at scale.
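[NLP-33] QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering ACL2025
【Quick Read】: This paper addresses the fact that conventional Review-based Product Question Answering (PQA) systems generate answers from a single perspective and fail to capture the diversity of customer opinions. The key to the solution is a new task, Quantitative Query-Focused Summarization (QQSUM), which summarizes opinions into representative Key Points (KPs) and quantifies their prevalence in order to answer user queries more completely. To this end the authors propose QQSUM-RAG, which extends Retrieval-Augmented Generation (RAG) and uses few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, producing summaries that reflect diverse and representative opinions.
Link: https://arxiv.org/abs/2506.04020
Authors: An Quang Tang,Xiuzhen Zhang,Minh Ngoc Dinh,Zhuang Li
Affiliations: RMIT University, Australia
Categories: Computation and Language (cs.CL)
Comments: Paper accepted to ACL 2025 Main Conference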
Abstract:Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper we introduce a novel task Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: this https URL
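[NLP-34] CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking
【Quick Read】: This paper examines whether large language models (LLMs) are suitable for code-equivalence checking, that is, deciding whether two programs are functionally equivalent. The key finding is that even very simple code transformations significantly reduce the performance of state-of-the-art (SOTA) LLMs on this task; the authors propose a simple fine-tuning-based approach to improve performance on transformed program pairs.
Link: https://arxiv.org/abs/2506.04019
Authors: Neeva Oza,Ishaan Govil,Parul Gupta,Dinesh Khandelwal,Dinesh Garg,Parag Singla
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: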
Abstract:LLMs have been extensively used for the task of automated code generation. In this work, we examine the applicability of LLMs for the related but relatively unexplored task of code-equivalence checking, i.e., given two programs, whether they are functionally equivalent or not. This is an important problem since benchmarking code equivalence can play a critical role in evaluating LLM capabilities for tasks such as code re-writing and code translation. Towards this end, we present CETBench - Code Equivalence with Transformations Benchmark, constructed via a repository of programs, where two programs in the repository may be solving the same or different tasks. Each instance in our dataset is obtained by taking a pair of programs in the repository and applying a random series of pre-defined code transformations, resulting in (non-)equivalent pairs. Our analysis on this dataset reveals a surprising finding that very simple code transformations in the underlying pair of programs can result in a significant drop in performance of SOTA LLMs for the task of code-equivalence checking. To remedy this, we present a simple fine-tuning-based approach to boost LLM performance on the transformed pairs of programs. Our approach for dataset generation is generic, and can be used with repositories with varying program difficulty levels and allows for applying varying numbers as well as kinds of transformations. In our experiments, we perform ablations over the difficulty level of original programs, as well as the kind of transformations used in generating pairs for equivalence checking. Our analysis presents deep insights into the working of LLMs for the task of code-equivalence, and points to the fact that they may still be far from what could be termed as a semantic understanding of the underlying code.
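To make the idea of a semantics-preserving code transformation concrete, here is a minimal sketch in Python. It is not the CETBench pipeline: the abstract does not list the transformations used, so variable renaming is an assumed example, and `RenameVariable`, `rename_variable`, `load_function`, and `agree_on` are hypothetical names introduced for illustration.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename one identifier everywhere it appears.

    This preserves behaviour only if new_name is not already used
    in the program (an assumption of this sketch).
    """

    def __init__(self, old_name, new_name):
        self.old_name = old_name
        self.new_name = new_name

    def visit_Name(self, node):
        # Covers loads and stores of plain variables.
        if node.id == self.old_name:
            node.id = self.new_name
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        if node.arg == self.old_name:
            node.arg = self.new_name
        return node


def rename_variable(source, old_name, new_name):
    """Return the source code with old_name renamed to new_name."""
    tree = RenameVariable(old_name, new_name).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def load_function(source, func_name):
    """Execute the source in a fresh namespace and return one function."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name]


ORIGINAL = """
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

TRANSFORMED = rename_variable(ORIGINAL, "acc", "running_sum")


def agree_on(inputs, src_a, src_b, func_name="total"):
    """Check that both programs return the same result on each input."""
    f = load_function(src_a, func_name)
    g = load_function(src_b, func_name)
    return all(f(x) == g(x) for x in inputs)
```

Agreement on a handful of inputs is only evidence of equivalence, not a proof; a benchmark can label a pair as equivalent because it knows which transformations were applied, which is the property the abstract relies on.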
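[NLP-35] Agent Misalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents NEURIPS2025
[Quick Read]: This paper addresses the possibility that Large Language Model (LLM) agents exhibit misalignment in realistic settings, in particular their propensity to attempt misaligned behaviour (misalignment propensity). Prior work has focused on agents' ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity), but the likelihood that agents attempt misaligned behaviour in real-world settings remains poorly understood. The paper introduces AgentMisalignment, a misalignment propensity benchmark made up of realistic scenarios that cover subcategories such as goal-guarding, resisting shutdown, sandbagging, and power-seeking. The key step is to vary agent personalities systematically through system prompts: the authors find that persona characteristics can influence misalignment tendencies dramatically and unpredictably, occasionally more than the choice of model, which underlines the importance of careful system prompt engineering for deployed AI agents.
Link: https://arxiv.org/abs/2506.04018
Authors: Akshat Naik,Patrick Quinn,Guillermo Bosch,Emma Gouné,Francisco Javier Campos Zabala,Jason Ross Brown,Edward James Young
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Preprint, under review for NeurIPS 2025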
Abstract:As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. Prior work has examined agents’ ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity). However, the likelihood of agents attempting misaligned behaviours in real-world settings (misalignment propensity) remains poorly understood. We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios in which LLM agents have the opportunity to display misaligned behaviour. We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking. We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models. Finally, we systematically vary agent personalities through different system prompts. We find that persona characteristics can dramatically and unpredictably influence misalignment tendencies – occasionally far more than the choice of model itself – highlighting the importance of careful system prompt engineering for deployed AI agents. Our work highlights the failure of current alignment methods to generalise to LLM agents, and underscores the need for further propensity evaluations as autonomous systems become more prevalent.
[NLP-36] Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era ACL
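[Quick Read]: This paper asks how well state-of-the-art foundation models represent the semantic feature norms of concrete object concepts, for example whether they capture that a rose is red, smells sweet, and is a flower. The approach uses probing tasks to test which object properties the models are aware of, evaluating image encoders trained on image data alone, multimodally trained image encoders, and language-only models on predicting an extended, denser version of the McRae norms and the newer Binder dataset of attribute ratings.
Link: https://arxiv.org/abs/2506.03994
Authors: Dan Oneata,Desmond Elliott,Stella Frank
Affiliations: Politehnica Bucharest; Pioneer Center for AI; Department of Computer Science, University of Copenhagen
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL Findings 2025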
Abstract:Human learning and conceptual representation is grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as “encyclopedic” or “function”. These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.
[NLP-37] Words of Warmth: Trust and Sociability Norms for over 26k English Words ACL2025
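[Quick Read]: This paper addresses how to quantify and analyse, at the level of language, Warmth (W) and Competence (C), the core dimensions along which people assess other people and groups, and in particular the two components of warmth: Trust (T) and Sociability (S). The key contribution is Words of Warmth, the first large-scale repository of manually derived word-warmth (as well as word-trust and word-sociability) associations, covering over 26,000 English words, with associations shown to be highly reliable. The lexicon supports studying the rate at which children acquire WCTS words with age, as well as research on bias and stereotypes.
Link: https://arxiv.org/abs/2506.03993
Authors: Saif M. Mohammad
Affiliations: National Research Council Canada
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: In Proceedings of ACL 2025 Main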
Abstract:Social psychologists have shown that Warmth (W) and Competence (C) are the primary dimensions along which we assess other people and groups. These dimensions impact various aspects of our lives from social competence and emotion regulation to success in the work place and how we view the world. More recent work has started to explore how these dimensions develop, why they have developed, and what they constitute. Of particular note, is the finding that warmth has two distinct components: Trust (T) and Sociability (S). In this work, we introduce Words of Warmth, the first large-scale repository of manually derived word–warmth (as well as word–trust and word–sociability) associations for over 26k English words. We show that the associations are highly reliable. We use the lexicons to study the rate at which children acquire WCTS words with age. Finally, we show that the lexicon enables a wide variety of bias and stereotype research through case studies on various target entities. Words of Warmth is freely available at: this http URL
[NLP-38] DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
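[Quick Read]: This paper addresses the excessive number of visual tokens produced when videos are represented as sequences of visual tokens, which raises computational overhead, especially for long videos. The key is DynTok, a dynamic video token compression strategy that adaptively splits visual tokens into groups and merges them within each group. It achieves high compression in regions of low information density while preserving essential content, reducing the token count to 44.4% of the original size while maintaining comparable performance.
Link: https://arxiv.org/abs/2506.03990
Authors: Hongzhi Zhang,Jingyuan Zhang,Xingguang Ji,Qi Wang,Fuzheng Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: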
Abstract:Typical video modeling methods, such as LLaVA, represent videos as sequences of visual tokens, which are then processed by the LLM backbone for effective video understanding. However, this approach leads to a massive number of visual tokens, especially for long videos. A practical solution is to first extract relevant visual information from the large visual context before feeding it into the LLM backbone, thereby reducing computational overhead. In this work, we introduce DynTok, a novel Dynamic video Token compression strategy. DynTok adaptively splits visual tokens into groups and merges them within each group, achieving high compression in regions with low information density while preserving essential content. Our method reduces the number of tokens to 44.4% of the original size while maintaining comparable performance. It further benefits from increasing the number of video frames and achieves 65.3% on Video-MME and 72.5% on MLVU. By applying this simple yet effective compression method, we expose the redundancy in video token representations and offer insights for designing more efficient video modeling techniques.
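The group-then-merge idea can be illustrated with a small Python sketch. The abstract does not say how DynTok forms its groups, so the rule used here (start a new group whenever the cosine similarity between neighbouring tokens drops below a threshold, then average each group) is an assumption, and `cosine`, `merge_similar_tokens`, and the `threshold` parameter are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_tokens(tokens, threshold=0.9):
    """Group consecutive similar tokens and replace each group by its mean.

    Runs of near-identical tokens (low information density) collapse into
    one vector, while dissimilar tokens stay separate.
    """
    groups = []
    for token in tokens:
        # Compare with the last token of the current group (assumed rule).
        if groups and cosine(groups[-1][-1], token) >= threshold:
            groups[-1].append(token)
        else:
            groups.append([token])
    # Average each group component-wise.
    return [[sum(column) / len(group) for column in zip(*group)]
            for group in groups]


# Four toy 2-d "visual tokens": two near-duplicates, then two duplicates.
toy_tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
merged = merge_similar_tokens(toy_tokens)
```

With these toy inputs the four tokens collapse into two, so the compression ratio depends on how redundant the input is rather than being fixed in advance.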
[NLP-39] Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models
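[Quick Read]: This paper asks whether multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits now that long-context language models (LMs) are available. The approach is a controlled evaluation under systematically scaled token budgets that compares two multi-stage RAG pipelines (ReadAgent and RAPTOR) against three baselines, including DOS RAG. DOS RAG, a simple retrieve-then-read method that preserves the original passage order, consistently matches or outperforms the more intricate methods on multiple long-context question answering (QA) benchmarks. The authors recommend DOS RAG as a simple yet strong baseline for future RAG evaluations.
Link: https://arxiv.org/abs/2506.03989
Authors: Alex Laitenberger,Christopher D. Manning,Nelson F. Liu
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures, for associated source code, see this https URL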
Abstract:With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single pass, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document’s Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, pairing it with emerging embedding and language models to assess trade-offs between complexity and effectiveness as model capabilities evolve.
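A retrieve-then-read step that preserves original passage order can be sketched in a few lines of Python. This is an illustration of the idea as the abstract describes it, not the authors' implementation: the word-overlap `score` is a toy stand-in for embedding similarity, a fixed `k` replaces the paper's token budget, and all function names are hypothetical.

```python
def score(query, chunk):
    """Toy relevance score: number of distinct words shared with the query."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split()))


def dos_rag_context(query, chunks, k):
    """Select the k most relevant chunks, then restore document order."""
    # Rank chunk indices by relevance, best first.
    ranked = sorted(range(len(chunks)),
                    key=lambda i: score(query, chunks[i]),
                    reverse=True)
    # Keep the top k, but present them in their original positions.
    selected = sorted(ranked[:k])
    return [chunks[i] for i in selected]


chunks = ["the cat sat", "dogs bark loudly", "the cat ate fish", "fish swim"]
context = dos_rag_context("cat fish", chunks, k=2)
```

Here the best-scoring chunk is the third one, yet it appears second in `context` because reading order is restored after selection; a conventional pipeline would instead hand the reader the chunks in relevance order.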
[NLP-40] Around the World in 24 Hours: Probing LLM Knowledge of Time and Place
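[Quick Read]: This paper addresses the lack of evidence on how well language models reason jointly over time and space: earlier work tested temporal or spatial logical reasoning in isolation, or only in simple or artificial environments, without assessing realistic combinations of the two. The key contribution is GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones, which provides a realistic and broad benchmark for evaluating joint temporal and geographic reasoning.
Link: https://arxiv.org/abs/2506.03984
Authors: Carolin Holtermann,Paul Röttger,Anne Lauscher
Affiliations: University of Hamburg; University of Bocconi
Categories: Computation and Language (cs.CL)
Comments: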
Abstract:Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored as previous work has tested their abilities for logical reasoning in terms of time and space in isolation or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained in tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that their performance is heavily influenced by prompt formulation - a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.
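The kind of question that ties temporal to geographic knowledge can be made concrete with a short Python sketch that computes a ground-truth answer for a time conversion between two cities. The prompt wording and the `UTC_OFFSETS` table are hypothetical and not taken from GeoTemp; fixed offsets are used for simplicity, whereas a realistic generator would need a daylight-saving-aware time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for two cities that do not observe daylight saving time.
UTC_OFFSETS = {
    "Tokyo": timedelta(hours=9),
    "New Delhi": timedelta(hours=5, minutes=30),
}


def convert_local_time(hhmm, source_city, target_city):
    """Convert a wall-clock time in source_city to the time in target_city."""
    source_zone = timezone(UTC_OFFSETS[source_city])
    target_zone = timezone(UTC_OFFSETS[target_city])
    # An arbitrary date is attached so that the conversion is well defined.
    local = datetime.strptime(hhmm, "%H:%M").replace(
        year=2025, month=6, day=5, tzinfo=source_zone)
    return local.astimezone(target_zone).strftime("%H:%M")


def make_prompt(hhmm, source_city, target_city):
    """Build one illustrative question together with its reference answer."""
    question = (f"It is {hhmm} in {source_city}. "
                f"What time is it in {target_city}?")
    return question, convert_local_time(hhmm, source_city, target_city)


question, answer = make_prompt("09:00", "Tokyo", "New Delhi")
```

Answering correctly requires knowing both where each city is (its time zone) and how to do the clock arithmetic, which is the combination the abstract reports models find hard.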
[NLP-41] Voice Activity Projection Model with Multimodal Encoders
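[Quick Read]: This paper addresses turn-taking management in human-machine interaction, which is challenging because of the complexity of the social context and its multimodal nature. Conventional systems rely on silence duration, whereas earlier voice activity projection (VAP) models improved prediction by using a unified representation of turn-taking behaviours as prediction targets. The key to the proposed solution is a multimodal VAP model enhanced with pre-trained audio and face encoders that capture subtle expressions, achieving competitive and in some cases better results than state-of-the-art models on turn-taking metrics.
Link: https://arxiv.org/abs/2506.03980
Authors: Takeshi Saga,Catherine Pelachaud
Affiliations: Sorbonne Université
Categories: Computation and Language (cs.CL)
Comments: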
Abstract:Turn-taking management is crucial for any social interaction. Still, it is challenging to model human-machine interaction due to the complexity of the social context and its multimodal nature. Unlike conventional systems based on silence duration, previous existing voice activity projection (VAP) models successfully utilized a unified representation of turn-taking behaviors as prediction targets, which improved turn-taking prediction performance. Recently, a multimodal VAP model outperformed the previous state-of-the-art model by a significant margin. In this paper, we propose a multimodal model enhanced with pre-trained audio and face encoders to improve performance by capturing subtle expressions. Our model performed competitively, and in some cases, even better than state-of-the-art models on turn-taking metrics. All the source codes and pretrained models are available at this https URL.
[NLP-42] Structured Pruning for Diverse Best-of-N Reasoning Optimization ACL2025
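[Quick Read]: This paper explores how model pruning, traditionally seen as a way to save computation, can improve the reasoning ability of transformer-based language models. The key is SPRINT, a contrastive learning framework that dynamically selects the best attention head and layer to prune at inference time: by aligning question embeddings with head embeddings, it identifies the pruned-head configurations that lead to more accurate reasoning.
Link: https://arxiv.org/abs/2506.03978
Authors: Hieu Trung Nguyen,Bao Nguyen,Viet Anh Nguyen
Affiliations: The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025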
Abstract:Model pruning in transformer-based language models, traditionally viewed as a means of achieving computational savings, can enhance the model’s reasoning capabilities. In this work, we uncover a surprising phenomenon: the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-N and random head selection strategies on the MATH500 and GSM8K datasets.
[NLP-43] From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding ACL2025
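[Quick Read]: This paper addresses the generation of large-scale, diverse, and complex instruction data for automatically aligning Large Language Models (LLMs). Existing methods for synthesizing instructions either draw on limited grounding sources, which yields a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. The key to the proposed solution is attributed grounding, which has two core steps: a top-down attribution process that grounds a selective set of real instructions to situated users, and a bottom-up synthesis process that uses web documents to first generate a situation and then a meaningful instruction. This framework makes it possible to harvest diverse and complex instructions at scale.
Link: https://arxiv.org/abs/2506.03968
Authors: Chiwei Zhu,Benfeng Xu,Xiaorui Wang,Zhendong Mao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: To be published at ACL 2025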
Abstract:The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at this https URL.
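[NLP-44] TableEval: A Real-World Benchmark for Complex Multilingual and Multi-Structured Table Question Answering
[Quick Read]: This paper addresses the challenges that Large Language Models (LLMs) face in table question answering (TableQA), including weak handling of complex table structures, multilingual data, and domain-specific reasoning, as well as the limitations of existing benchmarks with respect to data leakage and cross-lingual, cross-domain generalization. The key contribution is TableEval, a TableQA benchmark for realistic scenarios. It includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (government, finance, academia, and industry reports), covers cross-lingual scenarios in Simplified Chinese, Traditional Chinese, and English, and draws all data from recent real-world documents to reduce the risk of data leakage. The authors also propose SEAT, an evaluation framework that assesses the semantic alignment between model responses and reference answers at the sub-question level, to measure model performance more accurately.
Link: https://arxiv.org/abs/2506.03949
Authors: Junnan Zhu,Jingyi Wang,Bohan Yu,Xiaoyu Wu,Junbo Li,Lei Wang,Nan Xu
Affiliations: Beijing Wenge Technology Co., Ltd.; State Key Laboratory of Multimodal Artificial Intelligence Systems; Institute of Automation, CAS; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: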
Abstract:LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: this https URL.
[NLP-45] Hanging in the Balance: Pivotal Moments in Crisis Counseling Conversations ACL2025
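[Quick Read]: This paper addresses the detection of pivotal moments in a conversation, that is, moments at which the response given can substantially change how the conversation unfolds. The key is an unsupervised computational method that detects such moments from the uncertainty in the expected outcome: a moment counts as pivotal when the expected outcome varies widely depending on what might be said next.
Link: https://arxiv.org/abs/2506.03941
Authors: Vivian Nguyen,Lillian Lee,Cristian Danescu-Niculescu-Mizil
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
Comments: To appear in the Proceedings of ACL 2025. Code and demo available in ConvoKit ( this http URL )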
Abstract:During a conversation, there can come certain moments where its outcome hangs in the balance. In these pivotal moments, how one responds can put the conversation on substantially different trajectories leading to significantly different outcomes. Systems that can detect when such moments arise could assist conversationalists in domains with highly consequential outcomes, such as mental health crisis counseling. In this work, we introduce an unsupervised computational method for detecting such pivotal moments as they happen, in an online fashion. Our approach relies on the intuition that a moment is pivotal if our expectation of the outcome varies widely depending on what might be said next. By applying our method to crisis counseling conversations, we first validate it by showing that it aligns with human perception – counselors take significantly longer to respond during moments detected by our method – and with the eventual conversational trajectory – which is more likely to change course at these times. We then use our framework to explore the relation of the counselor’s response during pivotal moments with the eventual outcome of the session.
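The intuition that a moment is pivotal when the expected outcome varies widely across possible next responses can be sketched in Python. This is only a reading of that one sentence, not the paper's estimator: it assumes some forecaster has already produced an outcome probability for each plausible next response, uses population variance as the spread measure, and introduces `pivotal_score`, `detect_pivotal`, and the threshold as hypothetical names and values.

```python
from statistics import pvariance


def pivotal_score(outcome_forecasts):
    """Spread of the forecast outcome across candidate next responses.

    outcome_forecasts holds one predicted probability of a good outcome
    per plausible next response (assumed to come from an external model).
    """
    return pvariance(outcome_forecasts)


def detect_pivotal(forecasts_per_turn, threshold):
    """Return the indices of turns whose spread reaches the threshold."""
    return [index
            for index, forecasts in enumerate(forecasts_per_turn)
            if pivotal_score(forecasts) >= threshold]


# Turn 0: every candidate reply leads to about the same forecast.
# Turn 1: the forecast depends heavily on which reply is given.
forecasts_per_turn = [
    [0.50, 0.52, 0.49],
    [0.10, 0.90, 0.50],
]
pivotal_turns = detect_pivotal(forecasts_per_turn, threshold=0.05)
```

Because the score for a turn only needs forecasts conditioned on the conversation so far, it can be computed as the conversation happens, in line with the online setting the abstract describes.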
[NLP-46] Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning ACL2025
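[Quick Read]: This paper addresses two inherent limitations of existing Graph Retrieval Augmented Generation (GraphRAG) methods: inefficient information aggregation and rigid reasoning mechanisms. Existing methods rely on a single agent and fixed iterative patterns, which makes it hard to capture multi-level textual, structural, and degree information in graph data, and their preset reasoning schemes cannot adjust reasoning depth dynamically or achieve precise semantic correction. The key to the proposed solution is Graph Counselor, a method based on multi-agent collaboration. It combines an Adaptive Graph Information Extraction Module (AGIEM), in which Planning, Thought, and Execution Agents work together to model complex graph structures precisely and adjust information extraction strategies dynamically, with a Self-Reflection with Multiple Perspectives (SR) module that improves the accuracy and semantic consistency of reasoning results through self-reflection and backward reasoning.
Link: https://arxiv.org/abs/2506.03939
Authors: Junqi Gao,Xiang Zou,YIng Ai,Dong Li,Yichen Niu,Biqing Qi,Jianxing Liu
Affiliations: Shanghai Artificial Intelligence Laboratory; School of Mathematics, Harbin Institute of Technology; Department of Control Science and Engineering, Harbin Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ACL 2025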
Abstract:Graph Retrieval Augmented Generation (GraphRAG) effectively enhances external knowledge integration capabilities by explicitly modeling knowledge relationships, thereby improving the factual accuracy and generation quality of Large Language Models (LLMs) in specialized domains. However, existing methods suffer from two inherent limitations: 1) Inefficient Information Aggregation: They rely on a single agent and fixed iterative patterns, making it difficult to adaptively capture multi-level textual, structural, and degree information within graph data. 2) Rigid Reasoning Mechanism: They employ preset reasoning schemes, which cannot dynamically adjust reasoning depth nor achieve precise semantic correction. To overcome these limitations, we propose Graph Counselor, a GraphRAG method based on multi-agent collaboration. This method uses the Adaptive Graph Information Extraction Module (AGIEM), where Planning, Thought, and Execution Agents work together to precisely model complex graph structures and dynamically adjust information extraction strategies, addressing the challenges of multi-level dependency modeling and adaptive reasoning depth. Additionally, the Self-Reflection with Multiple Perspectives (SR) module improves the accuracy and semantic consistency of reasoning results through self-reflection and backward reasoning mechanisms. Experiments demonstrate that Graph Counselor outperforms existing methods in multiple graph reasoning tasks, exhibiting higher reasoning accuracy and generalization ability. Our code is available at this https URL.
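[NLP-47] VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
[Quick Read]: This paper addresses the weak performance of Large Language Models (LLMs) on visualization tasks such as plotting charts and diagrams, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, which makes plot generation unreliable. The proposed solution is VisCode-200K, a dataset of over 200,000 examples that combines validated plotting code from open-source repositories, paired with natural language instructions and rendered plots, with 45,000 multi-turn correction dialogues from Code-Feedback, so that models learn to revise faulty code using runtime feedback. The key is feedback-driven learning grounded in execution, which improves the generation of executable and visually accurate code.
Link: https://arxiv.org/abs/2506.03930
Authors: Yuansheng Ni,Ping Nie,Kai Zou,Xiang Yue,Wenhu Chen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: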
Abstract:Large language models (LLMs) often struggle with visualization tasks like plotting diagrams, charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
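A self-debug loop of the kind the abstract mentions (run the code, feed the runtime error back, retry) can be sketched in Python as follows. This is a generic illustration, not the VisCoder evaluation protocol: the `repair` callable stands in for a model call, `max_rounds` is an assumed parameter, and `exec` is used without any sandboxing, which would not be acceptable for untrusted model output.

```python
import traceback


def run_snippet(code):
    """Execute code in a fresh namespace; return the traceback text or None."""
    try:
        exec(code, {})
        return None
    except Exception:
        return traceback.format_exc()


def self_debug(code, repair, max_rounds=3):
    """Run the code and ask `repair` for a fix until it executes cleanly.

    `repair(code, error_text)` is a placeholder for a model that rewrites
    faulty code given the runtime feedback.
    """
    for _ in range(max_rounds):
        error = run_snippet(code)
        if error is None:
            return code, True
        code = repair(code, error)
    # Check the final revision once more after the last repair.
    return code, run_snippet(code) is None


def toy_repair(code, error_text):
    """Stand-in for a model: patch the one bug this example contains."""
    if "ZeroDivisionError" in error_text:
        return code.replace("/ 0", "/ 1")
    return code


fixed_code, succeeded = self_debug("x = 1 / 0", toy_repair)
```

Executing without an exception is only half of what the abstract asks for; checking that the rendered plot is visually correct needs a separate judgment that this sketch does not attempt.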
[NLP-48] More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning
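[Quick Read]: This paper addresses reasoning bias that Large Language Models (LLMs) exhibit in response to semantic cues, in particular how the phrasing of the input systematically shifts the direction of model predictions. The key is MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families, which exposes how semantic cues affect model reasoning. The authors find that model errors frequently reflect linguistic steering, and that chain-of-thought prompting reduces the bias, although its effectiveness depends on how free-form the reasoning format is.
Link: https://arxiv.org/abs/2506.03923
Authors: Mohammadamin Shafiei,Hamidreza Saffari,Nafise Sadat Moosavi
Affiliations: University of Milan; Politecnico di Milano; University of Sheffield
Categories: Computation and Language (cs.CL)
Comments: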
Abstract:Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words "more", "less", or "equal" systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., "a woman", "a Black person") in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.
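What "logically equivalent questions" with different framing terms look like can be shown with a small Python sketch. The templates below are hypothetical and are not the 14 prompt variants used in MathComp; the point is only that both framings share a single ground truth, so any difference in model answers between them is attributable to wording.

```python
def framed_prompts(name_a, qty_a, name_b, qty_b, unit):
    """Build two logically equivalent yes/no questions for one scenario.

    "A has more than B" holds exactly when "B has less than A" holds,
    so both questions have the same correct answer.
    """
    facts = f"{name_a} has {qty_a} {unit}. {name_b} has {qty_b} {unit}."
    return {
        "more": f"{facts} Does {name_a} have more {unit} than {name_b}?",
        "less": f"{facts} Does {name_b} have less {unit} than {name_a}?",
    }


def ground_truth(qty_a, qty_b):
    """Shared reference answer for both framings."""
    return "yes" if qty_a > qty_b else "no"


def steering_detected(answers, qty_a, qty_b):
    """Flag a scenario where the two framings received different answers."""
    expected = ground_truth(qty_a, qty_b)
    return any(answer != expected for answer in answers.values()) and \
        answers["more"] != answers["less"]


prompts = framed_prompts("Ana", 7, "Ben", 4, "apples")
reference = ground_truth(7, 4)
```

A model's replies to `prompts["more"]` and `prompts["less"]` can then be passed to `steering_detected`; a consistent model yields `False` whether it is right or wrong on both, which separates framing sensitivity from plain arithmetic error.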
[NLP-49] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
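[Quick Read]: This paper addresses a gap in current evaluation benchmarks for Multimodal Large Language Models (MLLMs) with respect to the Humanities and Social Sciences (HSS): existing benchmarks emphasize general knowledge and the vertical, step-by-step reasoning typical of STEM disciplines, and overlook the horizontal, interdisciplinary thinking and deep integration of knowledge across fields that HSS tasks require. The key contribution is HSSBench, a multilingual benchmark designed for HSS tasks, together with a data generation pipeline tailored to HSS scenarios in which domain experts and automated agents collaborate to generate and iteratively refine samples so that they better reflect the complexity of HSS tasks.
Link: https://arxiv.org/abs/2506.03922
Authors: Zhaolu Kang,Junhao Gong,Jiaxu Yan,Wanke Xia,Yian Wang,Ziwen Wang,Huaxuan Ding,Zhuo Cheng,Wenhao Cao,Zhiyuan Feng,Siqi He,Shannan Yan,Junzhe Chen,Xiaomin He,Chaoya Jiang,Wei Ye,Kaidong Yu,Xuelong Li
Affiliations: TeleAI; Peking University; Tsinghua University; Chinese Academy of Sciences; University of British Columbia; Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: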
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
[NLP-50] Compositional Generalisation for Explainable Hate Speech Detection
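[Quick Read]: This paper addresses the poor generalization of hate speech detection models beyond their training data, which has been attributed to dataset biases and to sentence-level labels that fail to teach models the underlying structure of hate speech. The authors show that models struggle even with finer-grained span-level annotations, and they build U-PLEAD, a dataset in which expressions occur with equal frequency across all contexts, to improve compositional generalization. They also introduce a new compositional generalization benchmark of about 8,000 manually validated posts. Experiments show that training on a combination of U-PLEAD and real data improves compositional generalization while achieving state-of-the-art performance on the human-sourced PLEAD.
Link: https://arxiv.org/abs/2506.03916
Authors: Agostina Calabrese,Tom Sherborne,Björn Ross,Mirella Lapata
Affiliations: School of Informatics, University of Edinburgh; Cohere
Categories: Computation and Language (cs.CL)
Comments: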
Abstract:Hate speech detection is key to online content moderation, but current models struggle to generalise beyond their training data. This has been linked to dataset biases and the use of sentence-level labels, which fail to teach models the underlying structure of hate speech. In this work, we show that even when models are trained with more fine-grained, span-level annotations (e.g., “artists” is labeled as target and “are parasites” as dehumanising comparison), they struggle to disentangle the meaning of these labels from the surrounding context. As a result, combinations of expressions that deviate from those seen during training remain particularly difficult for models to detect. We investigate whether training on a dataset where expressions occur with equal frequency across all contexts can improve generalisation. To this end, we create U-PLEAD, a dataset of ~364,000 synthetic posts, along with a novel compositional generalisation benchmark of ~8,000 manually validated posts. Training on a combination of U-PLEAD and real data improves compositional generalisation while achieving state-of-the-art performance on the human-sourced PLEAD.
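[NLP-51] When Fairness Isn't Statistical: The Limits of Machine Learning in Evaluating Legal Reasoning
[Quick Read]: This paper asks whether statistical methods can meaningfully assess fairness in legal settings, particularly in high-stakes contexts such as refugee adjudication. Although machine learning (ML) techniques are widely used to detect disparities in outcomes, legal decisions are shaped by discretion, normative complexity, and limited ground truth, which limits what statistical methods can say about fairness. The key argument is that evaluating legal fairness cannot rely on data-driven statistics alone and must also be grounded in legal reasoning and institutional context.
Link: https://arxiv.org/abs/2506.03913
Authors: Claire Barale, Michael Rovatsos, Nehal Bhuta
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint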
Abstract:Legal decisions are increasingly evaluated for fairness, consistency, and bias using machine learning (ML) techniques. In high-stakes domains like refugee adjudication, such methods are often applied to detect disparities in outcomes. Yet it remains unclear whether statistical methods can meaningfully assess fairness in legal contexts shaped by discretion, normative complexity, and limited ground truth. In this paper, we empirically evaluate three common ML approaches (feature-based analysis, semantic clustering, and predictive modeling) on a large, real-world dataset of 59,000+ Canadian refugee decisions (AsyLex). Our experiments show that these methods produce divergent and sometimes contradictory signals, that predictive modeling often depends on contextual and procedural features rather than legal features, and that semantic clustering fails to capture substantive legal reasoning. We show limitations of statistical fairness evaluation, challenge the assumption that statistical regularity equates to fairness, and argue that current computational approaches fall short of evaluating fairness in legally discretionary domains. We argue that evaluating fairness in law requires methods grounded not only in data, but in legal reasoning and institutional context.
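[NLP-52] The Harmonic Structure of Information Contours ACL2025
[Quick Read]: This paper studies why information is unevenly distributed in language: although the uniform information density (UID) hypothesis holds that speakers try to spread information evenly through a text, the information rate in real language fluctuates around a global average. Such fluctuations are usually attributed to syntactic constraints, stylistic choices, or audience design. The paper proposes an alternative view, that they may reflect an implicit linguistic pressure towards periodicity, with the information rate oscillating at regular intervals and possibly at several frequencies at once. The key to the approach is applying harmonic regression, together with a new extension called time scaling, to detect and test for periodicity in information contours.
Link: https://arxiv.org/abs/2506.03902
Authors: Eleftheria Tsipidi, Samuel Kiegeland, Franz Nowak, Tianyang Xu, Ethan Wilcox, Alex Warstadt, Ryan Cotterell, Mario Giulianelli
Affiliations: ETH Zürich; TTIC; Georgetown; UCSD
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 (main conference)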
Abstract:The uniform information density (UID) hypothesis proposes that speakers aim to distribute information evenly throughout a text, balancing production effort and listener comprehension difficulty. However, language typically does not maintain a strictly uniform information rate; instead, it fluctuates around a global average. These fluctuations are often explained by factors such as syntactic constraints, stylistic choices, or audience design. In this work, we explore an alternative perspective: that these fluctuations may be influenced by an implicit linguistic pressure towards periodicity, where the information rate oscillates at regular intervals, potentially across multiple frequencies simultaneously. We apply harmonic regression and introduce a novel extension called time scaling to detect and test for such periodicity in information contours. Analyzing texts in English, Spanish, German, Dutch, Basque, and Brazilian Portuguese, we find consistent evidence of periodic patterns in information rate. Many dominant frequencies align with discourse structure, suggesting these oscillations reflect meaningful linguistic organization. Beyond highlighting the connection between information rate and discourse structure, our approach offers a general framework for uncovering structural pressures at various levels of linguistic granularity.
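As a rough illustration of plain harmonic regression on an information contour (not the authors' implementation, and without their time-scaling extension), the sketch below fits sine and cosine regressors at a few candidate periods to a synthetic per-token surprisal series by ordinary least squares. The function names, the candidate periods, and the synthetic data are all assumptions made for this example.

```python
import math
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_harmonics(y, periods):
    """Least-squares fit of y[t] = c + sum_k (a_k sin(2*pi*t/P_k) + b_k cos(2*pi*t/P_k)).

    Returns the intercept c and the amplitude sqrt(a_k^2 + b_k^2) for each period P_k.
    """
    rows = []
    for t in range(len(y)):
        row = [1.0]
        for p in periods:
            row.append(math.sin(2 * math.pi * t / p))
            row.append(math.cos(2 * math.pi * t / p))
        rows.append(row)
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    amplitudes = {
        p: math.hypot(beta[1 + 2 * i], beta[2 + 2 * i]) for i, p in enumerate(periods)
    }
    return beta[0], amplitudes

# Synthetic contour (assumed data): mean 5 bits, one oscillation every 20 tokens, plus noise.
random.seed(0)
surprisal = [
    5.0 + 1.5 * math.sin(2 * math.pi * t / 20) + random.gauss(0.0, 0.3)
    for t in range(400)
]
intercept, amplitudes = fit_harmonics(surprisal, periods=[7, 20, 50])
dominant_period = max(amplitudes, key=amplitudes.get)
print(dominant_period)
```

On real text, the contour would come from a language model's per-token surprisal, and a significance test would be needed before reading any fitted period as evidence of periodicity.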
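[NLP-53] Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems
[Quick Read]: This paper addresses the sensitivity of Retrieval-Augmented Generation (RAG) systems to retrieval noise in real-world scenarios, which can make generated output inaccurate or misleading. Existing benchmarks fail to emulate the complex, heterogeneous noise distributions of real retrieval environments, which undermines reliable robustness assessment. The paper proposes Magic Mushroom, a benchmark that replicates "magic mushroom" noise: contexts that look relevant on the surface but mislead RAG systems. It comprises a large number of single-hop and multi-hop question-answer pairs and lets researchers flexibly configure noise combinations for their research objectives, giving highly controlled evaluation setups. The key lies in fine-grained noise categorisation and configurability, which provide a more realistic and comprehensive testbed for studying the noise robustness of RAG systems.
Link: https://arxiv.org/abs/2506.03901
Authors: Yuxin Zhang, Yan Wang, Yongrui Chen, Shenyu Zhang, Xinbang Dai, Sheng Bi, Guilin Qi
Affiliations: Southeast University; Ministry of Education
Subjects: Computation and Language (cs.CL)
Comments: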
Abstract:Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating “magic mushroom” noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at this https URL.
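[NLP-54] Pre3: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation ACL2025
[Quick Read]: This paper targets the efficiency of structured output generation (e.g., JSON) with large language models (LLMs), particularly for LR(1) grammars. Existing methods parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime overhead for context-dependent token processing and is especially inefficient under large inference batches. The key to the solution is Pre^3, which uses deterministic pushdown automata (DPDA) to speed up constrained LLM decoding: it precomputes prefix-conditioned edges to enable ahead-of-time analysis and parallel transition processing, and it transforms LR(1) transition graphs into a DPDA, removing the need for runtime path exploration. This significantly reduces time per output token (TPOT) and increases throughput.
Link: https://arxiv.org/abs/2506.03887
Authors: Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen
Affiliations: Shanghai Jiao Tong University; Beihang University; Sensetime Research
Subjects: Computation and Language (cs.CL)
Comments: Published as a conference paper at ACL 2025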
Abstract:Extensive LLM applications demand efficient structured generations, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, especially inefficient under large inference batches. To address these issues, we propose Pre^3 that exploits deterministic pushdown automata (DPDA) to optimize the constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during the preprocessing, Pre^3 enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre^3 introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre^3 can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at this https URL.
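To make the idea of deterministic transitions concrete, here is a toy sketch (not the paper's method, and not an LR(1) construction): a pushdown check over a two-bracket alphabet in which the stack top and the next token determine exactly one move, so the set of tokens a constrained decoder may emit next can be read off without exploring alternative paths. The alphabet and the names `allowed_next`, `step`, and `accepts` are assumptions made for this illustration.

```python
# Toy alphabet (assumed): two kinds of opening bracket and their closers.
CLOSER_FOR = {"[": "]", "{": "}"}

def allowed_next(stack):
    """Tokens that keep the sequence valid, given the current stack."""
    allowed = set(CLOSER_FOR)               # an opening bracket is always allowed
    if stack:
        allowed.add(CLOSER_FOR[stack[-1]])  # only the closer matching the stack top
    return allowed

def step(stack, token):
    """Apply the single possible move for `token`; return False if there is none."""
    if token not in allowed_next(stack):
        return False
    if token in CLOSER_FOR:
        stack.append(token)  # push an opener
    else:
        stack.pop()          # pop on the matching closer
    return True

def accepts(tokens):
    """Accept iff every token has a valid move and the stack ends empty."""
    stack = []
    return all(step(stack, tok) for tok in tokens) and not stack

print(accepts(list("[{}]")), accepts(list("[}")))
```

In a constrained decoder, a set like the one returned by `allowed_next` would typically be turned into a mask over the vocabulary before sampling each token.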
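[NLP-55] Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages INTERSPEECH2025
[Quick Read]: This paper addresses the problem of training text-to-speech (TTS) systems for India's many languages, most of which have no digital resources. The key to the solution is zero-shot synthesis: augmenting a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, which reduces synthesiser overhead and enables rapid adaptation. By leveraging linguistic connections across languages, the method generates intelligible and natural speech for Sanskrit, Maharashtrian and Canara Konkani, Maithili, and Kurukh.
Link: https://arxiv.org/abs/2506.03884
Authors: Utkarsh Pathak, Chandra Sai Krishna Gunda, Anusha Prakash, Keshav Agarwal, Hema A. Murthy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at INTERSPEECH 2025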
Abstract:Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, with 22 official using 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work is in the augmentation of a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.
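[NLP-56] RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing
[Quick Read]: This paper addresses the limited effectiveness of current routing methods for large language models (LLMs), which do not sufficiently explore the intrinsic connection between user queries and the characteristics of LLMs. The key to the solution is RadialRouter, a new framework built on RadialFormer, a lightweight Transformer-based backbone with a radial structure that represents the relationship between a query and the candidate LLMs; the optimal LLM is selected from RadialFormer's final states. An objective function that combines Kullback-Leibler divergence with a query-query contrastive loss further refines the pipeline and improves routing robustness.
Link: https://arxiv.org/abs/2506.03880
Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao
Affiliations: Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: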
Abstract:The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.
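[NLP-57] EuroGEST: Investigating gender stereotypes in multilingual language models
[Quick Read]: This paper addresses the evaluation of gender stereotypes in multilingual large language models (LLMs), in particular the lack of benchmarks for European languages other than English. The key to the solution is the EuroGEST dataset, which expands an existing expert-informed benchmark using translation tools, quality estimation metrics, and morphological heuristics, making it possible to measure gender-stereotypical reasoning across languages; human evaluation confirms its accuracy.
Link: https://arxiv.org/abs/2506.03867
Authors: Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch
Affiliations: University of Edinburgh; Warsaw University of Technology; Aveni
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures, 1 table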
Abstract:Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are beautiful, empathetic and neat and men are leaders, strong, tough and professional. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.
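[NLP-58] PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading
[Quick Read]: This paper investigates how social sentiment from social media can improve high-frequency trading (HFT) performance, particularly in cryptocurrency markets. The key to the solution is PulseReddit, the first dataset to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics, together with an empirical study using large language model (LLM)-based multi-agent systems (MAS) that shows social sentiment data has a positive effect on trading performance.
Link: https://arxiv.org/abs/2506.03861
Authors: Qiuhan Han, Qian Wang, Atsushi Yoshikawa, Masayuki Yamamura
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: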
Abstract:High-Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision-making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high-frequency, short-term trading. This paper introduces PulseReddit, a novel dataset that is the first to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics for short-term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)-based Multi-Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance-efficiency trade-offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.
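[NLP-59] Prompt Candidates then Distill: A Teacher-Student Framework for LLM-driven Data Annotation ACL2025
[Quick Read]: This paper addresses label errors caused by the inherent uncertainty of large language models (LLMs) in data annotation, which severely degrade data quality for downstream applications. Existing methods mostly take an aggressive strategy, prompting the LLM to determine a single gold label for each unlabeled sample, which tends to produce incorrect labels on difficult samples. The key to the solution is a candidate annotation paradigm that encourages the LLM to output all possible labels when it is uncertain, combined with CanDist, a teacher-student framework that distills the candidate annotations with a small language model (SLM) so that downstream tasks receive unique, high-quality labels.
Link: https://arxiv.org/abs/2506.03857
Authors: Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
Affiliations: Zhejiang University; University of Wisconsin Madison; William & Mary; NetEase Fuxi AI Lab
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 (Main conference)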
Abstract:Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at this https URL.
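[NLP-60] Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain INTERSPEECH2025
[Quick Read]: This paper addresses the mismatch between pretrained self-supervised speech models and the hierarchy of human speech processing: the models encode rich semantics in middle layers and poor semantics in late layers. The key to the solution is brain-tuning, that is, fine-tuning the models with human brain recordings, which improves their semantic understanding and makes them better reflect the brain's intermediate stages of speech processing.
Link: https://arxiv.org/abs/2506.03832
Authors: Omer Moussa, Mariya Toneva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
Comments: Proceedings of Interspeech 2025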
Abstract:Pretrained self-supervised speech models excel in speech tasks but do not reflect the hierarchy of human speech processing, as they encode rich semantics in middle layers and poor semantics in late layers. Recent work showed that brain-tuning (fine-tuning models using human brain recordings) improves speech models’ semantic understanding. Here, we examine how well brain-tuned models further reflect the brain’s intermediate stages of speech processing. We find that late layers of brain-tuned models substantially improve over pretrained models in their alignment with semantic language regions. Further layer-wise probing reveals that early layers remain dedicated to low-level acoustic features, while late layers become the best at complex high-level tasks. These findings show that brain-tuned models not only perform better but also exhibit a well-defined hierarchical processing going from acoustic to semantic representations, making them better model organisms for human speech processing.
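[NLP-61] Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising SIGIR2025
[Quick Read]: This paper addresses the challenge retrieval systems face in matching user queries with relevant advertisements, especially for the many long-tail queries that cannot be matched with merchant bidwords or product titles; as a result some advertisements are not recalled, which harms user experience and search efficiency. Existing query rewriting work has explored query log mining, query-bidword vector matching, and generation-based rewriting, but these methods often fail to jointly optimize the relevance and authenticity between the user's original query and its rewrite while maximizing the revenue potential of recalled ads. The proposed Multi-objective aligned Bidword Generation Model (MoBGM) comprises a discriminator, a generator, and a preference alignment module. The key is a discriminator designed to optimize three core objectives (relevance, authenticity, and platform revenue), whose feedback signal is used to train a multi-objective aligned bidword generator that optimizes the three jointly.
Link: https://arxiv.org/abs/2506.03827
Authors: Zhenhui Liu, Chunyuan Yuan, Ming Pang, Zheng Fang, Li Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo, Jingping Shao
Affiliations: JD.COM; Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by SIGIR2025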
Abstract:Retrieval systems primarily address the challenge of matching user queries with the most relevant advertisements, playing a crucial role in e-commerce search advertising. The diversity of user needs and expressions often produces massive long-tail queries that cannot be matched with merchant bidwords or product titles, which results in some advertisements not being recalled, ultimately harming user experience and search efficiency. Existing query rewriting research focuses on various methods such as query log mining, query-bidword vector matching, or generation-based rewriting. However, these methods often fail to simultaneously optimize the relevance and authenticity of the user’s original query and rewrite and maximize the revenue potential of recalled ads. In this paper, we propose a Multi-objective aligned Bidword Generation Model (MoBGM), which is composed of a discriminator, generator, and preference alignment module, to address these challenges. To simultaneously improve the relevance and authenticity of the query and rewrite and maximize the platform revenue, we design a discriminator to optimize these key objectives. Using the feedback signal of the discriminator, we train a multi-objective aligned bidword generator that aims to maximize the combined effect of the three objectives. Extensive offline and online experiments show that our proposed algorithm significantly outperforms the state of the art. After deployment, the algorithm has created huge commercial value for the platform, further verifying its feasibility and robustness.
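[NLP-62] CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents
[Quick Read]: This paper addresses the difficulty of accurately extracting metadata from diverse web sources, which stems mainly from variations in web layouts and data formats. The key to the solution is CRAWLDoc: starting from a publication's URL (such as a digital object identifier), it retrieves the landing page and all linked resources and embeds these resources, together with their anchor texts and URLs, into a unified representation, yielding a robust, layout-independent ranking of relevant documents across publishers and data formats.
Link: https://arxiv.org/abs/2506.03822
Authors: Fabian Karl, Ansgar Scherp
Affiliations: Universität Ulm
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at SCOLIA 2025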
Abstract:Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication’s URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at this https URL.
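[NLP-63] Automatic Correction of Writing Anomalies in Hausa Texts
[Quick Read]: This paper addresses writing anomalies common in Hausa texts, such as incorrect character substitutions and spacing errors, which can hinder natural language processing (NLP) applications. The key to the solution is fine-tuning transformer-based models to correct these anomalies automatically. Using a corpus gathered from several public sources, the authors build a large-scale parallel dataset of more than 450,000 noisy-clean Hausa sentence pairs by injecting synthetic noise that mimics realistic writing errors, and they adapt several multilingual and African-language-focused models to the correction task using SentencePiece tokenization.
Link: https://arxiv.org/abs/2506.03820
Authors: Ahmad Mustapha Wali, Sergiu Nisioi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: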
Abstract:Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct the anomalies by finetuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise, fine-tuned to mimic realistic writing errors. Moreover, we adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants for this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.
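For readers unfamiliar with the two error-rate metrics named in the abstract, the sketch below computes Character Error Rate (CER) and Word Error Rate (WER) under their usual definition: Levenshtein edit distance divided by the reference length. The example strings are English placeholders chosen for this illustration, not data from the paper, and the paper's own evaluation code may differ in details such as normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# Placeholder pair with one spacing error and one character substitution.
clean = "the quick fox"
noisy = "thequick f0x"
print(cer(clean, noisy), wer(clean, noisy))
```

A single missing space costs one character edit but changes two words, which is why spacing errors tend to inflate WER much more than CER.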
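[NLP-64] Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts
[Quick Read]: This paper addresses the accuracy of punctuation restoration in transcripts of spontaneous speech; current models perform poorly in the presence of disfluencies such as false starts and backtracking, which limits downstream tasks such as translation, text-to-speech, and summarization. The key to the solution is Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. It handles both clean written text and highly spontaneous spoken transcripts, surpasses the previous state of the art, and extends support from 14 to all 22 Indian languages and English.
Link: https://arxiv.org/abs/2506.03793
Authors: Sidharth Pulipaka, Sparsh Jain, Ashwin Sankar, Raj Dabre
Affiliations: Nilekani Centre at AI4Bharat; Indian Institute of Technology, Madras; Indian Institute of Technology, Bombay; Mahindra University, Hyderabad
Subjects: Computation and Language (cs.CL)
Comments: Work in Progress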
Abstract:Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks like translation, text to speech, and summarization, where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low resource NLP pipelines at scale.
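[NLP-65] Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
[Quick Read]: This paper addresses the local evaluation bias of current large language models (LLMs) used as evaluators: existing LLM-as-a-Judge methods rely mainly on individual assessments or a single round of pairwise comparisons, so the judge model cannot form a global ranking perspective. The key to the solution is Knockout Assessment, an evaluation method based on a knockout tournament with iterative pairwise comparisons that improves scoring accuracy. Experiments show that it raises Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluation, bringing LLM assessments closer to human scoring.
Link: https://arxiv.org/abs/2506.03785
Authors: Isik Baran Sandan, Tu Anh Dinh, Jan Niehues
Affiliations: Karlsruhe Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures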
Abstract:Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
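The knockout idea can be sketched as a single-elimination tournament over candidate answers. This is a minimal illustration under stated assumptions, not the paper's procedure: `judge` is a hypothetical stand-in for an LLM call that returns the preferred item of a pair, and the abstract does not say how tournament results become scores, so the win count recorded here is only one possible choice.

```python
import random

def knockout(candidates, judge, seed=0):
    """Run a single-elimination tournament over `candidates`.

    `judge(a, b)` is a hypothetical pairwise comparator (an LLM prompt in practice)
    that returns whichever of the two it prefers. Returns the overall winner and
    the number of comparisons each candidate won.
    """
    rng = random.Random(seed)
    remaining = list(candidates)
    rng.shuffle(remaining)
    wins = {c: 0 for c in candidates}
    while len(remaining) > 1:
        next_round = []
        # With an odd number of candidates, one gets a bye into the next round.
        if len(remaining) % 2 == 1:
            next_round.append(remaining.pop())
        for a, b in zip(remaining[0::2], remaining[1::2]):
            winner = judge(a, b)
            wins[winner] += 1
            next_round.append(winner)
        remaining = next_round
    return remaining[0], wins

# Toy judge (assumption): prefers the longer answer.
answers = ["ok", "a fuller answer", "short", "the most detailed answer of all", "mid-length"]
champion, wins = knockout(answers, judge=lambda a, b: a if len(a) >= len(b) else b)
print(champion)
```

A tournament over n candidates needs n - 1 judge calls, compared with n(n - 1)/2 for a full round-robin, at the price of results that depend on the bracket draw.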
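[NLP-66] Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models ACL2025
[Quick Read]: This paper addresses how to quantize large language models (LLMs) while preserving accuracy. Binary-coding quantization (BCQ) and uniform quantization (UQ) offer strong expressiveness and strong optimizability, respectively, but neither has both advantages. The proposed solution is UniQuanF (Unified Quantization with Flexible Mapping). Its key idea is to unify the flexible mapping technique of UQ with the non-uniform quantization levels of BCQ to combine expressiveness and optimizability, to optimize the parameters precisely through unified initialization and local and periodic mapping techniques, and ultimately to improve accuracy without extra deployment cost.
Link: https://arxiv.org/abs/2506.03781
Authors: Seungcheol Park, Jeongin Bae, Beomseok Kwon, Minjun Kim, Byeongwook Kim, Se Jung Kwon, U Kang, Dongsoo Lee
Affiliations: Seoul National University; NAVER Cloud
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Main Track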
Abstract:How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and non-uniform quantization levels of BCQ. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark.
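The two baseline schemes the abstract contrasts can be illustrated in a few lines. This sketch does not implement UniQuanF; it only shows, under textbook definitions, that uniform quantization places weights on evenly spaced levels while binary-coding quantization approximates them as a sum of scaled sign vectors, which yields non-uniform levels. The greedy fitting procedure, the function names, and the toy weights are assumptions made for this example.

```python
def uniform_quantize(weights, bits):
    """Round-to-nearest uniform quantization onto 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    steps = 2 ** bits - 1
    scale = (hi - lo) / steps if hi > lo else 1.0
    return [lo + scale * round((w - lo) / scale) for w in weights]

def bcq_quantize(weights, bits):
    """Greedy binary-coding quantization: w ~ sum_i alpha_i * b_i, with b_i in {-1, +1}."""
    approx = [0.0] * len(weights)
    residual = list(weights)
    for _ in range(bits):
        # For a fixed sign vector, the mean absolute residual is the best scale.
        alpha = sum(abs(r) for r in residual) / len(residual)
        signs = [1.0 if r >= 0 else -1.0 for r in residual]
        approx = [a + alpha * s for a, s in zip(approx, signs)]
        residual = [w - a for w, a in zip(weights, approx)]
    return approx

def mse(xs, ys):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

weights = [-0.9, -0.4, -0.1, 0.0, 0.05, 0.2, 0.6, 1.3]  # toy values
uq = uniform_quantize(weights, bits=2)
bcq = bcq_quantize(weights, bits=2)
print(sorted(set(uq)), sorted(set(bcq)))
```

Both outputs use at most four distinct values at 2 bits; the difference is that the uniform levels are fixed by the weight range, whereas the binary-coding levels are determined by the fitted scales.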
[NLP-67] ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations ACL2025
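[Quick read]: This paper addresses the concern that the conventional next-word prediction training paradigm may not fully capture how humans learn to think, particularly in mathematical reasoning. The key to its solution is ClozeMath, a new method built on a text-infilling task that predicts masked equations from a given solution, analogous to the cloze exercises used in human learning, which improves the performance and robustness of large language models (LLMs) on mathematical reasoning.
Link: https://arxiv.org/abs/2506.03763
Authors: Quang Hieu Pham,Thuy Duong Nguyen,Tung Pham,Anh Tuan Luu,Dat Quoc Nguyen
Affiliations: Qualcomm AI Research; Nanyang Technological University; Qualcomm Vietnam Company Limited
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Findings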
Abstract:The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
[NLP-68] AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
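[Quick read]: This paper targets the memory consumed by the Key-Value (KV) cache during inference with Large Language Models (LLMs), and in particular the bias of existing methods that evict tokens by accumulated attention score. That bias concentrates the retained tokens at the initial positions of the sequence, limiting the model's access to global context. The key to the solution is Adaptive holistic attention KV (AhaKV), which mitigates the bias by adaptively tuning the softmax scale according to the expected information entropy of the attention scores, and uses information from the value vectors to refine the adaptive score, so that crucial tokens across the global context are retained more effectively.
Link: https://arxiv.org/abs/2506.03762
Authors: Yifeng Gu,Zicong Jiang,Jianxiu Jin,Kailing Guo,Ziyang Zhang,Xiangmin Xu
Affiliations: South China University of Technology; Pazhou Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures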
Abstract:Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only due to the large number of model parameters but also because the Key-Value (KV) cache consumes a lot of memory during inference. While several works propose reducing the KV cache by evicting unnecessary tokens, these approaches rely on the accumulated attention score as the eviction score to quantify the importance of a token. We identify that the accumulated attention score is biased: in mathematical expectation, it decreases with the position of the token. As a result, the retained tokens concentrate on the initial positions, limiting the model’s access to global contextual information. To address this issue, we propose Adaptive holistic attention KV (AhaKV), which addresses the bias of the accumulated attention score by adaptively tuning the scale of softmax according to the expectation of information entropy of attention scores. To make use of the holistic attention information in the self-attention mechanism, AhaKV utilizes the information of value vectors, which is overlooked in previous works, to refine the adaptive score. We show theoretically that our method is well suited for bias reduction. We deployed AhaKV on different models with a fixed cache budget. Experiments show that AhaKV successfully mitigates bias, retains crucial tokens across global context, and achieves state-of-the-art results against other related work on several benchmark tasks.
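The position bias described in the abstract above can be seen in a toy case. The sketch below assumes causal attention in which every query spreads its weight uniformly over the keys it can see; this uniform-attention setting is an illustrative assumption, not the paper's analysis, and the function name is hypothetical.

```python
from typing import List


def accumulated_uniform_attention(n: int) -> List[float]:
    """Accumulated attention score per key position under causal,
    uniform attention (assumption: query i gives weight 1/(i+1) to
    each of keys 0..i)."""
    scores = [0.0] * n
    for i in range(n):            # query position
        weight = 1.0 / (i + 1)    # uniform over the i+1 visible keys
        for j in range(i + 1):    # key positions visible to query i
            scores[j] += weight
    return scores


scores = accumulated_uniform_attention(4)
# Key j accumulates 1/(j+1) + ... + 1/n, so earlier keys always score higher.
```

Even with no preference for any token, early positions collect more accumulated attention simply because more queries can see them, which is why an eviction rule based on this score tends to keep the start of the sequence.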
[NLP-69] Act-as-Pet: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services
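[Quick read]: This paper addresses the lack of systematic evaluation and optimization of Large Language Models (LLMs) for virtual pet companionship. Existing approaches focus only on basic pet role-playing interactions and do not comprehensively measure LLM abilities in complex emotional interaction and long-term companionship. The proposed solution is Pet-Bench, a benchmark that evaluates LLMs on self-evolution, developmental behaviors, and human interaction. Its key feature is a diverse task set (such as intelligent scheduling, memory-based dialogues, and psychological conversations) that simulates complex pet behaviors, giving a more realistic reflection of virtual pet companionship and revealing how performance relates to model size.
Link: https://arxiv.org/abs/2506.03761
Authors: Hongcheng Guo,Zheyong Xie,Shaosheng Cao,Boyang Wang,Weiting Liu,Zheyu Ye,Zhoujun Li,Zuozhu Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: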
Abstract:As interest in using Large Language Models (LLMs) for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate complex pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.
[NLP-70] PromptCanvas: Composable Prompting Workspaces Using Dynamic Widgets for Exploration and Iteration in Creative Writing
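[Quick read]: This paper addresses the limitations of traditional conversational user interfaces (UIs) in supporting creative tasks, especially in giving users control over AI-generated content and reducing cognitive load. The key to its solution is PromptCanvas, a concept that turns prompting into a composable, widget-based experience: on an infinite canvas, users generate, customize, and arrange interactive widgets, enabling visual management and flexible manipulation of multiple facets of their text.
Link: https://arxiv.org/abs/2506.03741
Authors: Rifat Mehreen Amin,Oliver Hans Kühle,Daniel Buschek,Andreas Butz
Affiliations: LMU Munich; University of Bayreuth
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: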
Abstract:We introduce PromptCanvas, a concept that transforms prompting into a composable, widget-based experience on an infinite canvas. Users can generate, customize, and arrange interactive widgets representing various facets of their text, offering greater control over AI-generated content. PromptCanvas allows widget creation through system suggestions, user prompts, or manual input, providing a flexible environment tailored to individual needs. This enables deeper engagement with the creative process. In a lab study with 18 participants, PromptCanvas outperformed a traditional conversational UI on the Creativity Support Index. Participants found that it reduced cognitive load, with lower mental demand and frustration. Qualitative feedback revealed that the visual organization of thoughts and easy iteration encouraged new perspectives and ideas. A follow-up field study (N=10) confirmed these results, showcasing the potential of dynamic, customizable interfaces in improving collaborative writing with AI.
[NLP-71] Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models ACL2025
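[Quick read]: This paper aims to automate the generation of visuals for math word problems (MWPs); creating such visual aids by hand is time-consuming and lacks systematic support. The key to the solution is the Math2Visual framework, which builds on a design space identified through interviews with math teachers and on a pre-defined visual language to depict the core mathematical relationships in MWPs accurately. The authors construct an annotated dataset of 1,903 visuals and fine-tune Text-to-Image (TTI) models on it, demonstrating improvements in educational visual generation.
Link: https://arxiv.org/abs/2506.03735
Authors: Junling Wang,Anna Rutkiewicz,April Yi Wang,Mrinmaya Sachan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Findings of the Association for Computational Linguistics: ACL 2025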
Abstract:Visuals are valuable tools for teaching math word problems (MWPs), helping young learners interpret textual descriptions into mathematical expressions before solving them. However, creating such visuals is labor-intensive and there is a lack of automated methods to support this process. In this paper, we present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from MWP text descriptions. Math2Visual leverages a pre-defined visual language and a design space grounded in interviews with math teachers, to illustrate the core mathematical relationships in MWPs. Using Math2Visual, we construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design. We further fine-tune several TTI models with our dataset, demonstrating improvements in educational visual generation. Our work establishes a new benchmark for automated generation of pedagogically meaningful visuals and offers insights into key challenges in producing multimodal educational content, such as the misrepresentation of mathematical relationships and the omission of essential visual elements.
[NLP-72] Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision
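[Quick read]: This paper addresses confidence calibration for Large Language Models (LLMs) during chain-of-thought (CoT) reasoning, that is, how to make verbalized confidence estimates more accurate and reliable. The key finding is that supervised fine-tuning with scalar confidence labels alone elicits self-verification behavior, without explicit reasoning supervision or reinforcement-learning-based rewards. After this fine-tuning, the model generates longer, self-checking responses for low-confidence queries and more concise answers for high-confidence ones, improving calibration and interpretability.
Link: https://arxiv.org/abs/2506.03723
Authors: Chaeyun Jang,Moonseok Choi,Yegon Kim,Hyungi Lee,Juho Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: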
Abstract:Uncertainty calibration is essential for the safe deployment of large language models (LLMs), particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty. Experiments on GSM8K and held-out reasoning tasks such as MATH-500 and ARC-Challenge show that our confidence-aware fine-tuning improves both calibration and accuracy, while also enhancing interpretability by aligning the model’s reasoning path with its confidence.
[NLP-73] MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition INTERSPEECH2025
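[Quick read]: This paper addresses the challenge of integrating large pre-trained speech models such as Whisper into streaming systems, in particular keeping latency low while maintaining recognition quality. The key to the solution is a prefix-to-prefix training framework: the Continuous Integrate-and-Fire mechanism establishes a quasi-monotonic alignment between continuous speech sequences and discrete text tokens, Monotonic Finite Look-ahead Attention lets each token attend to infinite left context and finite right context, and the wait-k decoding strategy simplifies decoding while keeping training and testing consistent.
Link: https://arxiv.org/abs/2506.03722
Authors: Yinfeng Xia,Huiyan Li,Chenyang Le,Manhong Wang,Yutao Sun,Xingyang Ma,Yanmin Qian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025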
Abstract:Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, allowing each token to attend to infinite left-context and finite right-context from the speech sequences. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for various streaming applications.
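The "infinite left-context and finite right-context" constraint in the abstract above can be pictured as a boolean attention mask over speech frames. The sketch below is an assumption-laden illustration: the per-token `boundaries` (the frame each token is aligned to) and the fixed `lookahead` width are hypothetical inputs, and the paper's actual mask construction may differ.

```python
from typing import List, Sequence


def lookahead_mask(boundaries: Sequence[int], num_frames: int, lookahead: int) -> List[List[bool]]:
    """Build a token-by-frame mask: token t may attend to every frame up to
    its aligned boundary frame plus `lookahead` future frames.

    boundaries[t] is a hypothetical alignment (for example, one produced by
    an integrate-and-fire style mechanism), given as a frame index.
    """
    mask = []
    for boundary in boundaries:
        limit = min(num_frames, boundary + 1 + lookahead)  # exclusive upper bound
        mask.append([frame < limit for frame in range(num_frames)])
    return mask


# Two tokens aligned to frames 1 and 3, with 2 frames of look-ahead.
mask = lookahead_mask([1, 3], num_frames=6, lookahead=2)
```

A larger `lookahead` exposes more right context per token at the price of waiting for more audio, which is the latency and quality trade-off the abstract refers to.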
[NLP-74] ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation
[Quick read]: This paper addresses hallucinations, factual inconsistencies, and lack of domain-specific expertise in automated news generation. The key to the solution is the ScoreRAG framework, which combines retrieval-augmented generation, consistency-relevance evaluation, and structured summarization in a multi-stage pipeline to improve the quality and professionalism of generated news.
Link: https://arxiv.org/abs/2506.03704
Authors: Pei-Yun Lin,Yen-lung Tsai
Affiliations: National Chengchi University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 8 figures. Code and demo available at this https URL . Submitted to arXiv for public access; journal submission planned
Abstract:This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: this https URL.
[NLP-75] AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism ICML2025
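[Quick read]: This paper targets the low decoding efficiency of Large Language Models (LLMs) in long-content generation: conventional autoregressive decoding generates tokens one at a time, and this sequential dependency prevents it from fully exploiting the parallelism of modern hardware. The key to the solution is AdaDecode, which generates tokens at intermediate layers when confidence is high and defers the remaining layer computations so they run in parallel with subsequent tokens when needed, improving hardware utilization and reducing decoding latency; a final verification step ensures the output matches standard autoregressive decoding.
Link: https://arxiv.org/abs/2506.03700
Authors: Zhepei Wei,Wei-Lin Chen,Xinyu Zhu,Yu Meng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML 2025. Code: this https URL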
Abstract:Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware’s parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary “drafter” model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens can accurately be generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token’s computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early predictions match the results of standard autoregressive decoding, preserving output parity. Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup, while guaranteeing output parity with standard autoregressive decoding.
[NLP-76] Robust Preference Optimization via Dynamic Target Margins ACL2025
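[Quick read]: This paper addresses the performance limits that degraded data quality imposes on the alignment of Large Language Models (LLMs); in particular, the effectiveness of Direct Preference Optimization (DPO) suffers markedly when the preference data is noisy. The key to the solution is γ-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level through instance-specific margin calibration, strategically prioritizing high-confidence preference pairs and suppressing potential noise from ambiguous pairs.
Link: https://arxiv.org/abs/2506.03690
Authors: Jie Sun,Junkang Wu,Jiancan Wu,Zhibo Zhu,Xingyu Lu,Jun Zhou,Lintao Ma,Xiang Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 6 figures, accepted to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL2025)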
Abstract:The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose γ-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, γ-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, γ-PO is a plug-and-play method, compatible with variants of DPO that rely on reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, γ-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, γ-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at this https URL.
[NLP-77] Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering INTERSPEECH2025
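[Quick read]: This paper addresses the challenge of domain fine-tuning pretrained automatic speech recognition (ASR) models for small organizations that lack labeled data and computational resources. The key to the solution is a robust data selection approach that filters pseudo-labels generated with Whisper (encoder-decoder) and Zipformer (transducer) models, combining several selection strategies, including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis, to extract high-quality training segments.
Link: https://arxiv.org/abs/2506.03681
Authors: Pradeep Rangappa,Andres Carofilis,Jeena Prakash,Shashi Kumar,Sergio Burdisso,Srikanth Madikeri,Esau Villatoro-Tello,Bidisha Sharma,Petr Motlicek,Kadri Hacioglu,Shankar Venkatesan,Saurabh Vyas,Andreas Stolcke
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2025, Netherlands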
Abstract:Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies – including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis – to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
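One ingredient named in the abstract above, character error rate (CER) analysis, can be sketched as an agreement filter between two pseudo-label hypotheses for the same audio segment. This is a simplified illustration and not the paper's pipeline: treating one hypothesis as the reference, the threshold value, and all function names are assumptions.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def select_segments(pairs: Sequence[Tuple[str, str]], max_cer: float) -> List[str]:
    """Keep the first hypothesis of each pair when both systems agree closely.

    pairs holds (hypothesis_a, hypothesis_b) from two ASR systems; the
    threshold max_cer is a hypothetical tuning knob.
    """
    return [a for a, b in pairs if cer(a, b) <= max_cer]


kept = select_segments([("hello world", "hello world"), ("hello", "yellow")], max_cer=0.1)
```

Segments on which independent systems disagree are more likely to carry transcription errors, so dropping them trades dataset size for label quality.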
[NLP-78] ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling
[Quick read]: This paper addresses the recognition failures that visually impaired people encounter with Visual Question Answering (VQA) systems when the text in their photos is incorrectly oriented. Existing VQA benchmarks mainly contain well-oriented text captured by sighted users and under-represent these challenges. The key to the solution is ROtated SAmpling (ROSA), a decoding strategy that improves VQA performance on text-rich images with incorrectly oriented text, outperforming greedy decoding by 11.7 absolute points in the best-performing model.
Link: https://arxiv.org/abs/2506.03665
Authors: Hernán Maina,Guido Ivetta,Mateo Lione Stuto,Julian Martin Eisenschlos,Jorge Sánchez,Luciana Benotti
Affiliations: FAMAF, Universidad Nacional de Córdoba; CONICET; Fundación Vía Libre; Mercado Libre Inc.
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.
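[NLP-79] Trustworthy Medical Question Answering: An Evaluation-Centric Survey
[Quick read]: This paper addresses insufficient trustworthiness in medical question-answering (QA) systems; as large language models (LLMs) are increasingly used in medical settings, the reliability of their answers directly affects clinical decision-making and patient safety. The key to its approach is a systematic analysis of how medical QA systems are evaluated along six core dimensions of trustworthiness (Factuality, Robustness, Fairness, Safety, Explainability, and Calibration), together with a review of benchmarks and evaluation-guided techniques, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment, that drive model improvements.
Link: https://arxiv.org/abs/2506.03659
Authors: Yinuo Wang,Robert E. Mercer,Frank Rudzicz,Sudipta Singha Roy,Pengjie Ren,Zhumin Chen,Xindi Wang
Affiliations: Shandong University; University of Western Ontario; Dalhousie University; Vector Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: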
Abstract:Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges (such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies) and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
[NLP-80] RewardAnything: Generalizable Principle-Following Reward Models
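[Quick read]: This paper addresses the poor adaptability of current Reward Models (RMs) to diverse real-world needs. Existing RMs are trained on fixed preference datasets, which aligns them rigidly to a single implicit preference distribution and makes it hard to serve tasks with different requirements, such as conciseness in one and detailed explanation in another. The key to the solution is a generalizable, principle-following reward model, RewardAnything, which understands and follows reward principles supplied dynamically in natural language and thus adapts to new principles without retraining.
Link: https://arxiv.org/abs/2506.03637
Authors: Zhuohao Yu,Jiali Zeng,Weizheng Gu,Yidong Wang,Jindong Wang,Fandong Meng,Jie Zhou,Yue Zhang,Shikun Zhang,Wei Ye
Affiliations: Peking University; WeChat AI; William & Mary; Westlake University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures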
Abstract:Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often producing biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything on a traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs with only natural language principles.
[NLP-81] Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks
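[Quick read]: This paper addresses the sharp performance drop of Large Language Models (LLMs) under input perturbations such as typographical errors or slight character-order errors; despite advances in prompting techniques, no prompting strategy explicitly mitigates the negative impact of such perturbations. The proposed solution is Robustness of Prompting (RoP), which strengthens robustness in two stages. The Error Correction stage applies diverse perturbation methods to generate adversarial examples and uses them to construct prompts that automatically correct input errors; the Guidance stage generates an optimal guidance prompt from the corrected input, steering the model toward more robust and accurate inference.
Link: https://arxiv.org/abs/2506.03627
Authors: Lin Mu,Guowei Chu,Li Ni,Lei Sang,Zhize Wu,Peiquan Jin,Yiwen Zhang
Affiliations: Anhui University; Hefei University; University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages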
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can substantially degrade their performance. Despite advances in prompting techniques, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy specifically designed to enhance the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are then used to construct prompts that automatically correct input errors. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, steering the model toward more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs’ robustness against adversarial perturbations. Notably, it maintains model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.
[NLP-82] Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales
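[Quick read]: This paper addresses the limited cultural knowledge of Large Language Models (LLMs), particularly their weak understanding of the cultures of non-English-speaking communities. The key to its approach is YokaiEval, a benchmark dataset for evaluating the cultural awareness of LLMs, consisting of 809 multiple-choice questions about Yokai in Japanese folktales. The results show that models trained with Japanese language resources outperform English-centric models, and that models with continued pretraining in Japanese, particularly those based on Llama-3, perform especially well.
Link: https://arxiv.org/abs/2506.03619
Authors: Ayuto Tsutsumi,Yuu Jinnai
Affiliations: Tokyo Metropolitan University; CyberAgent
Categories: Computation and Language (cs.CL)
Comments: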
Abstract:Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at this https URL ILab/YokaiEval.
[NLP-83] Learning to Insert [PAUSE] Tokens for Better Reasoning ACL
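[Quick read]: This paper aims to improve the reasoning ability of transformer-based Large Language Models (LLMs) with a new training method. The key to the solution is inserting [PAUSE] tokens dynamically: the method uses token log-likelihood to find the positions in a sequence where model confidence is lowest and inserts [PAUSE] tokens there, improving the model's prediction of subsequent tokens. Experiments show that it outperforms traditional fine-tuning and previous token insertion methods across multiple datasets and model sizes.
Link: https://arxiv.org/abs/2506.03616
Authors: Eunki Kim,Sangryul Kim,James Thorne
Affiliations: KAIST AI
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 5 figures, ACL Findings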
Abstract:To enhance reasoning capabilities, previous works have explored incorporating special-purpose tokens into the training process. These strategies strengthen the learning mechanism of transformer-based large language models (LLMs). Building on prior research, in which inserting dummy tokens consecutively just before reasoning steps can enhance effectiveness, we introduce a novel approach termed Dynamic Inserting Tokens Training (DIT). Our method identifies positions within sequences where model confidence is lowest according to token log-likelihood. Strategically inserting [PAUSE] tokens on these positions bolsters the model’s predictive capabilities for subsequent tokens. Experimental results across diverse datasets and models, from the 2.7B model to the 8B model, demonstrate that DIT consistently outperforms traditional fine-tuning and previous token insertion methods. With this simple yet effective method, we achieve accuracy gains of up to 4.7%p on GSM8K, 3.23%p on AQUA-RAT, and pass@1 improvements of up to 3.4%p on MBPP datasets. Our work shows a model-based, dynamic approach rather than a heuristic one, thereby broadening the scope of research in reasoning.
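The position-selection step described in the abstract above can be sketched as follows. This is an illustrative guess at the mechanics, not the authors' code: the per-token log-likelihoods are assumed to be given, placing [PAUSE] immediately before each low-confidence token is an assumption, and `k` (how many positions to pick) is a hypothetical parameter.

```python
from typing import List, Sequence


def insert_pause(tokens: Sequence[str], logprobs: Sequence[float],
                 k: int = 1, pause: str = "[PAUSE]") -> List[str]:
    """Insert a pause token before the k tokens with the lowest log-likelihood.

    logprobs[i] is the model's log-likelihood of tokens[i] given its prefix
    (assumed to be computed elsewhere with the model being fine-tuned).
    """
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    # Indices of the k least confident tokens.
    lowest = set(sorted(range(len(tokens)), key=lambda i: logprobs[i])[:k])
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i in lowest:
            out.append(pause)  # extra position before a hard-to-predict token
        out.append(tok)
    return out


augmented = insert_pause(["2", "+", "3", "=", "5"], [-0.1, -0.2, -0.3, -0.1, -2.0], k=1)
```

In this toy arithmetic sequence the answer token is the least likely one, so the pause lands right before it.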
[NLP-84] VLMs Can Aggregate Scattered Training Patches
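[Quick read]: This paper addresses a safety risk of Vision-Language Models (VLMs) when training data contains dangerous samples: if harmful images are split into benign-looking patches scattered across many training samples, the model may integrate these fragments through visual stitching and generate harmful responses at inference time. The key contribution is showing that VLMs can integrate visual information spread across multiple training samples that share the same textual description, and verifying experimentally that this ability can be exploited maliciously, letting harmful content bypass data moderation.
Link: https://arxiv.org/abs/2506.03614
Authors: Zhanhui Zhou,Lingjie Chen,Chao Yang,Chaochao Lu
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: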
Abstract:One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description “safe,” VLMs may later describe the full image or a text reference to the scene as “safe.” We define the core ability of VLMs enabling this attack as visual stitching: the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into (patch, ID) pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like “safe” or “unsafe”, demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at this https URL.
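The benign-ID experiment setup, where each (image, ID) pair is split into (patch, ID) pairs, can be sketched as below. This is only an illustration under stated assumptions: the image is a toy nested list of pixel values, and `split_into_patches`, the `grid` parameter, and the ID string are hypothetical names that do not come from the paper's code.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def split_into_patches(image: Image, image_id: str, grid: int) -> List[Tuple[Image, str]]:
    """Split an image into grid x grid patches, each paired with the same ID."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by grid")
    ph, pw = height // grid, width // grid
    pairs: List[Tuple[Image, str]] = []
    for r in range(grid):
        for c in range(grid):
            # Slice the rows of this patch, then the columns within each row.
            patch = [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
            pairs.append((patch, image_id))
    return pairs


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
pairs = split_into_patches(image, "ID-0421", grid=2)
print(len(pairs), pairs[0])
```

A larger `grid` gives a finer granularity: each patch carries less of the image, while every patch still shares the same textual label.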
[NLP-85] Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments
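Quick summary: This paper addresses the challenge of using effective Text-to-SQL methods in resource-constrained environments, in particular the capability gap between small open-source models and large closed-source models. The key is a new architecture, Auto Prompt SQL (AP-SQL), which decomposes the task into schema filtering, retrieval-augmented Text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation. In addition, it fine-tunes large language models to improve schema selection accuracy and explores the effect of prompt engineering throughout the pipeline, using Chain-of-Thought (CoT) and Graph-of-Thought (GoT) templates to substantially improve the model's reasoning for accurate SQL generation.
Link: https://arxiv.org/abs/2506.03598
Authors: Zetong Tang,Qian Ma,Di Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures, EITCE 2025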
Abstract:Using the best Text-to-SQL methods in resource-constrained environments is challenging due to their reliance on resource-intensive open-source models. This paper introduces Auto Prompt SQL (AP-SQL), a novel architecture designed to bridge the gap between resource-efficient small open-source models and the powerful capabilities of large closed-source models for Text-to-SQL translation. Our method decomposes the task into schema filtering, retrieval-augmented text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation. To improve schema selection accuracy, we fine-tune large language models. Crucially, we also explore the impact of prompt engineering throughout the process, leveraging Chain-of-Thought (CoT) and Graph-of-Thought (GoT) templates to significantly enhance the model’s reasoning for accurate SQL generation. Comprehensive evaluations on the Spider benchmarks demonstrate the effectiveness of AP-SQL.
[NLP-86] Is linguistically-motivated data augmentation worth it? ACL2025
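Quick summary: This paper asks whether data augmentation strategies improve sequence-to-sequence tasks (machine translation and interlinear glossing) for low-resource languages, and specifically compares the effectiveness of linguistically-naive and linguistically-motivated augmentation methods. The key is a systematic evaluation of both types of strategy on two low-resource languages with different morphological properties, Uspanteko and Arapaho. It finds that linguistically-motivated methods offer some advantage over naive ones, but only when the new examples they generate do not differ much from the training data distribution.
Link: https://arxiv.org/abs/2506.03593
Authors: Ray Groshan,Michael Ginn,Alexis Palmer
Affiliations: University of Colorado
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main. First two authors contributed equally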
Abstract:Data augmentation, a widely-employed technique for addressing data scarcity, involves generating synthetic data examples which are then used to augment available training data. Researchers have seen surprising success from simple methods, such as random perturbations from natural examples, where models seem to benefit even from data with nonsense words, or data that doesn’t conform to the rules of the language. A second line of research produces synthetic data that does in fact follow all linguistic constraints; these methods require some linguistic expertise and are generally more challenging to implement. No previous work has done a systematic, empirical comparison of both linguistically-naive and linguistically-motivated data augmentation strategies, leaving uncertainty about whether the additional time and effort of linguistically-motivated data augmentation work in fact yields better downstream performance. In this work, we conduct a careful and comprehensive comparison of augmentation strategies (both linguistically-naive and linguistically-motivated) for two low-resource languages with different morphological properties, Uspanteko and Arapaho. We evaluate the effectiveness of many different strategies and their combinations across two important sequence-to-sequence tasks for low-resource languages: machine translation and interlinear glossing. We find that linguistically-motivated strategies can have benefits over naive approaches, but only when the new examples they produce are not significantly unlike the training data distribution.
[NLP-87] From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
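Quick summary: This paper addresses the heavy computational burden of time- and compute-intensive natural language generation (NLG) evaluation during the training of large language models (LLMs). The key is to reformulate generative tasks as computationally cheaper natural language understanding (NLU) tasks, which allows key model capabilities to be monitored effectively while greatly reducing evaluation time. Experiments show an average reduction in evaluation time of more than 35x.
Link: https://arxiv.org/abs/2506.03592
Authors: Viktor Hangya,Fabian Küch,Darina Gold
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: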
Abstract:Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. We plan to publish our benchmark adaptions.
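To make the cost difference concrete, here is a rough sketch of the NLU-style format the abstract refers to, in which the model only scores a fixed set of answer choices instead of generating token by token. The abstract does not say how the authors build the choices, so everything here is an assumption: `pick_choice` and `toy_loglik` are hypothetical names, and the scores are invented stand-ins for a model's continuation log-likelihoods.

```python
from typing import Callable, Sequence


def pick_choice(prompt: str, choices: Sequence[str],
                loglik: Callable[[str, str], float]) -> int:
    """Return the index of the choice with the highest log-likelihood.

    Each choice needs one scoring pass; no autoregressive decoding is run.
    """
    scores = [loglik(prompt, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])


def toy_loglik(prompt: str, continuation: str) -> float:
    """Stand-in scorer with made-up values (assumption, not a real model)."""
    table = {"4": -0.2, "5": -1.9, "22": -3.1}
    return table[continuation]


choices = ["5", "4", "22"]
best = pick_choice("2 + 2 =", choices, toy_loglik)
print(choices[best])
```

With a real model, `loglik` would sum the token log-probabilities of the continuation given the prompt, which is a single forward pass per choice.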
[NLP-88] BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance
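Quick summary: This paper addresses the problem that visual-linguistic biases in datasets cause pre-trained vision-language models in text-video retrieval (TVR) systems to overlook key details. The key is the BiMa framework, which generates scene elements that characterize each video and integrates them into the video embeddings to strengthen the representation of fine-grained and salient details, and which introduces a mechanism that disentangles text features into content and bias components to debias the text.
Link: https://arxiv.org/abs/2506.03589
Authors: Huy Le,Nhat Chung,Tung Kieu,Anh Nguyen,Ngan Le
Affiliations: FPT Software AI Center, Vietnam; Aalborg University, Denmark; Pioneer Centre for AI, Denmark; University of Liverpool, UK; AICV Lab, University of Arkansas, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 14 figures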
Abstract:Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model’s bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
[NLP-89] Preface to the Special Issue of the TAL Journal on Scholarly Document Processing
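Quick summary: This paper addresses the difficulty researchers face in keeping up with and understanding new knowledge as scholarly literature grows rapidly. The key is to use large language models (LLMs) for tasks such as literature reviews, writing assistance, and interactive exploration of research, in order to extract reliable and actionable insights.
Link: https://arxiv.org/abs/2506.03587
Authors: Florian Boudin,Akiko Aizawa
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: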
Abstract:The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information. Scientific papers pose unique difficulties, with their complex language, specialized terminology, and diverse formats, requiring advanced methods to extract reliable and actionable insights. Large language models (LLMs) offer new opportunities, enabling tasks such as literature reviews, writing assistance, and interactive exploration of research. This special issue of the TAL journal highlights research addressing these challenges and, more broadly, research on natural language processing and information retrieval for scholarly and scientific documents.
[NLP-90] Automatically Suggesting Diverse Example Sentences for L2 Japanese Learners Using Pre-Trained Language Models
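Quick summary: This paper aims to provide L2 (second-language) Japanese learners with example sentences that are diverse and matched to their proficiency level, to support effective language acquisition. The core approach uses Pre-trained Language Models (PLMs) to generate or retrieve example sentences suited to learners at different levels, in two ways: as quality-scoring components for retrieving sentences from a newly curated corpus of Japanese sentences, and as direct sentence generators using zero-shot learning. The results show that, despite disagreement in the ratings of sentence quality, the retrieval approach was preferred by all evaluators, especially for beginner and advanced target proficiency, while the generative approaches scored lower on average but demonstrate the potential of PLMs to make sentence suggestion systems more adaptable.
Link: https://arxiv.org/abs/2506.03580
Authors: Enrico Benedetti,Akiko Aizawa,Florian Boudin
Affiliations: National Institute of Informatics, Japan; JFLI, CNRS, Nantes University, France
Categories: Computation and Language (cs.CL)
Comments: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)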
Abstract:Providing example sentences that are diverse and aligned with learners’ proficiency levels is essential for fostering effective language acquisition. This study examines the use of Pre-trained Language Models (PLMs) to produce example sentences targeting L2 Japanese learners. We utilize PLMs in two ways: as quality scoring components in a retrieval system that draws from a newly curated corpus of Japanese sentences, and as direct sentence generators using zero-shot learning. We evaluate the quality of sentences by considering multiple aspects such as difficulty, diversity, and naturalness, with a panel of raters consisting of learners of Japanese, native speakers – and GPT-4. Our findings suggest that there is inherent disagreement among participants on the ratings of sentence qualities, except for difficulty. Despite that, the retrieval approach was preferred by all evaluators, especially for beginner and advanced target proficiency, while the generative approaches received lower scores on average. Even so, our experiments highlight the potential for using PLMs to enhance the adaptability of sentence suggestion systems and therefore improve the language learning journey.
[NLP-91] KG-BiLM: Knowledge Graph Embedding via Bidirectional Language Models
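Quick summary: This paper addresses how to effectively unify symbolic knowledge graphs (KGs) with language models (LMs) for richer semantic understanding. Existing approaches typically prioritize either graph structure or textual semantics and fail to capture global KG connectivity, nuanced linguistic context, and discriminative reasoning semantics at the same time. The key is the KG-BiLM framework, which fuses KG structure with the semantic expressiveness of generative transformers through three core components: Bidirectional Knowledge Attention, Knowledge-Masked Prediction, and Contrastive Graph Semantic Aggregation. It performs well on link prediction, with its effectiveness validated especially on large-scale graphs with complex multi-hop relations.
Link: https://arxiv.org/abs/2506.03576
Authors: Zirui Chen,Xin Wang,Zhao Li,Wenbin Guo,Dongxiao He
Affiliations: Tianjin University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: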
Abstract:Recent advances in knowledge representation learning (KRL) highlight the urgent necessity to unify symbolic knowledge graphs (KGs) with language models (LMs) for richer semantic understanding. However, existing approaches typically prioritize either graph structure or textual semantics, leaving a gap: a unified framework that simultaneously captures global KG connectivity, nuanced linguistic context, and discriminative reasoning semantics. To bridge this gap, we introduce KG-BiLM, a bidirectional LM framework that fuses structural cues from KGs with the semantic expressiveness of generative transformers. KG-BiLM incorporates three key components: (i) Bidirectional Knowledge Attention, which removes the causal mask to enable full interaction among all tokens and entities; (ii) Knowledge-Masked Prediction, which encourages the model to leverage both local semantic contexts and global graph connectivity; and (iii) Contrastive Graph Semantic Aggregation, which preserves KG structure via contrastive alignment of sampled sub-graph representations. Extensive experiments on standard benchmarks demonstrate that KG-BiLM outperforms strong baselines in link prediction, especially on large-scale graphs with complex multi-hop relations - validating its effectiveness in unifying structural information and textual semantics.
[NLP-92] Exchange of Perspective Prompting Enhances Reasoning in Large Language Models
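Quick summary: This paper addresses the problem that the performance of large language models (LLMs) on natural language processing (NLP) tasks is limited by their inherent comprehension of the problem. The key is a new framework, Exchange-of-Perspective (EoP), which exchanges perspectives across different definitions of a problem to break the fixed mindset imposed by any particular formulation, thereby improving overall performance.
Link: https://arxiv.org/abs/2506.03573
Authors: Lin Sun,Can Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: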
Abstract:Large language models (LLMs) have made significant advancements in addressing diverse natural language processing (NLP) tasks. However, their performance is often limited by their inherent comprehension of problems. To address this limitation, we propose Exchange-of-Perspective (EoP), a novel framework designed to exchange perspectives across different definitions of a problem, so that it can break the fixed mindset from any particular formulation of the question. We conducted extensive and comprehensive experiments on 8 benchmarks. The results show that EoP can significantly improve performance. For instance, compared to the non-commutative baseline PHP, with GPT-3.5-Turbo and EoP, we observe a 3.6% improvement on AQuA (60.6% to 64.2%), while GPT-4-powered EoP demonstrates a 7.7% overall accuracy enhancement on Math (53.9% to 61.6%) and a 3.5% improvement on OlympiadBench Maths (43.5% to 47.0%) when using Qwen-2.5-72b.
[NLP-93] FreePRM: Training Process Reward Models Without Ground Truth Process Labels
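Quick summary: This paper addresses the reliance on expensive, hard-to-obtain step-level labels when training Process Reward Models (PRMs). The key is FreePRM, a weakly supervised framework that generates pseudo step-level labels from the correctness of the final outcome and uses Buffer Probability to reduce the effect of noise in the pseudo labels, enabling high-performing PRM training without ground-truth step-level labels.
Link: https://arxiv.org/abs/2506.03570
Authors: Lin Sun,Chuang Liu,Xiaofeng Ma,Tao Yang,Weijia Lu,Ning Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: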
Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of the final outcome, and then employs Buffer Probability to eliminate the impact of noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.
[NLP-94] MiMo-VL Technical Report
Quick summary: This paper targets performance bottlenecks in multimodal vision-language understanding and reasoning, in particular improving model performance on general visual understanding and cross-modal reasoning tasks. The key is four-stage pre-training (2.4 trillion tokens) combined with Mixed On-policy Reinforcement Learning (MORL), which integrates diverse reward signals. The work also highlights the importance of incorporating high-quality reasoning data with long Chain-of-Thought into the pre-training stages, and reports the benefits of mixed RL despite the challenges of simultaneous multi-domain optimization.
Link: https://arxiv.org/abs/2506.03569
Authors: Xiaomi LLM-Core Team:Zihao Yue,Zhenru Lin,Yifan Song,Weikun Wang,Shuhuai Ren,Shuhao Gu,Shicheng Li,Peidian Li,Liang Zhao,Lei Li,Kainan Bao,Hao Tian,Hailin Zhang,Gang Wang,Dawei Zhu,Cici,Chenhong He,Bowen Ye,Bowen Shen,Zihan Zhang,Zihan Jiang,Zhixian Zheng,Zhichao Song,Zhenbo Luo,Yue Yu,Yudong Wang,Yuanyuan Tian,Yu Tu,Yihan Yan,Yi Huang,Xu Wang,Xinzhe Xu,Xingchen Song,Xing Zhang,Xing Yong,Xin Zhang,Xiangwei Deng,Wenyu Yang,Wenhan Ma,Weiwei Lv,Weiji Zhuang,Wei Liu,Sirui Deng,Shuo Liu,Shimao Chen,Shihua Yu,Shaohui Liu,Shande Wang,Rui Ma,Qiantong Wang,Peng Wang,Nuo Chen,Menghang Zhu,Kangyang Zhou,Kang Zhou,Kai Fang,Jun Shi,Jinhao Dong,Jiebao Xiao,Jiaming Xu,Huaqiu Liu,Hongshen Xu,Heng Qu,Haochen Zhao,Hanglong Lv,Guoan Wang,Duo Zhang,Dong Zhang,Di Zhang,Chong Ma,Chang Liu,Can Cai,Bingquan Xia
Affiliations: Xiaomi
Categories: Computation and Language (cs.CL)
Comments: 32 pages
Abstract:We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at this https URL.
[NLP-95] POSS: Position Specialist Generates Better Draft for Speculative Decoding
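Quick summary: This paper addresses the degradation in draft-token prediction quality at later positions during speculative decoding of large language models (LLMs), which is caused by error accumulation in the features generated by the draft model. The key is Position Specialists (PosS): multiple position-specialized draft layers, each generating tokens at its assigned position(s). This narrows the range of draft-model feature deviation that each specialist has to handle and markedly improves the token acceptance rate at later positions.
Link: https://arxiv.org/abs/2506.03566
Authors: Langlin Huang,Chengsong Huang,Jixuan Leng,Di Huang,Jiaxin Huang
Affiliations: Washington University in St. Louis; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: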
Abstract:Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at this https URL.
[NLP-96] ConsistentChat: Building Skeleton-Guided Consistent Dialogues for Large Language Models from Scratch
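Quick summary: This paper addresses the problem that current instruction data synthesis methods focus on single-turn instructions and neglect coherence across turns, which leads to context drift and lower task completion rates in long conversations. The key is a skeleton-guided multi-turn dialogue generation framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It has two stages, intent modeling and skeleton generation, which keep the dialogue coherent and goal-oriented.
Link: https://arxiv.org/abs/2506.03558
Authors: Jiawei Chen,Xinyan Guan,Qianhao Yuan,Guozhao Mo,Weixiang Zhou,Yaojie Lu,Hongyu Lin,Ben He,Le Sun,Xianpei Han
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: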
Abstract:Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.
[NLP-97] BPO: Revisiting Preference Modeling in Direct Preference Optimization
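Quick summary: This paper addresses the Degraded Chosen Responses (DCR) problem that arises when Direct Preference Optimization (DPO) is used to align large language models (LLMs) with human preferences: DPO neglects absolute reward magnitudes, which lowers the likelihood of chosen responses and raises the risk of generating out-of-distribution responses. The key is Balanced Preference Optimization (BPO), which dynamically balances the optimization of chosen and rejected responses through two core components, a balanced reward margin and a gap adaptor, resolving the DCR problem without adding extra constraints to the loss function.
Link: https://arxiv.org/abs/2506.03557
Authors: Lin Sun,Chuang Liu,Peng Liu,Bingyang Li,Weijia Lu,Ning Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: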
Abstract:Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO’s DCR issue, without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
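The DCR issue follows from the form of the standard DPO loss, which depends only on the difference between the chosen and rejected log-ratios. The sketch below shows the standard per-pair DPO loss, not BPO's modification (the abstract does not spell that out). The function name `dpo_loss` and the numeric values are illustrative assumptions; each log-ratio stands for log π_θ(y|x) − log π_ref(y|x).

```python
import math


def dpo_loss(chosen_logratio: float, rejected_logratio: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * (chosen - rejected)))."""
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Case A: the chosen response became more likely than under the reference.
loss_a = dpo_loss(chosen_logratio=0.5, rejected_logratio=-0.5)
# Case B: the chosen response became much LESS likely, but the rejected one
# dropped by the same amount, so the margin (and the loss) is unchanged.
loss_b = dpo_loss(chosen_logratio=-4.5, rejected_logratio=-5.5)
print(loss_a, loss_b)
```

Both cases yield the same loss even though case B has sharply reduced the likelihood of the chosen response, which is the behavior the abstract calls Degraded Chosen Responses.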
[NLP-98] Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement ACL2025
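Quick summary: This paper addresses the problem that large language models (LLMs) are hard to deploy widely because of their high computational demands, while also improving the performance of smaller models on knowledge-intensive and complex reasoning tasks. The key is the Debate and Reflect (DR) framework, which runs multi-turn debates between smaller models and stronger teacher models to obtain actionable feedback (such as error analysis and corrective strategies) that guides the student model. It also introduces Tree-structured Direct Preference Optimization (T-DPO) to make efficient use of the debate logs by organizing the interactions into a hierarchical format for training.
Link: https://arxiv.org/abs/2506.03541
Authors: Xiaofeng Zhou,Heyan Huang,Lizi Liao
Affiliations: Beijing Institute of Technology; Singapore Management University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures. The camera-ready paper for Findings of ACL 2025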
Abstract:Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques–such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection–struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (DR) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree-structured Direct Preference Optimization (T-DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.
[NLP-99] Go-Browse: Training Web Agents with Structured Exploration
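Quick summary: This paper addresses digital agents' limited understanding of their environment; for example, a web browsing agent may get lost on unfamiliar websites, unsure which pages it must visit to reach its goal. The key is Go-Browse, which automatically collects diverse and realistic web agent data through structured exploration of web environments. Its core idea is to frame data collection as a graph search, so that information is reused across exploration episodes and exploration becomes more efficient.
Link: https://arxiv.org/abs/2506.03533
Authors: Apurva Gandhi,Graham Neubig
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: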
Abstract:One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.
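Framing exploration as a graph search can be illustrated with a plain breadth-first traversal over a toy site map. This is only a sketch of the general idea, not the Go-Browse algorithm: the `explore` function, the `site` dictionary, and the URLs are hypothetical, and a real system would propose and attempt tasks on each page instead of just recording the visit.

```python
from collections import deque
from typing import Dict, List


def explore(site: Dict[str, List[str]], start: str, max_pages: int) -> List[str]:
    """Breadth-first exploration; each discovered page is expanded at most once."""
    visited: List[str] = []
    seen = {start}            # pages already discovered, shared across the whole search
    frontier = deque([start])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)   # placeholder for collecting trajectories on this page
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy site map: page -> outgoing links (made-up URLs).
site = {
    "/": ["/shop", "/forum"],
    "/shop": ["/shop/cart", "/"],
    "/forum": ["/forum/new", "/shop"],
}
order = explore(site, "/", max_pages=10)
print(order)
```

Because the frontier and the set of seen pages persist across steps, a page discovered from one starting point is never re-explored from another, which is the kind of reuse the abstract attributes to the graph-search framing.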
[NLP-100] Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
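Quick summary: This paper addresses the limited ability of existing Chain-of-Thought (CoT) video understanding methods to adapt to domain-specific skills (such as event detection, spatial relation understanding, and emotion understanding). The key is the Video-Skill-CoT (Video-SKoT) framework, which automatically constructs and uses skill-aware CoT supervision for domain-adaptive video reasoning. First, it builds skill-based CoT annotations: it extracts domain-relevant reasoning skills from training questions, clusters them into a shared skill taxonomy, and generates a tailored multi-step CoT rationale for each video-question pair. Second, it introduces a skill-specific expert learning framework in which each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters, improving generalization across video domains.
Link: https://arxiv.org/abs/2506.03525
Authors: Daeun Lee,Jaehong Yoon,Jaemin Cho,Mohit Bansal
Affiliations: University of North Carolina at Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project website: this https URL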
Abstract:Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) over various video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervisions for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationale tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses on comparing different CoT annotation pipelines and learned skills over multiple video domains.
[NLP-101] Seed-Coder: Let the Code Model Curate Data for Itself
Quick summary: This paper addresses the poor scalability, subjective bias, and high maintenance cost that come from relying on human effort to build code data for large language model (LLM) pretraining. The key is Seed-Coder, whose model-centric data pipeline relies mainly on LLMs to score and filter code data, minimizing human involvement and improving the efficiency and quality of data construction.
Link: https://arxiv.org/abs/2506.03524
Authors: Yuyu Zhang,Jing Su,Yifan Sun,Chenguang Xi,Xia Xiao,Shen Zheng,Anxiang Zhang,Kaibo Liu,Daoguang Zan,Tao Sun,Jinhua Zhu,Shulin Xin,Dong Huang,Yetao Bai,Lixin Dong,Chao Li,Jianchong Chen,Hanzhi Zhou,Yifan Huang,Guanghan Ning,Xierui Song,Jiaze Chen,Siyao Liu,Kai Shen,Liang Xiang,Yonghui Wu
Affiliations: ByteDance
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
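[NLP-102] TokAlign: Efficient Vocabulary Adaptation via Token Alignment ACL2025
[Quick Read]: This paper addresses the drop in training and generation performance that large language models (LLMs) suffer in new domains or languages because of inefficient tokenizers and vocabulary mismatch, as well as the obstacle this mismatch poses to token-level distillation between models. The key to its solution is TokAlign, which replaces an LLM's vocabulary from a token co-occurrence perspective and then transfers token-level knowledge between models. It first aligns the source vocabulary with the target vocabulary by learning a one-to-one mapping matrix over token IDs, then rearranges and progressively fine-tunes the model parameters (including embeddings), which significantly improves multilingual text compression rates and vocabulary initialization.
Link: https://arxiv.org/abs/2506.03523
Authors: Chong Li,Jiajun Zhang,Chengqing Zong
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems; Institute of Automation, CAS; School of Artificial Intelligence, University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025, our codes and models are available at this https URL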
Abstract: Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer will slow down the training and generation of LLM. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs like token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of LLM from the token co-occurrences view, and further transfer the token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from $3.4\,\text{e}^{2}$ of strong baseline methods to $1.2\,\text{e}^{2}$ after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost (+4.4% than sentence-level distillation) the base model, costing only 235M tokens.
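The rearrangement step described in this abstract can be pictured with a small sketch. It assumes a one-to-one alignment from target token IDs to source token IDs is already available; the function name `rearrange_embeddings`, the `target_to_source` dictionary, and the mean-embedding fallback for unaligned tokens are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def rearrange_embeddings(
    source_emb: List[List[float]],
    target_to_source: Dict[int, int],
    target_vocab_size: int,
) -> List[List[float]]:
    """Build a target-vocabulary embedding table by copying aligned source rows.

    target_to_source is a hypothetical, already-learned alignment from target
    token IDs to source token IDs. Target IDs without an alignment fall back to
    the mean source embedding (an assumption of this sketch).
    """
    dim = len(source_emb[0])
    # Mean embedding, used only for target tokens that have no aligned source token.
    mean = [sum(row[d] for row in source_emb) / len(source_emb) for d in range(dim)]
    target_emb = []
    for target_id in range(target_vocab_size):
        source_id = target_to_source.get(target_id)
        if source_id is None:
            target_emb.append(list(mean))
        else:
            target_emb.append(list(source_emb[source_id]))
    return target_emb
```

After such an initialization, the new table would still need the progressive fine-tuning the abstract mentions; the sketch only covers the row rearrangement.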
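[NLP-103] An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals
[Quick Read]: This paper addresses the difficulty deep reinforcement learning (DRL) has in balancing exploration and exploitation in task-oriented dialogue systems, caused by high-dimensional state and action spaces, which often leads to local optima or poor convergence. The key to its solution is to combine the global search capability of evolutionary algorithms (EAs) with the local optimization of DRL. To further improve EA search efficiency, the paper proposes an Elite Individual Injection (EII) mechanism that adaptively introduces the best-performing individuals to speed up evolution.
Link: https://arxiv.org/abs/2506.03519
Authors: Yangyang Zhao,Ben Niu,Libo Qin,Shihan Wang
Affiliations: Changsha University of Science and Technology; Central South University; Utrecht University
Subjects: Computation and Language (cs.CL)
Comments: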
Abstract:Deep Reinforcement Learning (DRL) is widely used in task-oriented dialogue systems to optimize dialogue policy, but it struggles to balance exploration and exploitation due to the high dimensionality of state and action spaces. This challenge often results in local optima or poor convergence. Evolutionary Algorithms (EAs) have been proven to effectively explore the solution space of neural networks by maintaining population diversity. Inspired by this, we innovatively combine the global search capabilities of EA with the local optimization of DRL to achieve a balance between exploration and exploitation. Nevertheless, the inherent flexibility of natural language in dialogue tasks complicates this direct integration, leading to prolonged evolutionary times. Thus, we further propose an elite individual injection mechanism to enhance EA’s search efficiency by adaptively introducing best-performing individuals into the population. Experiments across four datasets show that our approach significantly improves the balance between exploration and exploitation, boosting performance. Moreover, the effectiveness of the EII mechanism in reducing exploration time has been demonstrated, achieving an efficient integration of EA and DRL on task-oriented dialogue policy tasks.
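As a rough illustration of injecting an elite individual into an evolutionary population, the sketch below replaces the weakest member with the elite whenever the elite scores higher. The function `inject_elite`, the generic `fitness` callable, and the replace-the-worst rule are assumptions made here; the abstract does not specify how the adaptive injection is actually carried out.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def inject_elite(
    population: List[T],
    fitness: Callable[[T], float],
    elite: T,
) -> List[T]:
    """Return a new population in which the elite replaces the weakest member.

    The replacement only happens when the elite is strictly fitter than the
    weakest individual (an assumed rule, chosen for simplicity).
    """
    scores = [fitness(individual) for individual in population]
    worst = min(range(len(population)), key=scores.__getitem__)
    new_population = list(population)
    if fitness(elite) > scores[worst]:
        new_population[worst] = elite
    return new_population
```

In an evolutionary RL setting the elite would typically be the best policy found so far, for example the DRL agent, and `fitness` would be its average dialogue return; both mappings are illustrative assumptions.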
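[NLP-104] Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information IJCAI2025
[Quick Read]: This paper aims to speed up the slow inference of large language models (LLMs) without sacrificing accuracy. Existing sublayer pruning algorithms choose sublayers to prune naively and ignore the differing characteristics of each sublayer, which limits their accuracy. The key to the proposed solution is SPRINT (Sublayer PRuning wIth LateNcy and Tunability Information), which selects the sublayer to prune by jointly considering the latency reduction after pruning and the tunability of sublayers, then iteratively prunes redundant sublayers while quickly tuning the parameters of the remaining ones, achieving the best accuracy-speedup trade-off.
Link: https://arxiv.org/abs/2506.03510
Authors: Seungcheol Park,Sojin Lee,Jongjin Kim,Jinsik Lee,Hyunjik Jo,U Kang
Affiliations: Seoul National University; LG AI Research
Subjects: Computation and Language (cs.CL)
Comments: IJCAI 2025 Main Track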
Abstract: How can we accelerate large language models (LLMs) without sacrificing accuracy? The slow inference speed of LLMs prevents us from benefiting from their remarkable performance in diverse applications. This is mainly because numerous sublayers are stacked together in LLMs. Sublayer pruning compresses and expedites LLMs by removing unnecessary sublayers. However, existing sublayer pruning algorithms are limited in accuracy since they naively select sublayers to prune, overlooking the different characteristics of each sublayer. In this paper, we propose SPRINT (Sublayer PRuning wIth LateNcy and Tunability Information), an accurate sublayer pruning method for LLMs. SPRINT accurately selects a target sublayer to prune by considering 1) the amount of latency reduction after pruning and 2) the tunability of sublayers. SPRINT iteratively prunes redundant sublayers and swiftly tunes the parameters of remaining sublayers. Experiments show that SPRINT achieves the best accuracy-speedup trade-off, exhibiting up to 23.88%p higher accuracy on zero-shot commonsense reasoning benchmarks compared to existing pruning algorithms.
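[NLP-105] Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing IJCNN2025
[Quick Read]: This paper addresses the difficulty of detecting the degree of human involvement in generative-AI content: when text is produced through human-machine collaboration, conventional binary detection cannot reflect how much a human contributed. The key to its solution is to use BERTScore as a metric of human involvement and to train a multi-task RoBERTa-based regressor on a token classification task, so that different levels of human involvement can be assessed more precisely.
Link: https://arxiv.org/abs/2506.03501
Authors: Yuchen Guo,Zhicheng Dou,Huy H. Nguyen,Ching-Chun Chang,Saku Sugawara,Isao Echizen
Affiliations: The University of Tokyo; National Institute of Informatics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: IJCNN2025 accepted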
Abstract:Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at this https URL
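For readers unfamiliar with BERTScore, the following sketch shows its core greedy-matching computation: each token is matched to its most similar counterpart by cosine similarity, giving precision, recall, and F1. The toy vectors stand in for contextual embeddings that a real setup would obtain from a pretrained encoder, and the helper names are hypothetical. How the paper turns this score into an involvement label is not shown here.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_match_scores(
    candidate: List[Vector], reference: List[Vector]
) -> Tuple[float, float, float]:
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A score near 1 means the two texts are close in embedding space, and lower scores mean they differ more; reading that as "more" or "less" human involvement depends on which texts are compared, which is an assumption left open in this sketch.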
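[NLP-106] Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing
[Quick Read]: This paper addresses the failure of existing knowledge editing (KE) methods to generalize to new scenarios in the medical domain; the core issue is that current KE methods achieve only superficial memorization of the injected information and do not lead the model to truly understand and reason over medical knowledge. To overcome this limitation, the paper proposes Self-Generated Rationale Editing (SGR-Edit), which uses model-generated rationales as the editing target, thereby uncovering the underlying reasoning process and significantly improving on existing KE methods.
Link: https://arxiv.org/abs/2506.03490
Authors: Shigeng Chen,Linhao Luo,Zhangchi Qiu,Yanan Cao,Carl Yang,Shirui Pan
Affiliations: Griffith University; Chinese Academy of Sciences; Emory University
Subjects: Computation and Language (cs.CL)
Comments: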
Abstract: Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite the effectiveness of KE methods on general-domain benchmarks, their applicability to the complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.
[NLP-107] EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding ACL2025
[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in acquiring the ability to handle downstream tasks in data-scarce scenarios, where high-quality annotated data is lacking. The key to its solution is a new method, EpiCoDe, which enhances a fine-tuned model through model extrapolation and then applies contrastive decoding, comparing the logit scores of the extrapolated model and the original fine-tuned model to further reduce prediction errors, improving performance without extra training.
Link: https://arxiv.org/abs/2506.03489
Authors: Mingxu Tao,Jie Hu,Mingchuan Yang,Yunhuai Liu,Dongyan Zhao,Yansong Feng
Affiliations: Wangxuan Institute of Computer Technology, Peking University; Research Institute of China Telecom; School of Computer Science, Peking University; Beijing Institute of Big Data Research; National Key Laboratory of General Artificial Intelligence, BIGAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Findings
Abstract:The remarkable performance of Large language models (LLMs) relies heavily on the availability of abundant high-quality training data. However, the high cost of acquiring annotated data often prevents models from obtaining capabilities to tackle downstream tasks. In this paper, we introduce a novel method, EpiCoDe that boosts model performance in data-scarcity scenarios without extra training. We first employ model extrapolation to enhance a finetuned model with its inferior version, and then adopt contrastive decoding to further reduce predicted errors, by comparing the logit scores given by the extrapolated and the vanilla finetuned model. Experiments across three tasks over four different LLMs show that EpiCoDe consistently outperforms existing methods with significant and robust improvement. We also propose a new theoretical framework to reveal the mechanism behind contrastive decoding in data-scarcity scenarios, which further helps us better understand the effectiveness of EpiCoDe.
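The two ingredients named in this abstract can be sketched on flat parameter and logit lists. The linear extrapolation rule, the `(1 + beta) * expert - beta * amateur` contrastive form, and the hyperparameters `alpha` and `beta` are common formulations assumed here for illustration; the paper's exact equations may differ.

```python
from typing import List


def extrapolate(theta_ft: List[float], theta_weak: List[float], alpha: float) -> List[float]:
    """Move the fine-tuned weights further along the direction (fine-tuned - weaker)."""
    return [f + alpha * (f - w) for f, w in zip(theta_ft, theta_weak)]


def contrastive_scores(
    logits_extrapolated: List[float], logits_finetuned: List[float], beta: float
) -> List[float]:
    """Amplify what the extrapolated model prefers relative to the fine-tuned model."""
    return [
        (1.0 + beta) * e - beta * f
        for e, f in zip(logits_extrapolated, logits_finetuned)
    ]


def pick_token(scores: List[float]) -> int:
    """Greedy choice: index of the highest contrastive score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

In this reading the extrapolated model plays the "expert" role and the vanilla fine-tuned model the "amateur" role at decoding time, so no gradient step is needed beyond the fine-tuning already done.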
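[NLP-108] ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
[Quick Read]: This paper addresses the limited effectiveness of small language models (SLMs) in document reranking, which stems from their difficulty understanding task prompts without fine-tuning. The key to its solution is ProRank, a two-stage training approach: the first stage performs prompt warmup with the reinforcement learning algorithm GRPO, steering SLMs to understand task prompts and produce more accurate coarse-grained binary relevance scores; the second stage continues fine-tuning the SLMs with fine-grained score learning, without introducing additional layers, to further improve reranking quality.
Link: https://arxiv.org/abs/2506.03487
Authors: Xianming Li,Aamir Shakir,Rui Huang,Julius Lipp,Jing Li
Affiliations: Mixedbread AI; The Hong Kong Polytechnic University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: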
Abstract:Reranking is fundamental to information retrieval and retrieval-augmented generation, with recent Large Language Models (LLMs) significantly advancing reranking quality. While recent advances with LLMs have significantly improved document reranking quality, current approaches primarily rely on large-scale LLMs (7B parameters) through zero-shot prompting, presenting high computational costs. Small Language Models (SLMs) offer a promising alternative because of their efficiency, but our preliminary quantitative analysis reveals they struggle with understanding task prompts without fine-tuning. This limits their effectiveness for document reranking tasks. To address this issue, we introduce a novel two-stage training approach, ProRank, for SLM-based document reranking. First, we propose a prompt warmup stage using reinforcement learning GRPO to steer SLMs to understand task prompts and generate more accurate coarse-grained binary relevance scores for document reranking. Then, we continuously fine-tune the SLMs with a fine-grained score learning stage without introducing additional layers to further improve the reranking quality. Comprehensive experimental results demonstrate that the proposed ProRank consistently outperforms both the most advanced open-source and proprietary reranking models. Notably, our lightweight ProRank-0.5B model even surpasses the powerful 32B LLM reranking model on the BEIR benchmark, establishing that properly trained SLMs can achieve superior document reranking performance while maintaining computational efficiency.
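[NLP-109] Explainable AI: XAI-Guided Context-Aware Data Augmentation
[Quick Read]: This paper addresses the limited robustness and generalization of AI models for low-resource languages caused by scarce labeled data, along with the drawbacks of conventional data augmentation: introducing noise, causing semantic drift, disrupting contextual coherence, lacking control, and leading to overfitting. The key to its solution is the XAI-Guided Context-Aware Data Augmentation framework, which uses explainable AI (XAI) techniques to modify less critical features while preserving most task-relevant features, and refines the augmented data through an iterative feedback loop driven by explainability insights and model performance gains.
Link: https://arxiv.org/abs/2506.03484
Authors: Melkamu Abay Mersha,Mesay Gemeda Yigezu,Atnafu Lambebo Tonja,Hassan Shakil,Samer Iskander,Olga Kolesnikova,Jugal Kalita
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: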
Abstract:Explainable AI (XAI) has emerged as a powerful tool for improving the performance of AI models, going beyond providing model transparency and interpretability. The scarcity of labeled data remains a fundamental challenge in developing robust and generalizable AI models, particularly for low-resource languages. Conventional data augmentation techniques introduce noise, cause semantic drift, disrupt contextual coherence, lack control, and lead to overfitting. To address these challenges, we propose XAI-Guided Context-Aware Data Augmentation. This novel framework leverages XAI techniques to modify less critical features while selectively preserving most task-relevant features. Our approach integrates an iterative feedback loop, which refines augmented data over multiple augmentation cycles based on explainability-driven insights and the model performance gain. Our experimental results demonstrate that XAI-SR-BT and XAI-PR-BT improve the accuracy of models on hate speech and sentiment analysis tasks by 6.6% and 8.1%, respectively, compared to the baseline, using the Amharic dataset with the XLM-R model. XAI-SR-BT and XAI-PR-BT outperform existing augmentation techniques by 4.8% and 5%, respectively, on the same dataset and model. Overall, XAI-SR-BT and XAI-PR-BT consistently outperform both baseline and conventional augmentation techniques across all tasks and models. This study provides a more controlled, interpretable, and context-aware solution to data augmentation, addressing critical limitations of existing augmentation techniques and offering a new paradigm shift for leveraging XAI techniques to enhance AI model training.
[NLP-110] APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training ACL2025
[Quick Read]: This paper addresses the degradation of general capabilities that domain-specific fine-tuning can cause in large language models (LLMs). The key to its solution is APT (Weakness Case Acquisition and Iterative Preference Training), which trains on self-generated dis-preferred weakness data (bad cases and similar cases): only samples on which the model makes errors, plus a small set of similar samples retrieved for this purpose. This minimizes interference with the model's existing knowledge and effectively preserves its general capabilities.
Link: https://arxiv.org/abs/2506.03483
Authors: Jun Rao,Zepeng Lin,Xuebo Liu,Xiaopeng Ke,Lian Lian,Dong Jin,Shengjun Cheng,Jun Yu,Min Zhang
Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Huawei Cloud Computing Technologies Co., Ltd.
Subjects: Computation and Language (cs.CL)
Comments: ACL2025 Findings
Abstract:Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance with self-generated dis-preferred weakness data (bad cases and similar cases). APT uniquely focuses on training the model using only those samples where errors occur, alongside a small, similar set of samples retrieved for this purpose. This targeted training minimizes interference with the model’s existing knowledge base, effectively retaining generic capabilities. Experimental results on the LLama-2 and Mistral-V0.3 models across various benchmarks demonstrate that APT ensures no reduction in generic capacity and achieves superior performance on downstream tasks compared to various existing methods. This validates our method as an effective strategy for enhancing domain-specific capabilities without sacrificing the model’s broader applicability.
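[NLP-111] Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer's Disease Detection
[Quick Read]: This paper addresses the challenge of analyzing linguistic abnormalities in patient-generated text for early diagnosis of Alzheimer's Disease (AD), specifically AD detection with large language models (LLMs) acting as health assistants. Conventional similarity-based in-context learning (ICL) selection performs poorly on this task, mainly because of the task's inherent complexity. The key to the proposed solution is Delta-KNN, a new demonstration selection strategy that uses a delta score to assess the relative gain of each training example and combines it with a KNN-based retriever that dynamically selects the optimal "representatives", improving ICL performance. Experiments show that Delta-KNN outperforms existing baselines across multiple AD detection datasets and open-source LLMs, reaching a new state of the art with the Llama-3.1 model.
Link: https://arxiv.org/abs/2506.03476
Authors: Chuyuan Li,Raymond Li,Thalia S. Field,Giuseppe Carenini
Affiliations: The University of British Columbia
Subjects: Computation and Language (cs.CL)
Comments: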
Abstract:Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that leads to dementia, and early intervention can greatly benefit from analyzing linguistic abnormalities. In this work, we explore the potential of Large Language Models (LLMs) as health assistants for AD diagnosis from patient-generated text using in-context learning (ICL), where tasks are defined through a few input-output examples. Empirical results reveal that conventional ICL methods, such as similarity-based selection, perform poorly for AD diagnosis, likely due to the inherent complexity of this task. To address this, we introduce Delta-KNN, a novel demonstration selection strategy that enhances ICL performance. Our method leverages a delta score to assess the relative gains of each training example, coupled with a KNN-based retriever that dynamically selects optimal “representatives” for a given input. Experiments on two AD detection datasets across three open-source LLMs demonstrate that Delta-KNN consistently outperforms existing ICL baselines. Notably, when using the Llama-3.1 model, our approach achieves new state-of-the-art results, surpassing even supervised classifiers.
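One plausible reading of combining a delta score with a KNN retriever is sketched below: first retrieve the k training examples closest to the query in embedding space, then keep the ones with the highest precomputed delta score. The function `select_demonstrations`, the precomputed `delta_scores` list, and the two-step ordering are assumptions for illustration; the paper's actual selection rule may combine the two signals differently.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_demonstrations(
    query_vec: Vector,
    train_vecs: List[Vector],
    delta_scores: List[float],
    k: int,
    n_demos: int,
) -> List[int]:
    """Return indices of n_demos training examples to use as demonstrations.

    delta_scores[i] is a hypothetical precomputed gain, e.g. how much example i
    improved the model's predictions when used as a demonstration.
    """
    # Step 1: KNN retrieval by cosine similarity to the query embedding.
    by_similarity = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    neighbors = by_similarity[:k]
    # Step 2: among the neighbors, prefer examples with the largest delta score.
    return sorted(neighbors, key=lambda i: delta_scores[i], reverse=True)[:n_demos]
```

The point of the second step is that the most similar example is not always the most helpful one, which matches the abstract's observation that pure similarity-based selection performs poorly on this task.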
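[NLP-112] Culture Matters in Toxic Language Detection in Persian ACL2025
[Quick Read]: This paper addresses toxic language detection in Persian, with the goal of fostering safer online environments and limiting the spread of harmful content. The key to its solution is a comparison of several methods, including fine-tuning, data enrichment, zero-shot learning, few-shot learning, and cross-lingual transfer learning, with particular emphasis on how cultural context affects transfer learning: languages from culturally similar countries transfer better, while languages from culturally distant countries yield smaller gains.
Link: https://arxiv.org/abs/2506.03458
Authors: Zahra Bokaei,Walid Magdy,Bonnie Webber
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 (Main Track)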
Abstract:Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content. While toxic language detection has been under-explored in Persian, the current work compares different methods for this task, including fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning. What is especially compelling is the impact of cultural context on transfer learning for this task: We show that the language of a country with cultural similarities to Persian yields better results in transfer learning. Conversely, the improvement is lower when the language comes from a culturally distinct country. Warning: This paper contains examples of toxic language that may disturb some readers. These examples are included for the purpose of research on toxic detection.
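[NLP-113] Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior
[Quick Read]: This paper addresses a new bottleneck that emerges as hypothesis generation is automated: hypothesis assessment, that is, automatically judging which of a large number of statistical relationships (such as correlations, trends, and causal links) are novel, non-trivial, or worthy of expert attention. The key to its solution is to use the broad knowledge encoded in large language models (LLMs) to construct a prior distribution over the correlation of a variable pair. Specifically, it proposes the Logit-based Calibrated Prior, which transforms the model's raw output logits into a calibrated, continuous predictive distribution, so that an observed correlation can be checked against expectations and its novelty judged.
Link: https://arxiv.org/abs/2506.03444
Authors: Yue Gong,Raul Castro Fernandez
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under Review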
Abstract: As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs’ weights to derive a prior distribution over the correlation value of a variable pair. If an LLM’s prior expects the correlation value observed, then such correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model’s raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs and it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
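To make the idea of turning raw logits into a prior over correlation values concrete, here is a simplified discrete sketch. It assumes the LLM emits one logit per correlation bin and that calibration is a single temperature; the bin centers, the temperature parameter, and the surprise measure are all illustrative assumptions, and the paper's prior is continuous rather than binned.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert raw logits into probabilities; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def prior_mean(bin_centers: List[float], probs: List[float]) -> float:
    """Expected correlation value under the binned prior."""
    return sum(c * p for c, p in zip(bin_centers, probs))


def surprise(observed: float, bin_centers: List[float], probs: List[float]) -> float:
    """Negative log-probability of the bin closest to the observed correlation."""
    nearest = min(range(len(bin_centers)), key=lambda i: abs(bin_centers[i] - observed))
    return -math.log(probs[nearest])
```

Under this sketch, an observed correlation that falls in a low-probability bin gets a high surprise value and would be ranked as more novel, which mirrors the abstract's statement that an expected correlation "is not surprising, and vice versa".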
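[NLP-114] Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models
[Quick Read]: This paper addresses how large language models (LLMs) acquire and store factual knowledge, with the aim of improving their interpretability and reliability. The key to its solution is to track how the roles of attention heads and feed-forward networks (FFNs) in the OLMo-7B model evolve during pre-training, classify these components, and analyze their stability and transitions, revealing how knowledge representation in LLMs changes over time.
Link: https://arxiv.org/abs/2506.03434
Authors: Ahmad Dawar Hakimi,Ali Modarressi,Philipp Wicke,Hinrich Schütze
Affiliations: LMU Munich; Munich Center for Machine Learning
Subjects: Computation and Language (cs.CL)
Comments: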
Abstract:Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its attention heads and feed forward networks (FFNs) over the course of pre-training. We classify these components into four roles: general, entity, relation-answer, and fact-answer specific, and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, attention heads display the highest turnover. We also present evidence that FFNs remain more stable throughout training. Furthermore, our probing experiments reveal that location-based relations converge to high accuracy earlier in training than name-based relations, highlighting how task complexity shapes acquisition dynamics. These insights offer a mechanistic view of knowledge formation in LLMs.
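[NLP-115] Adaptive Task Vectors for Large Language Models
[Quick Read]: This paper addresses the limitations of in-context learning (ICL), including sensitivity to demonstration order, context length constraints, and computational inefficiency. The key to its solution is Adaptive Task Vectors (ATV), which dynamically generates a task vector conditioned on each input query instead of relying on a fixed demonstration set and its task vector. ATV uses a small language model to generate task vectors and transforms them to match the architecture of the target large language model (LLM), guiding its output generation and improving generalization to unseen tasks.
Link: https://arxiv.org/abs/2506.03426
Authors: Joonseong Kang,Soojeong Lee,Subeen Park,Sumin Park,Taero Kim,Jihee Kim,Ryunyi Lee,Kyungwoo Song
Affiliations: Yonsei University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: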
Abstract:In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors (ATV), a simple and effective framework that dynamically generates task vectors conditioned on each input query. ATV employs a small language model to generate task vectors, which are then transformed to match the target LLM’s architecture and applied to guide its output generation. In contrast to ICL and previous vector-based approaches, which rely on fixed demonstration sets and their corresponding vectors, ATV dynamically generates task vectors tailored to each specific input query and task. Consequently, ATV demonstrates strong performance and generalization capabilities, even for unseen tasks. Furthermore, we provide a theoretical analysis indicating that ATV is expressively equivalent to LoRA under equal rank budgets and more expressive than Prefix-Tuning, thereby offering formal support for its representational advantage.
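[NLP-116] DistRAG: Towards Distance-Based Spatial Reasoning in LLMs
[Quick Read]: This paper addresses the weak spatial reasoning of large language models (LLMs), in particular their lack of reliable reasoning about distances. The key to its solution is a new method, DistRAG, which encodes the geodesic distances between cities and towns in a graph and retrieves a context subgraph relevant to the question, giving the LLM access to spatial information not explicitly learned during training so that it can answer distance-based reasoning questions.
Link: https://arxiv.org/abs/2506.03424
Authors: Nicole R Schneider,Nandini Ramachandran,Kent O’Sullivan,Hanan Samet
Affiliations: University of Maryland College Park; University of Sydney
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: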
Abstract: Many real world tasks where Large Language Models (LLMs) can be used require spatial reasoning, like Point of Interest (POI) recommendation and itinerary planning. However, on their own LLMs lack reliable spatial reasoning capabilities, especially about distances. To address this problem, we develop a novel approach, DistRAG, that enables an LLM to retrieve relevant spatial information not explicitly learned during training. Our method encodes the geodesic distances between cities and towns in a graph and retrieves a context subgraph relevant to the question. Using this technique, our method enables an LLM to answer distance-based reasoning questions that it otherwise cannot answer. Given the vast array of possible places an LLM could be asked about, DistRAG offers a flexible first step towards providing a rudimentary ‘world model’ to complement the linguistic knowledge held in LLMs.
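The graph-plus-retrieval idea in this abstract can be sketched in a few functions. The haversine formula below gives the great-circle distance on a spherical Earth, used here as a stand-in for a true geodesic distance; the helper names, the 6371 km radius, and the name-matching retrieval rule are assumptions of this sketch rather than details taken from the paper.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def build_distance_graph(
    places: Dict[str, Tuple[float, float]]
) -> Dict[Tuple[str, str], float]:
    """Fully connected graph: one edge per unordered pair, weighted by distance."""
    names = sorted(places)
    graph = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            graph[(a, b)] = haversine_km(*places[a], *places[b])
    return graph


def retrieve_subgraph(
    graph: Dict[Tuple[str, str], float], question: str
) -> Dict[Tuple[str, str], float]:
    """Keep only edges whose two endpoints are both mentioned in the question."""
    text = question.lower()
    return {
        edge: dist
        for edge, dist in graph.items()
        if edge[0].lower() in text and edge[1].lower() in text
    }
```

The retrieved edges could then be serialized into the prompt, for example as "Berlin to Paris: 878 km", so the LLM reads the distance from context instead of recalling it; that prompt format is likewise only an illustration.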
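[NLP-117] Trajectory Prediction Meets Large Language Models: A Survey
[Quick Read]: This paper addresses how to integrate language-driven techniques into trajectory prediction so that autonomous systems can better perceive, model, and predict trajectories. The key to its solution is to exploit the semantic understanding and reasoning capabilities of generative AI through several language-model-related paradigms (such as language modeling, scene understanding, data generation, and reasoning and interpretability) to improve the accuracy and interpretability of trajectory prediction.
Link: https://arxiv.org/abs/2506.03408
Authors: Yi Xu,Ruining Yang,Yitian Zhang,Yizhou Wang,Jianglin Lu,Mingyuan Zhang,Lili Su,Yun Fu
Affiliations: Northeastern University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, GitHub: this https URL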
Abstract:Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.
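[NLP-118] Comparison of different Unique hard attention transformer models by the formal languages they can recognize
[Quick Read]: This paper examines the ability of unique hard attention transformer encoders (UHATs) to recognize formal languages, focusing on how they behave under different attention mechanisms. The key to its approach is to distinguish masked vs. non-masked, finite vs. infinite image, and general vs. bilinear attention score functions, and to analyze the theoretical limits of these models through a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
Link: https://arxiv.org/abs/2506.03370
Authors: Leonid Ryvkin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: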
Abstract:This note is a survey of various results on the capabilities of unique hard attention transformers encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
[NLP-119] A Multimodal Multilingual and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation
[Quick Read]: This paper addresses rapid, fine-grained disaster damage assessment, which is essential for emergency response but hard to achieve because of limited ground sensors and delays in official reporting. The key to its solution is a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that uses multimodal large language models (MLLMs) to integrate image and text signals, assessing disaster impacts effectively and showing a strong correlation with ground-truth seismic data.
Link: https://arxiv.org/abs/2506.03360
Authors: Zihui Ma,Lingyao Li,Juan Li,Wenyue Hua,Jingxiao Liu,Qingyuan Feng,Yuki Miura
Affiliations: New York University; University of South Florida; Google LLC; University of California, Santa Barbara; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Rapid, fine-grained disaster damage assessment is essential for effective emergency response, yet remains challenging due to limited ground sensors and delays in official reporting. Social media provides a rich, real-time source of human-centric observations, but its multimodal and unstructured nature presents challenges for traditional analytical methods. In this study, we propose a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that leverages multimodal large language models (MLLMs) to assess disaster impacts. We evaluate three foundation models across two major earthquake events using both macro- and micro-level analyses. Results show that MLLMs effectively integrate image-text signals and demonstrate a strong correlation with ground-truth seismic data. However, performance varies with language, epicentral distance, and input modality. This work highlights the potential of MLLMs for disaster assessment and provides a foundation for future research in applying MLLMs to real-time crisis contexts. The code and data are released at: this https URL
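[NLP-120] Ask a Local: Detecting Hallucinations With Specialized Model Divergence
[Quick read]: This paper addresses hallucinations in large language models (LLMs), where a model generates plausible but factually incorrect information. The key to the solution is a novel hallucination detection method called “Ask a Local”, which exploits the intuition that specialized models exhibit greater surprise when they encounter domain-specific inaccuracies, and computes the divergence between the perplexity distributions of language-specialized models to identify potentially hallucinated spans. The method is particularly well suited to multilingual settings, since it scales naturally to multiple languages without adaptation, external data sources, or training, and it uses computationally efficient models for scalability.
Link: https://arxiv.org/abs/2506.03357
Authors: Aldan Creo,Héctor Cerezo-Costas,Pedro Alonso-Doval,Maximiliano Hormazábal-Lagos
Affiliations: Fundación Centro Tecnolóxico de Telecomunicacións de Galicia (GRADIANT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Supplementary materials: this https URL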
Abstract:Hallucinations in large language models (LLMs) - instances where models generate plausible but factually incorrect information - present a significant challenge for AI. We introduce “Ask a Local”, a novel hallucination detection method exploiting the intuition that specialized models exhibit greater surprise when encountering domain-specific inaccuracies. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our method is particularly well-suited for a multilingual context, as it naturally scales to multiple languages without the need for adaptation, relying on external data sources, or performing training. Moreover, we select computationally efficient models, providing a scalable solution that can be applied to a wide range of languages and domains. Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages, with Intersection-over-Union (IoU) scores around 0.3 and comparable Spearman correlation values. Our model shows particularly strong performance on Italian and Catalan, with IoU scores of 0.42 and 0.38, respectively, while maintaining cross-lingual effectiveness without language-specific adaptations. We release our code and architecture to facilitate further research in multilingual hallucination detection.
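To make the idea of comparing how "surprised" two models are more concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes per-token log-probabilities are already available from two hypothetical models (a language-specialized one and a general one), replaces the paper's divergence between perplexity distributions with a simple per-token surprisal gap, and uses an arbitrary threshold. The token-level IoU helper mirrors the kind of span-overlap score the abstract reports.

```python
def flag_spans(logp_specialist, logp_generalist, threshold=1.0):
    """Return (start, end) token ranges where the specialist is notably more surprised.

    Inputs are per-token natural-log probabilities from two hypothetical models
    scoring the same token sequence. `threshold` is an illustrative value.
    """
    if len(logp_specialist) != len(logp_generalist):
        raise ValueError("both models must score the same tokens")
    spans, start = [], None
    for i, (lp_s, lp_g) in enumerate(zip(logp_specialist, logp_generalist)):
        gap = lp_g - lp_s  # surprisal(specialist) - surprisal(generalist), in nats
        if gap > threshold:
            if start is None:
                start = i  # open a new flagged span
        elif start is not None:
            spans.append((start, i))  # close the span (end is exclusive)
            start = None
    if start is not None:
        spans.append((start, len(logp_specialist)))
    return spans


def span_iou(pred_spans, gold_spans):
    """Token-level Intersection-over-Union between two lists of (start, end) spans."""
    pred = {i for s, e in pred_spans for i in range(s, e)}
    gold = {i for s, e in gold_spans for i in range(s, e)}
    if not pred and not gold:
        return 1.0  # nothing to flag and nothing flagged
    return len(pred & gold) / len(pred | gold)


# Toy scores: the specialist is far more surprised by tokens 1 and 2.
specialist = [-0.5, -4.0, -5.0, -0.3]
generalist = [-0.6, -1.0, -1.5, -0.4]
predicted = flag_spans(specialist, generalist)
print(predicted, span_iou(predicted, [(2, 4)]))
```

In practice the scores would come from real models and the threshold would need tuning per language; both are left open here.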
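[NLP-121] Cross-Platform Violence Detection on Social Media: A Dataset and Analysis
[Quick read]: This paper addresses the detection and understanding of violent threats on social media platforms, where the core challenge is the lack of cross-platform, high-quality, finely annotated data. The key to the solution is a cross-platform dataset of 30,000 posts hand-coded for violent threats and sub-types of violence (such as political and sexual violence), together with a machine learning analysis that checks how well the dataset generalizes across platforms. The results show that high classification accuracy is achieved even though the data come from different platforms and use different coding criteria, which has implications for content-classification strategies and for understanding violent content across social media.
Link: https://arxiv.org/abs/2506.03312
Authors: Celia Chen,Scotty Beland,Ingo Burghardt,Jill Byczek,William J. Conway,Eric Cotugno,Sadaf Davre,Megan Fletcher,Rajesh Kumar Gnanasekaran,Kristin Hamilton,Marilyn Harbert,Jordan Heustis,Tanaya Jha,Emily Klein,Hayden Kramer,Alex Leitch,Jessica Perkins,Casi Sherman,Celia Sterrn,Logan Stevens,Rebecca Zarrella,Jennifer Golbeck
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: In Proceedings of the 17th ACM Web Science Conference (WebSci '25). 9 pages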
Abstract:Violent threats remain a significant problem across social media platforms. Useful, high-quality data facilitates research into the understanding and detection of malicious content, including violence. In this paper, we introduce a cross-platform dataset of 30,000 posts hand-coded for violent threats and sub-types of violence, including political and sexual violence. To evaluate the signal present in this dataset, we perform a machine learning analysis with an existing dataset of violent comments from YouTube. We find that, despite originating from different platforms and using different coding criteria, we achieve high classification accuracy both by training on one dataset and testing on the other, and in a merged dataset condition. These results have implications for content-classification strategies and for understanding violent content across social media.
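[NLP-122] The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing ACL
[Quick read]: This paper addresses the conflicting results in evaluations of the literary quality of AI-generated versus human-authored texts, asking whether the divergence stems from intrinsic quality differences in the texts or from differences in how readers interpret and value literature. The key to the solution is to analyze readers' preference feature vectors in a shared “preference space”, quantifying how well text features align with reader preferences and revealing the subjective nature of literary quality evaluation.
Link: https://arxiv.org/abs/2506.03310
Authors: Guillermo Marco,Julio Gonzalo,Víctor Fresno
Affiliations: UNED Research Group in NLP and IR (nlp.uned.es)
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Camera-ready version, 14 pages, 3 figures. Accepted to Findings of the Association for Computational Linguistics (ACL) 2025. Code and data: this https URL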
Abstract:Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length…); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared “preference space”. Reader vectors cluster into two profiles: ‘surface-focused readers’ (mainly non-experts), who prioritize readability and textual richness; and ‘holistic readers’ (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader’s preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.
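[NLP-123] Hopscotch: Discovering and Skipping Redundancies in Language Models
[Quick read]: This paper addresses the redundancy of attention blocks in modern causal language models: not all attention blocks contribute equally to every task. The key to the solution is Hopscotch, a method that identifies and skips the attention blocks contributing least to a task, and introduces lightweight trainable scaling parameters that adjust the outputs of the remaining layers, preserving output quality and mitigating the hidden-state distribution shift caused by removing attention blocks.
Link: https://arxiv.org/abs/2506.03303
Authors: Mustafa Eyceoz,Nikhil Shivakumar Nayak,Hao Wang,Ligong Han,Akash Srivastava
Affiliations: Red Hat AI Innovation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 9 tables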
Abstract:Modern causal language models stack many attention blocks to improve performance, but not all blocks are necessary for every task. We propose Hopscotch, a simple yet effective method that identifies and skips attention blocks with least contributions to a task and adapts to preserve output quality. Hopscotch jointly optimizes which blocks to skip and how to scale the outputs of the remaining layers. By introducing lightweight, trainable scaling parameters to attention and MLP blocks, it mitigates distribution shifts in hidden states caused by removing attention blocks. Hopscotch does not modify model weights or require access to pretraining or instruction-tuning data, and is compatible with existing model compression techniques. When applied to Llama-3.1-8B and Qwen2.5-7B, Hopscotch achieves less than a 2% drop in performance even after skipping four attention blocks.
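The forward pass described in this abstract can be illustrated with a toy Python sketch. It is an assumption-laden simplification, not the paper's code: the hidden state is a single float, the attention and MLP sub-blocks are stand-in functions, layer normalization is omitted, and the names `run_blocks`, `skip`, `attn_scale`, and `mlp_scale` are hypothetical. It only shows where a skipped attention block drops out of the residual stream and where per-block scaling factors would act.

```python
def run_blocks(x, blocks, skip, attn_scale, mlp_scale):
    """Run a toy residual stack, skipping attention in the blocks listed in `skip`.

    blocks: list of (attn_fn, mlp_fn) pairs acting on a float hidden state.
    attn_scale / mlp_scale: per-block scaling factors (trainable in the paper,
    plain constants in this sketch).
    """
    for i, (attn, mlp) in enumerate(blocks):
        if i not in skip:
            x = x + attn_scale[i] * attn(x)  # scaled attention residual
        x = x + mlp_scale[i] * mlp(x)        # scaled MLP residual
    return x


# Two identical toy blocks: attention adds 10% of the state, the MLP adds 20%.
toy_blocks = [(lambda h: 0.1 * h, lambda h: 0.2 * h)] * 2
ones = [1.0, 1.0]

full = run_blocks(1.0, toy_blocks, skip=set(), attn_scale=ones, mlp_scale=ones)
skipped = run_blocks(1.0, toy_blocks, skip={1}, attn_scale=ones, mlp_scale=ones)
# Raising the MLP scale of block 1 partially compensates for the missing attention.
rescaled = run_blocks(1.0, toy_blocks, skip={1}, attn_scale=ones, mlp_scale=[1.0, 1.5])
print(full, skipped, rescaled)
```

The sketch hand-picks the compensating scale; in the method described above, which blocks to skip and how to scale the rest are optimized jointly.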
[NLP-124] From Instructions to ODRL Usage Policies: An Ontology Guided Approach VLDB2024 VLDB
[Quick read]: This paper addresses how to automatically generate usage policies that conform to the W3C Open Digital Rights Language (ODRL) from natural language instructions. The key to the solution is to use large language models (such as GPT-4) with the ODRL ontology and its documentation as a central part of the prompt, and to guide policy generation with a curated version of the ontology documentation, enabling end-to-end construction of a Knowledge Graph (KG).
Link: https://arxiv.org/abs/2506.03301
Authors: Daham M. Mustafa,Abhishek Nadgeri,Diego Collarana,Benedikt T. Arnold,Christoph Quix,Christoph Lange,Stefan Decker
Affiliations: Fraunhofer FIT; RWTH Aachen University; Universidad Privada Boliviana
Categories: Computation and Language (cs.CL)
Comments: The paper is accepted at LLM+KG: International Workshop on Data Management Opportunities in Unifying Large Language Models + Knowledge Graphs, VLDB 2024, August 26, 2024, Guangzhou, China. this https URL
Abstract:This study presents an approach that uses large language models such as GPT-4 to generate usage policies in the W3C Open Digital Rights Language ODRL automatically from natural language instructions. Our approach uses the ODRL ontology and its documentation as a central part of the prompt. Our research hypothesis is that a curated version of existing ontology documentation will better guide policy generation. We present various heuristics for adapting the ODRL ontology and its documentation to guide an end-to-end KG construction process. We evaluate our approach in the context of dataspaces, i.e., distributed infrastructures for trustworthy data exchange between multiple participating organizations for the cultural domain. We created a benchmark consisting of 12 use cases of varying complexity. Our evaluation shows excellent results with up to 91.95% accuracy in the resulting knowledge graph.
[NLP-125] Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
[Quick read]: This paper asks how to efficiently unleash the reasoning potential of strong base large language models (LLMs). Traditional approaches such as reinforcement learning (RL) are effective but costly and unstable. The key to the proposed solution is Critique Fine-Tuning (CFT) on only one problem: teacher models write detailed critiques of diverse model-generated solutions to a single problem, the critiques form a CFT dataset, and the model is fine-tuned on it. Experiments show that the method significantly improves performance on a variety of reasoning tasks with little compute.
Link: https://arxiv.org/abs/2506.03295
Authors: Yubo Wang,Ping Nie,Kai Zou,Lijun Wu,Wenhu Chen
Affiliations: University of Waterloo; Vector Institute; Netmind.AI; Shanghai AI Lab
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models’ reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT show an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass the results from RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
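[NLP-126] HyperSteer: Activation Steering at Scale with Hypernetworks
[Quick read]: This paper addresses how to effectively generate steering vectors for controlling text generation in language models (LMs). Among existing approaches, unsupervised dictionary learning methods can produce many steering vectors but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks; supervised methods are targeted and effective but require additional data collection and training for each extra steering vector. The proposed solution is HyperSteer, whose key idea is a hypernetwork-based architecture trained end-to-end to generate steering vectors conditioned on natural language steering prompts and the internals of the steered LM.
Link: https://arxiv.org/abs/2506.03292
Authors: Jiuding Sun,Sidharth Baskaran,Zhengxuan Wu,Michael Sklar,Christopher Potts,Atticus Geiger
Affiliations: Stanford University; Pr(Ai)2R Group; Georgia Institute of Technology; Confirm Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: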
Abstract:Steering language models (LMs) by modifying internal activations is a popular approach for controlling text generation. Unsupervised dictionary learning methods, e.g., sparse autoencoders, can be scaled to produce many steering vectors, but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks. In contrast, supervised methods for constructing steering vectors are targeted and effective, but require more data collection and training for each additional steering vector produced. In this work, we introduce HyperSteer, a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts and the internals of the steered LM. In our evaluations, we show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods, even on steering prompts never seen during training. Moreover, HyperSteer performs on par with steering-via-prompting.
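[NLP-127] FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes
[Quick read]: This paper addresses the evaluation of how well large language models (LLMs) reason about and understand complex, domain-specific scenarios in Industry 4.0. Traditional QA benchmarks do not fully cover multi-aspect reasoning over failure modes, sensor data, and the relationships between them, so the paper proposes FailureSensorIQ, a Multi-Choice Question-Answering (MCQA) benchmarking system for assessing LLMs more comprehensively on industrial assets. The key to the solution is to combine data-driven and domain-driven approaches: an expert-curated MCQA dataset, a benchmark built from non-textual data, and an LLM-based feature selection tool that improve the identification of key contributors and useful patterns.
Link: https://arxiv.org/abs/2506.03278
Authors: Christodoulos Constantinides,Dhaval Patel,Shuxin Lin,Claudio Guerrero,Sunil Dagajirao Patil,Jayant Kalagnanam
Affiliations: IBM TJ Watson Research Center; IBM
Categories: Computation and Language (cs.CL)
Comments: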
Abstract:We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the Industrial knowledge of over a dozen LLMs (including GPT-4, Llama, and Mistral) on FailureSensorIQ from different lens using Perturbation-Uncertainty-Complexity analysis, Expert Evaluation study, Asset-Specific Knowledge Gap analysis, ReAct agent using external knowledge-bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance that is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at this https URL.
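[NLP-128] A conclusive remark on linguistic theorizing and language modeling
[Quick read]: This paper is the author's response to the replies received to a target paper in linguistics. The key is to summarize and analyze those replies in order to deliver a final remark on the original paper. By systematically organizing the different viewpoints, it aims to provide a fuller understanding of, and reference for, the related academic debate.
Link: https://arxiv.org/abs/2506.03268
Authors: Cristiano Chesi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: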
Abstract:This is the final remark on the replies received to my target paper in the Italian Journal of Linguistics
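[NLP-129] Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems
[Quick read]: This paper addresses the automation of disease annotation in radiology reports, specifically evaluating the effectiveness of large language models (LLMs) for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. The key to the solution is to compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs, with zero-shot prompting and external validation, in order to determine how well LLMs generalize across organ systems and how accurately they label.
Link: https://arxiv.org/abs/2506.03259
Authors: Michael E. Garcia-Alcoser,Mobina GhojoghNejad,Fakrul Islam Tushar,David Kim,Kyle J. Lafata,Geoffrey D. Rubin,Joseph Y. Lo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 10 figures, to be submitted in Radiology: Artificial Intelligence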
Abstract:Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 CT reports from 29,540 patients, with 1,789 CAP reports manually annotated across three organ systems. External validation was conducted using the CT-RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen’s Kappa and micro/macro-averaged F1 scores. Results: In 12,197 Duke CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement (κ median: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT-RATE dataset (lungs/pleura only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for lung atelectasis. Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.
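For readers unfamiliar with the two metrics named in this abstract, the following Python sketch computes Cohen's kappa between two annotators and a macro-averaged F1 over several binary disease labels. The label names and toy data are made up for illustration and are not drawn from the study; the functions follow the textbook definitions only.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single class
    return (observed - expected) / (1.0 - expected)


def binary_f1(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(true_by_label, pred_by_label):
    """Unweighted mean of per-label F1 scores."""
    scores = [binary_f1(true_by_label[k], pred_by_label[k]) for k in true_by_label]
    return sum(scores) / len(scores)


# Hypothetical labels for four reports and two findings.
truth = {"atelectasis": [1, 1, 0, 0], "effusion": [0, 1, 0, 1]}
model = {"atelectasis": [1, 0, 0, 0], "effusion": [0, 1, 0, 1]}
print(cohen_kappa(truth["atelectasis"], model["atelectasis"]), macro_f1(truth, model))
```

Macro-averaging weights every finding equally, so a rare finding with poor agreement pulls the score down as much as a common one.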
[NLP-130] DiaBlo: Diagonal Blocks Are Sufficient For Finetuning
[Quick read]: This paper addresses the performance gap between Parameter-Efficient Finetuning (PEFT) methods and full-model fine-tuning. The key to the solution is DiaBlo, a simple yet effective PEFT method that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo needs no low-rank matrix products, so it avoids reliance on auxiliary initialization schemes or customized optimization strategies and achieves stable, robust convergence while keeping memory efficiency and training speed comparable to LoRA.
Link: https://arxiv.org/abs/2506.03230
Authors: Selcuk Gurses,Aozhong Zhang,Yanxia Deng,Xun Dong,Xin Li,Naigang Wang,Penghang Yin,Zi Yang
Affiliations: University at Albany, SUNY; IBM T. J. Watson Research Center
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
Comments:
Abstract:Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Codes are available at this https URL.
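As a rough illustration of what "updating only the diagonal blocks" of a weight matrix means, the Python sketch below adds a set of small square blocks along the diagonal of a frozen square matrix and counts the resulting trainable parameters. This is a toy under stated assumptions, not the released DiaBlo code: matrices are plain nested lists, the weight is assumed square and evenly divisible into blocks, and the function names are hypothetical.

```python
def add_diagonal_blocks(weight, blocks):
    """Return weight + blockdiag(blocks) without modifying the inputs.

    weight: d x d nested list (frozen pretrained weight in this sketch).
    blocks: list of k square b x b nested lists with k * b == d.
    """
    d, b = len(weight), len(blocks[0])
    if b * len(blocks) != d:
        raise ValueError("blocks must tile the diagonal exactly")
    out = [row[:] for row in weight]
    for k, block in enumerate(blocks):
        offset = k * b  # top-left corner of the k-th diagonal block
        for i in range(b):
            for j in range(b):
                out[offset + i][offset + j] += block[i][j]
    return out


def trainable_params(d, num_blocks):
    """Number of entries in num_blocks diagonal blocks of a d x d matrix."""
    b = d // num_blocks
    return num_blocks * b * b


frozen = [[1.0] * 4 for _ in range(4)]
delta = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
adapted = add_diagonal_blocks(frozen, delta)
print(adapted)
print(trainable_params(4096, 64), "of", 4096 * 4096)
```

Off-diagonal entries are never touched, so the trainable parameter count shrinks by a factor equal to the number of blocks; for comparison, a rank-r LoRA adapter on the same d x d matrix trains 2·d·r parameters.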
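[NLP-131] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
[Quick read]: This paper addresses the bottleneck of automatically parsing scanned documents into structured, machine-readable formats, where traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. The key to the solution is layoutRL, an end-to-end reinforcement learning framework that makes the model explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation.
Link: https://arxiv.org/abs/2506.03197
Authors: Baode Wang,Biao Wu,Weizhen Li,Meng Fang,Yanjie Liang,Zuming Huang,Haozhe Wang,Jun Huang,Ling Chen,Wei Chu,Yuan Qi
Affiliations: INFLY Tech; Australian Artificial Intelligence Institute; University of Liverpool
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 12 figures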
Abstract:Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
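The composite reward named in this abstract (normalized edit distance, paragraph count accuracy, reading order preservation) could be sketched as follows. The abstract does not give the exact formulas or weights, so everything here is an assumption for illustration: the three terms are averaged with equal weight, paragraphs are matched verbatim and assumed unique, and reading order is scored as the fraction of correctly ordered paragraph pairs.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_reward(pred, ref):
    """1 minus edit distance normalized by the longer string."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, ref) / longest


def count_reward(pred_paras, ref_paras):
    """Penalize the relative error in the number of paragraphs."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))


def order_reward(pred_paras, ref_paras):
    """Fraction of reference paragraph pairs that keep their relative order."""
    pos = [pred_paras.index(p) for p in ref_paras if p in pred_paras]
    pairs = [(i, j) for i in range(len(pos)) for j in range(i + 1, len(pos))]
    if not pairs:
        return 1.0
    return sum(pos[i] < pos[j] for i, j in pairs) / len(pairs)


def layout_reward(pred_paras, ref_paras):
    """Equal-weight average of the three terms (the weighting is an assumption)."""
    text = edit_reward("\n".join(pred_paras), "\n".join(ref_paras))
    return (text + count_reward(pred_paras, ref_paras) + order_reward(pred_paras, ref_paras)) / 3


reference = ["Title", "Intro", "Body"]
shuffled = ["Intro", "Title", "Body"]
print(layout_reward(reference, reference), layout_reward(shuffled, reference))
```

A parse that contains the right text in the wrong order is penalized by both the edit-distance term and the order term, which is the behavior a layout-aware reward is meant to capture.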
[NLP-132] mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
[Quick read]: This paper addresses the limitations of Large Vision-Language Models (LVLMs) in dynamic real-world applications: static training data, hallucinations, and the inability to verify claims against up-to-date external evidence. The key to the solution is Retrieval-Augmented Generation (RAG), which lets an LVLM access large-scale knowledge databases through retrieval mechanisms so that generation incorporates factual, contextually relevant information, improving the accuracy and reliability of model outputs.
Link: https://arxiv.org/abs/2505.24073
Authors: Chan-Wei Hu,Yueqi Wang,Shuo Xing,Chia-Ju Chen,Zhengzhong Tu
Affiliations: Texas A&M University; University of California, Berkeley; University of Texas at Austin
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures
Abstract:Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
[NLP-133] Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models INTERSPEECH2025
[Quick read]: This paper addresses how to perform tone recognition effectively in low-resource languages, namely Angami, Ao, and Mizo from North-East India. The key to the solution is to use self-supervised learning (SSL) models, specifically Wav2vec2.0 base models, and to analyze tone recognition performance layer by layer, finding that the middle layers matter most for tone recognition regardless of whether the pre-training language is tonal. The study also shows that tone inventory, tone types, and dialectal variation affect tone recognition.
Link: https://arxiv.org/abs/2506.03606
Authors: Parismita Gogoi,Sishir Kalita,Wendy Lalhminghlui,Viyazonuo Terhiija,Moakala Tzudir,Priyankoo Sarmah,S. R. M. Prasanna
Affiliations: IIT Guwahati; DUIET, Dibrugarh University; Armsoftech.air; National Institute of Electronics & Information Technology; IIIT Dharwad
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: Accepted in Interspeech2025
Abstract:This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show that tone recognition works best for Mizo and worst for Angami. The middle layers of the SSL models are the most important for tone recognition, regardless of the pre-training language, i.e. tonal or non-tonal. We have also found that the tone inventory, tone types, and dialectal variations affect tone recognition. These findings provide useful insights into the strengths and weaknesses of SSL-based embeddings for tonal languages and highlight the potential for improving tone recognition in low-resource settings. The source code is available on GitHub.
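Computer Vision
[CV-0] LayerFlow: A Unified Model for Layer-aware Video Generation
[Quick read]: This paper addresses controllable generation of layered video content (transparent foreground, clean background, and blended scene), along with variants such as decomposing a blended video or generating the background for a given foreground. The key to the solution is the LayerFlow framework, which organizes the videos for different layers as sub-clips and uses layer embeddings to distinguish each clip and its layer-wise prompt, supporting multiple generation tasks in one unified framework. To cope with the lack of high-quality layer-wise training videos, it also designs a multi-stage training strategy that combines static image and video data to improve model performance.
Link: https://arxiv.org/abs/2506.04228
Authors: Sihui Ji,Hao Luo,Xi Chen,Yuanpeng Tu,Yiyang Wang,Hengshuang Zhao
Affiliations: The University of Hong Kong; DAMO Academy; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL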
Abstract:We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. For the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on the mixture of image data with high-quality layered images along with copy-pasted video data. During inference, we remove the motion LoRA thus generating smooth videos with desired layers.
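[CV-1] Object-centric 3D Motion Field for Robot Learning from Human Videos
[Quick read]: This paper addresses how to extract action knowledge (or action representations) from human videos for learning robot control policies, a process hampered by modeling complexity or loss of information. The key to the solution is to represent actions with an object-centric 3D motion field and to build a novel framework that extracts this representation from videos for zero-shot control. Its core contributions are a novel training pipeline for a “denoising” 3D motion field estimator that robustly extracts fine object 3D motions from human videos with noisy depth, and a dense object-centric 3D motion field prediction architecture that favors cross-embodiment transfer and generalization to background.
Link: https://arxiv.org/abs/2506.04227
Authors: Zhao-Heng Yin,Sherry Yang,Pieter Abbeel
Affiliations: BAIR, UC Berkeley EECS; Google DeepMind
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Project: this https URL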
Abstract:Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixelflow, and pointcloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a “denoising” 3D motion field estimator to extract fine object 3D motions from human videos with noisy depth robustly. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background. We evaluate the system in real world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieves a 55% average success rate in diverse tasks where prior approaches fail (≲10%), and can even acquire fine-grained manipulation skills like insertion.
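[CV-2] Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation
[Quick read]: This paper addresses the complex problem of generating long-range, 3D-consistent, explorable 3D scenes, especially for applications such as video games and virtual reality in which users explore along custom camera trajectories. The key to the solution is the Voyager framework, which achieves inherent consistency across frames through end-to-end scene generation and reconstruction, without relying on traditional 3D reconstruction pipelines (such as structure-from-motion or multi-view stereo). Its core components are: 1) a world-consistent video diffusion model that jointly generates aligned RGB and depth video sequences to ensure global coherence; 2) a long-range world exploration mechanism that combines point culling with auto-regressive inference for context-aware, consistent scene extension; and 3) a scalable data engine that automates camera pose estimation and metric depth prediction, supporting the construction of large-scale, diverse training data.
Link: https://arxiv.org/abs/2506.04225
Authors: Tianyu Huang,Wangguandong Zheng,Tengfei Wang,Yuhao Liu,Zhenwei Wang,Junta Wu,Jie Jiang,Hui Li,Rynson W.H. Lau,Wangmeng Zuo,Chunchao Guo
Affiliations: Harbin Institute of Technology; Southeast University; Tencent Hunyuan; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: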
Abstract:Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with a user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: A unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observation to ensure global coherence; 2) Long-Range World Exploration: An efficient world cache with point culling and an auto-regressive inference with smooth video sampling for iterative scene extension with context-aware consistency; and 3) Scalable Data Engine: A video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Collectively, these designs result in a clear improvement over existing methods in visual quality and geometric accuracy, with versatile applications.
[CV-3] Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset
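[Quick read]: This paper addresses novel view synthesis (NVS) and visual relocalisation under challenging lighting conditions. Existing datasets often lack key features such as ground-truth 3D geometry, wide-ranging lighting variation, and full six-degree-of-freedom (6DoF) motion. The Oxford Day-and-Night dataset fills these gaps by capturing egocentric video with Meta ARIA glasses and applying multi-session SLAM to estimate camera poses, reconstruct 3D point clouds, and align sequences recorded under different lighting conditions. The dataset covers more than 30 km of trajectories and an area of 40,000 m², providing a rich foundation for egocentric 3D vision research.
Link: https://arxiv.org/abs/2506.04224
Authors: Zirui Wang, Wenjing Bian, Xinghui Li, Yifu Tao, Jianeng Wang, Maurice Fallon, Victor Adrian Prisacariu
Affiliations: University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL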
Abstract:We introduce Oxford Day-and-Night, a large-scale, egocentric dataset for novel view synthesis (NVS) and visual relocalisation under challenging lighting conditions. Existing datasets often lack crucial combinations of features such as ground-truth 3D geometry, wide-ranging lighting variation, and full 6DoF motion. Oxford Day-and-Night addresses these gaps by leveraging Meta ARIA glasses to capture egocentric video and applying multi-session SLAM to estimate camera poses, reconstruct 3D point clouds, and align sequences captured under varying lighting conditions, including both day and night. The dataset spans over 30 km of recorded trajectories and covers an area of 40,000 m², offering a rich foundation for egocentric 3D vision research. It supports two core benchmarks, NVS and relocalisation, providing a unique platform for evaluating models in realistic and diverse environments.
[CV-4] Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models
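[Quick read]: This paper asks how large multimodal models (LMMs) can reason about 3D space without explicit 3D inputs or specialized model architectures. The key to the solution is Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally adding egocentric keyframes when needed, so that structured 2D representations support effective modeling of and reasoning about 3D space.
Link: https://arxiv.org/abs/2506.04220
Authors: Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
Affiliations: Northeastern University; Microsoft Research; University of Southern California; University of California, Santa Cruz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL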
Abstract:Unlocking spatial reasoning in Large Multimodal Models (LMMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can LMMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source LMMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source LMM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in LMMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
[CV-5] Pseudo-Simulation for Autonomous Driving
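[Quick read]: This paper addresses key limitations of existing evaluation paradigms for autonomous vehicles (AVs): safety and reproducibility concerns in real-world evaluation, insufficient realism or high computational cost in closed-loop simulation, and open-loop metrics that overlook compounding errors. The key to the solution is a new paradigm called pseudo-simulation, which operates on real datasets, generates synthetic observations with 3D Gaussian Splatting, and uses a proximity-based weighting scheme to give higher weight to the synthetic observations that best match the AV's likely behavior. This allows error recovery and the mitigation of causal confusion to be evaluated without sequential interactive simulation.
Link: https://arxiv.org/abs/2506.04218
Authors: Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Affiliations: University of Tübingen, Tübingen AI Center; NVIDIA Research; Robert Bosch GmbH; OpenDriveLab at Shanghai Innovation Institute; University of Stuttgart; University of Toronto; Vector Institute; Stanford University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: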
Abstract:Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV’s likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations (R^2=0.8) than the best existing open-loop approach (R^2=0.7). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at this https URL.
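The proximity-based weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact scheme, so the Gaussian kernel, the bandwidth `sigma`, and the function names `proximity_weights` and `aggregate_score` are assumptions made purely for illustration.

```python
import math

def proximity_weights(planned_xy, synthetic_xy, sigma=1.0):
    """Normalized weights that favor synthetic observations near the planned endpoint.

    Assumption: a Gaussian kernel on Euclidean distance; sigma is a
    hypothetical bandwidth, not a value taken from the paper.
    """
    raw = []
    for x, y in synthetic_xy:
        d2 = (x - planned_xy[0]) ** 2 + (y - planned_xy[1]) ** 2
        raw.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(raw)
    return [w / total for w in raw]

def aggregate_score(scores, weights):
    """Weighted average of per-observation scores."""
    return sum(s * w for s, w in zip(scores, weights))

# Planned endpoint and three pre-generated synthetic start positions (made-up values).
weights = proximity_weights((10.0, 0.5), [(10.0, 0.0), (10.0, 1.0), (10.0, 3.0)])
score = aggregate_score([0.9, 0.8, 0.2], weights)
```

In this sketch the two observations closest to the planned endpoint dominate the aggregate, so a poor score at a state the planner is unlikely to reach has little influence.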
[CV-6] UNIC: Unified In-Context Video Editing
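[Quick read]: This paper addresses the problem that existing generative video editing methods rely on task-specific architectures or dedicated customizations, which makes it hard to integrate diverse editing conditions and to unify editing tasks. The key to the solution is the UNified In-Context Video Editing (UNIC) framework, which represents the inputs of different video editing tasks as three types of tokens (source video tokens, noisy video latent tokens, and multi-modal conditioning tokens), concatenates them into a single consecutive token sequence, and models them jointly with the native attention of DiT, avoiding task-specific adapter designs. In addition, task-aware RoPE and a condition bias are introduced to resolve the token collisions and task confusion caused by varying video lengths and diverse condition modalities.
Link: https://arxiv.org/abs/2506.04216
Authors: Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, Wenhan Luo
Affiliations: Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page is at this https URL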
Abstract:Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and supports flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
[CV-7] Sounding that Object: Interactive Object-Aware Image to Audio Generation ICML2025
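[Quick read]: This paper addresses the generation of accurate sounds for complex audio-visual scenes, especially when multiple objects and sound sources are present. The key to the solution is an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects. The method integrates object-centric learning into a conditional latent diffusion model and uses multi-modal attention to learn associations between image regions and their corresponding sounds, so that at test time image segmentation lets users interactively generate sounds at the object level.
Link: https://arxiv.org/abs/2506.04214
Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ICML 2025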
Abstract:Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: this https URL
[CV-8] FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
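[Quick read]: This paper addresses the computational bottleneck of in-context conditioning for video diffusion transformers, in particular the quadratic computation overhead that appears as task complexity grows. The key solution is the FullDiT2 framework, which improves efficiency through two core innovations: a dynamic token selection mechanism that removes redundant context condition tokens and so shortens the sequence processed by unified full attention, and a selective context caching mechanism that reduces redundant interactions between condition tokens and video latents.
Link: https://arxiv.org/abs/2506.04213
Authors: Xuanhua He, Quande Liu, Zixuan Ye, Wecai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
Affiliations: The Hong Kong University of Science and Technology; Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: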
Abstract:Fine-grained and efficient controllability of video diffusion transformers is increasingly in demand for practical applications. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis to identify two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3 times speedup in average time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at this https URL.
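To see why shortening the context matters, the following sketch counts pairwise token interactions in full self-attention, which grow with the square of the sequence length. The token counts, the importance scores, and the helper names `attention_cost` and `select_top_k` are hypothetical; the abstract does not describe how FullDiT2 scores or selects tokens, so top-k selection here is only a stand-in.

```python
def attention_cost(num_latent_tokens, num_context_tokens):
    """Pairwise token interactions in full self-attention (up to a constant factor)."""
    n = num_latent_tokens + num_context_tokens
    return n * n

def select_top_k(scores, k):
    """Indices of the k highest-scoring context tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)

# Hypothetical sizes: 4000 video-latent tokens, 4000 context tokens, keep 1000.
full = attention_cost(4000, 4000)       # 8000^2 = 64,000,000
reduced = attention_cost(4000, 1000)    # 5000^2 = 25,000,000
ratio = reduced / full                  # 0.390625

# Toy importance scores for four context tokens; keep the two most important.
kept = select_top_k([0.1, 0.9, 0.4, 0.8], 2)   # [1, 3]
```

Under these made-up sizes, dropping three quarters of the context tokens cuts the attention cost to about 39% of the original, which is the kind of saving that token selection targets.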
[CV-9] Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector
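[Quick read]: This paper addresses the performance drop of object detectors when there is a large domain gap between the source domain (training data) and the target domain (real-world data). The key to the solution is to use a diffusion-based generative model as the teacher: a detector built on a frozen-weight diffusion model is trained on the source domain, then generates pseudo labels on the unlabeled target domain, which guide the supervised learning of the student model on the target domain. The approach is called Diffusion Domain Teacher (DDT).
Link: https://arxiv.org/abs/2506.04211
Authors: Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Affiliations: Xiamen University; Institute of Artificial Intelligence; School of Aerospace Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MM2024 poster, with appendix and code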
Abstract:Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT. The code is available at this https URL.
[CV-10] Language-Image Alignment with Fixed Text Encoders
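[Quick read]: This paper addresses the reliance of current language-image alignment methods, such as CLIP and its variants, on costly joint training of text and image encoders. The key to the solution is to use a pre-trained, fixed large language model (LLM) as the text encoder and train only the image encoder, a framework named LIFT (Language-Image alignment with a Fixed Text encoder). The method achieves considerable gains in computational efficiency while remaining highly effective, and it outperforms CLIP in most scenarios involving compositional understanding and long captions.
Link: https://arxiv.org/abs/2506.04209
Authors: Jingfeng Yang, Ziyang Wu, Yue Zhao, Yi Ma
Affiliations: UC Berkeley; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: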
Abstract:Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. In particular, we investigate if a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework LIFT is highly effective and it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.
[CV-11] FlexGS: Train Once Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting CVPR2025
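[Quick read]: This paper addresses the relatively high GPU memory consumption of 3D Gaussian splatting (3DGS), which limits its use on devices with restricted computational resources. The key to the solution is an elastic inference method that, given the desired model size as input, selects and transforms a subset of Gaussians, achieving substantial rendering performance without additional fine-tuning. The paper introduces a tiny learnable module that controls Gaussian selection based on the input percentage, together with a transformation module that adjusts the selected Gaussians to compensate for the performance loss of the reduced model.
Link: https://arxiv.org/abs/2506.04174
Authors: Hengyu Liu, Yuehao Wang, Chenxin Li, Ruisi Cai, Kevin Wang, Wuyang Li, Pavlo Molchanov, Peihao Wang, Zhangyang Wang
Affiliations: The Chinese University of Hong Kong; University of Texas at Austin; Nvidia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025; Project Page: this https URL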
Abstract:3D Gaussian splatting (3DGS) has enabled various applications in 3D scene representation and novel view synthesis due to its efficient rendering capabilities. However, 3DGS demands relatively significant GPU memory, limiting its use on devices with restricted computational resources. Previous approaches have focused on pruning less important Gaussians, effectively compressing 3DGS but often requiring a fine-tuning stage and lacking adaptability for the specific memory needs of different devices. In this work, we present an elastic inference method for 3DGS. Given an input for the desired model size, our method selects and transforms a subset of Gaussians, achieving substantial rendering performance without additional fine-tuning. We introduce a tiny learnable module that controls Gaussian selection based on the input percentage, along with a transformation module that adjusts the selected Gaussians to complement the performance of the reduced model. Comprehensive experiments on ZipNeRF, MipNeRF and Tanks&Temples scenes demonstrate the effectiveness of our approach. Code is available at this https URL.
[CV-12] Image Editing As Programs with Diffusion Models
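[Quick read]: This paper addresses the challenges diffusion models face in instruction-driven image editing, especially structurally inconsistent edits that involve substantial layout changes. The key to the solution is Image Editing As Programs (IEAP), a unified image editing framework built on the Diffusion Transformer (DiT) architecture. It decomposes complex editing instructions into sequences of atomic operations that are programmed by a vision-language model (VLM)-based agent, enabling arbitrary and structurally inconsistent transformations.
Link: https://arxiv.org/abs/2506.04158
Authors: Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
Affiliations: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: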
Abstract:While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at this https URL.
[CV-13] Person Re-Identification System at Semantic Level based on Pedestrian Attributes Ontology
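[Quick read]: This paper addresses several remaining challenges in person re-identification (Re-ID): large-scale datasets, imbalanced data, viewpoint changes, fine-grained data (attributes), and the fact that local features are not used at the semantic level in the online stage, with attribute imbalance in particular left unaddressed. The key to the solution is a unified Re-ID system with three main modules: a Pedestrian Attribute Ontology (PAO), a Local Multi-task DCNN (Local MDCNN), and an Imbalance Data Solver (IDS). Through the mutual support of PAO, Local MDCNN, and IDS, the system exploits the inner-group correlations of attributes and pre-filters candidates using semantic information such as fashion attributes and facial attributes, handling attribute imbalance without changing the network architecture or applying data augmentation.
Link: https://arxiv.org/abs/2506.04143
Authors: Ngoc Q. Ly, Hieu N. M. Cao, Thi T. Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: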
Abstract:Person Re-Identification (Re-ID) is an important task in video surveillance systems, with uses such as tracking people, finding people in public places, and analysing customer behavior in supermarkets. Although many works have tackled this problem, challenges remain: large-scale datasets, imbalanced data, viewpoint changes, and fine-grained data (attributes). In addition, local features are not employed at the semantic level in the online stage of the Re-ID task, and the imbalanced data problem of attributes is not taken into consideration. This paper proposes a unified Re-ID system consisting of three main modules: Pedestrian Attribute Ontology (PAO), Local Multi-task DCNN (Local MDCNN), and Imbalance Data Solver (IDS). The main novelty of our Re-ID system is the mutual support of PAO, Local MDCNN and IDS, which exploits the inner-group correlations of attributes, pre-filters mismatched candidates from the gallery set based on semantic information such as fashion attributes and facial attributes, and handles the imbalanced data of attributes without adjusting the network architecture or applying data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on Market1501 than some state-of-the-art Re-ID methods.
[CV-14] UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
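[Quick read]: This paper addresses the challenge of generating intelligible speech directly from Cued Speech (CS) videos (CSV2S), in particular the poor performance of single-step CSV2S caused by insufficient CS data. Existing work mostly focuses on CS recognition (CSR) and uses text as an intermediate medium for cross-modal alignment, which can lead to error propagation and temporal misalignment between speech and video dynamics. The key to the proposed solution is the unified framework UniCUE, whose core innovation is to integrate the CSR task into CSV2S to provide fine-grained visual-semantic information and improve speech generation quality. Its components are a fine-grained semantic alignment pool, a VisioPhonetic adapter, and a pose-aware visual processor.
Link: https://arxiv.org/abs/2506.04134
Authors: Jinting Wang, Shan Yang, Li Liu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tencent AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 10 figures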
Abstract:Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Based on this, one straightforward way of CSV2S is to combine CSR with a Text-to-Speech system. This combined architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in the integration of the CSR task that provides fine-grained visual-semantic information to facilitate speech generation from CS videos. More precisely, UniCUE comprises (1) a novel fine-grained semantic alignment pool that ensures precise mapping between visual features and speech contents; (2) a VisioPhonetic adapter that bridges cross-task representations, ensuring seamless compatibility between the two distinct tasks (i.e., CSV2S and CSR); and (3) a pose-aware visual processor that enhances fine-grained spatiotemporal correlations between lip and hand movements in CS video. Experiments on our newly established Chinese CS dataset (14 cuers: 8 hearing-impaired and 6 normal-hearing) show that UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to the single CSV2S.
[CV-15] Contour Errors: An Ego-Centric Metric for Reliable 3D Multi-Object Tracking
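[Quick read]: This paper addresses reliable matching in multi-object tracking, which underpins the accuracy and reliability of perception systems in safety-critical applications such as autonomous driving. Traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs) often fail to find critical matches in complex 3D scenes. The paper proposes Contour Errors (CEs), a metric that takes a functional perspective and compares bounding boxes in the ego vehicle's frame, giving a more functionally relevant assessment of matches. The key is using CEs to improve the reliability of matches in tracking methods; experiments show that CEs reduce functional failures (FPs/FNs) compared with the existing 2D IoU and CPD metrics.
Link: https://arxiv.org/abs/2506.04122
Authors: Sharang Kaul, Mario Berk, Thiemo Gerbich, Abhinav Valada
Affiliations: CARIAD SE - Volkswagen Group; University of Freiburg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: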
Abstract:Finding reliable matches is essential in multi-object tracking to ensure the accuracy and reliability of perception systems in safety-critical applications such as autonomous vehicles. Effective matching mitigates perception errors, enhancing object identification and tracking for improved performance and safety. However, traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs), which are effective in 2D image planes, often fail to find critical matches in complex 3D scenes. To address this limitation, we introduce Contour Errors (CEs), an ego- or object-centric metric for identifying matches of interest in tracking scenarios from a functional perspective. By comparing bounding boxes in the ego vehicle's frame, contour errors provide a more functionally relevant assessment of object matches. Extensive experiments on the nuScenes dataset demonstrate that contour errors improve the reliability of matches over the state-of-the-art 2D IoU and CPD metrics in tracking-by-detection methods. In 3D car tracking, our results show that Contour Errors reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU in the evaluation stage.
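For reference, the two baseline matching metrics named in the abstract can be sketched as follows. This is a simplified illustration that assumes axis-aligned 2D boxes in the form `(x_min, y_min, x_max, y_max)`; real tracking boxes are usually rotated, and the Contour Error itself is not defined in the abstract, so it is not implemented here.

```python
import math

def iou_2d(a, b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center_distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    ax, ay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    return math.hypot(ax - bx, ay - by)

ground_truth = (0.0, 0.0, 4.0, 2.0)   # a 4 m x 2 m box (made-up values)
detection = (1.0, 0.0, 5.0, 2.0)      # same size, shifted 1 m along x

iou = iou_2d(ground_truth, detection)            # 6 / 10 = 0.6
cpd = center_distance(ground_truth, detection)   # 1.0
```

Both values depend only on the relative geometry of the two boxes, not on where the boxes lie with respect to the ego vehicle, which is the gap an ego-centric metric is meant to close.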
[CV-16] Multi-view Surface Reconstruction Using Normal and Reflectance Cues
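[Quick read]: This paper addresses high-fidelity 3D surface reconstruction that preserves fine details for materials with complex reflectance properties and without a dense-view setup. The key to the solution is a versatile framework that incorporates multi-view normal maps and, optionally, reflectance maps into radiance-based surface reconstruction. A pixel-wise joint re-parametrization represents reflectance and surface normals as a vector of radiances under simulated, varying illumination, which allows seamless integration into standard surface reconstruction pipelines.
Link: https://arxiv.org/abs/2506.04115
Authors: Robin Bruneau, Baptiste Brument, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis Durou, Lilian Calvet
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 15 figures, 11 tables. A thorough qualitative and quantitative study is available in the supplementary material at this https URL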
Abstract:Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data related to this article are available at this https URL.
[CV-17] GlobalBuildingAtlas: An Open Global and Complete Dataset of Building Polygons Heights and LoD1 3D Models
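[Quick read]: This paper addresses the global lack of high-quality, consistent, and complete 2D and 3D building data. The key solution is a set of machine-learning-based pipelines that derive building polygons and heights from global PlanetScope satellite data, combined with a quality-based fusion strategy that produces higher-quality building polygons. With these methods the authors build the GlobalBuildingAtlas dataset, which contains more than 2.75 billion buildings and provides the most detailed and accurate global 3D building height maps to date at a spatial resolution of 3×3 meters, compared with 90 m for previous products, enabling high-resolution and reliable analysis of building volumes at local and global scales.
Link: https://arxiv.org/abs/2506.04106
Authors: Xiao Xiang Zhu, Sining Chen, Fahong Zhang, Yilei Shi, Yuanyuan Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: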
Abstract:We introduce GlobalBuildingAtlas, a publicly available dataset providing global and complete coverage of building polygons, heights and Level of Detail 1 (LoD1) 3D building models. This is the first open dataset to offer high quality, consistent, and complete building data in 2D and 3D form at the individual building level on a global scale. Towards this dataset, we developed machine learning-based pipelines to derive building polygons and heights (called this http URL) from global PlanetScope satellite data, respectively. Also, a quality-based fusion strategy was employed to generate higher-quality polygons (called this http URL) based on existing open building polygons, including our own derived one. With more than 2.75 billion buildings worldwide, this http URL surpasses the most comprehensive database to date by more than 1 billion buildings. this http URL offers the most detailed and accurate global 3D building height maps to date, achieving a spatial resolution of 3x3 meters, 30 times finer than previous global products (90 m), enabling a high-resolution and reliable analysis of building volumes at both local and global scales. Finally, we generated a global LoD1 building model (called GBA.LoD1) from the resulting this http URL and this http URL. GBA.LoD1 represents the first complete global LoD1 building models, including 2.68 billion building instances with predicted heights, i.e., with a height completeness of more than 97%, achieving RMSEs ranging from 1.5 m to 8.9 m across different continents. With its height accuracy, comprehensive global coverage and rich spatial details, GlobalBuildingAtlas offers novel insights on the status quo of global buildings, which unlocks unprecedented geospatial analysis possibilities, as showcased by a better illustration of where people live and a more comprehensive monitoring of the progress on the 11th Sustainable Development Goal of the United Nations.
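The headline figures in the abstract can be checked with a few lines of arithmetic. The variable names are illustrative, and the building counts are the rounded values quoted in the abstract (which states "more than 2.75 billion"), so the completeness ratio is only approximate.

```python
# Resolution: 3 m grid cells versus 90 m in previous global products.
previous_resolution_m = 90.0
gba_resolution_m = 3.0
linear_factor = previous_resolution_m / gba_resolution_m   # 30x finer per axis
area_factor = linear_factor ** 2                            # 900 cells per old cell

# Height completeness: buildings with a predicted height over all buildings.
buildings_with_height = 2.68e9
total_buildings = 2.75e9
height_completeness = buildings_with_height / total_buildings   # about 0.9745
```

The "30 times finer" claim refers to the linear cell size; in terms of area, each 90 m cell corresponds to 900 cells at 3 m, and the quoted counts are consistent with the stated completeness of more than 97%.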
[CV-18] Point Cloud Quality Assessment Using the Perceptual Clustering Weighted Graph (PCW-Graph) and Attention Fusion Network
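Quick read: This paper addresses No-Reference Point Cloud Quality Assessment (NR-PCQA), that is, evaluating the quality of 3D content when no reference model is available. The key to the solution is a model that predicts point cloud quality accurately without relying on the pristine reference point cloud.
Link: https://arxiv.org/abs/2506.04081
Authors: Abdelouahed Laazoufi,Mohammed El Hassouni,Hocine Cherifi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: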
Abstract:No-Reference Point Cloud Quality Assessment (NR-PCQA) is critical for evaluating 3D content in real-world applications where reference models are unavailable.
[CV-19] Optimal Transport-based Domain Alignment as a Preprocessing Step for Federated Learning
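Quick read: This paper addresses the degradation of global model aggregation in Federated Learning (FL) caused by dataset imbalance: local data distributions differ across edge devices, which in turn hurts local model updates and the accuracy of the distributed agents' decisions. The key to the solution is an Optimal Transport-based preprocessing algorithm that aligns the datasets by computing channel-wise Wasserstein barycenters, thereby minimizing the distributional discrepancy. The barycenters are collected on a trusted central server to generate a target RGB space, and the datasets are projected toward that space, which minimizes the discrepancy at the global level and improves learning efficiency and generalization.
Link: https://arxiv.org/abs/2506.04071
Authors: Luiz Manella Pereira,M. Hadi Amini
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: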
Abstract:Federated learning (FL) is a subfield of machine learning that avoids sharing local data with a central server, which can enhance privacy and scalability. The inability to consolidate data leads to a unique problem called dataset imbalance, where agents in a network do not have equal representation of the labels one is trying to learn to predict. In FL, fusing locally-trained models with unbalanced datasets may deteriorate the performance of global model aggregation, and reduce the quality of updated local models and the accuracy of the distributed agents’ decisions. In this work, we introduce an Optimal Transport-based preprocessing algorithm that aligns the datasets by minimizing the distributional discrepancy of data along the edge devices. We accomplish this by leveraging Wasserstein barycenters when computing channel-wise averages. These barycenters are collected in a trusted central server where they collectively generate a target RGB space. By projecting our dataset towards this target space, we minimize the distributional discrepancy on a global level, which facilitates the learning process due to a minimization of variance across the samples. We demonstrate the capabilities of the proposed approach over the CIFAR-10 dataset, where we show its capability of reaching higher degrees of generalization in fewer communication rounds.
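The channel-wise alignment idea can be illustrated with the one-dimensional case, where the Wasserstein-2 barycenter has a closed form: its quantile function is the average of the input quantile functions. The sketch below is not the authors' algorithm; it assumes equally sized samples per client and equal client weights, and the function names `barycenter_1d` and `project_to_target` are hypothetical.

```python
def barycenter_1d(samples_per_client):
    """Equal-weight 1D Wasserstein-2 barycenter of equally sized samples.

    In 1D, averaging the sorted values (order statistics) across clients
    averages the empirical quantile functions.
    """
    sorted_sets = [sorted(s) for s in samples_per_client]
    n = len(sorted_sets[0])
    if any(len(s) != n for s in sorted_sets):
        raise ValueError("this sketch assumes equally sized samples")
    k = len(sorted_sets)
    return [sum(s[i] for s in sorted_sets) / k for i in range(n)]


def project_to_target(values, target_sorted):
    """Monotone map: the k-th smallest value is sent to the k-th smallest target value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = target_sorted[rank]
    return out


# Toy intensities of one color channel on two devices (illustrative numbers).
client_a = [0.1, 0.3, 0.2, 0.4]
client_b = [0.7, 0.5, 0.9, 0.6]
target = barycenter_1d([client_a, client_b])
aligned_a = project_to_target(client_a, target)
```

After projection, both clients share the same empirical channel distribution while each client's internal ordering of intensities is preserved.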
[CV-20] Video Deblurring with Deconvolution and Aggregation Networks
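Quick read: This paper addresses the sub-optimal performance of existing video deblurring algorithms, which fail to make proper use of neighboring frames. The key to the solution is a deconvolution and aggregation network (DAN) built from three sub-networks: the preprocessing network (PPN), the alignment-based deconvolution network (ABDN), and the frame aggregation network (FAN). Together they exploit neighbor-frame information effectively and improve video deblurring quality.
Link: https://arxiv.org/abs/2506.04054
Authors: Giyong Choi,HyunWook Park
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: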
Abstract:In contrast to single-image deblurring, video deblurring has the advantage that neighbor frames can be utilized to deblur a target frame. However, existing video deblurring algorithms often fail to properly employ the neighbor frames, resulting in sub-optimal performance. In this paper, we propose a deconvolution and aggregation network (DAN) for video deblurring that utilizes the information of neighbor frames well. In DAN, both deconvolution and aggregation strategies are achieved through three sub-networks: the preprocessing network (PPN) and the alignment-based deconvolution network (ABDN) for the deconvolution scheme; the frame aggregation network (FAN) for the aggregation scheme. In the deconvolution part, blurry inputs are first preprocessed by the PPN with non-local operations. Then, the output frames from the PPN are deblurred by the ABDN based on the frame alignment. In the FAN, these deblurred frames from the deconvolution part are combined into a latent frame according to reliability maps which infer pixel-wise sharpness. The proper combination of three sub-networks can achieve favorable performance on video deblurring by using the neighbor frames suitably. In experiments, the proposed DAN was demonstrated to be superior to existing state-of-the-art methods through both quantitative and qualitative evaluations on the public datasets.
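The aggregation step, in which deblurred frames are combined according to per-pixel reliability maps, can be pictured as a normalized weighted average. This is only a sketch under that assumption: the abstract does not state the exact combination rule, and `aggregate` is a hypothetical name.

```python
def aggregate(frames, reliability):
    """Per-pixel weighted average of frames, weighted by reliability maps.

    frames, reliability: lists of equally sized 2D lists (rows of floats).
    Pixels whose reliabilities sum to zero fall back to a plain mean.
    """
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(r[y][x] for r in reliability)
            if total == 0:
                out[y][x] = sum(f[y][x] for f in frames) / len(frames)
            else:
                weighted = sum(f[y][x] * r[y][x] for f, r in zip(frames, reliability))
                out[y][x] = weighted / total
    return out


# Two 1x2 "frames"; the second frame is judged sharper at the first pixel.
frames = [[[0.0, 1.0]], [[1.0, 1.0]]]
reliability = [[[1.0, 0.5]], [[3.0, 0.5]]]
latent = aggregate(frames, reliability)
```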
[CV-21] EV-Flying: an Event-based Dataset for In-The-Wild Recognition of Flying Objects
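Quick read: This paper addresses the difficulties that traditional RGB-based aerial object monitoring faces with scale variations, motion blur, and fast-moving small flying objects such as insects and drones. The key to the solution is event-based vision, whose high temporal resolution, low latency, and robustness to motion blur improve the detection and recognition of small flying objects. The work introduces the EV-Flying dataset and uses point cloud-based event representations with lightweight PointNet-inspired architectures to process the asynchronous event streams.
Link: https://arxiv.org/abs/2506.04048
Authors: Gabriele Magrini,Federico Becattini,Giovanni Colombo,Pietro Pala
Affiliations: University of Florence; University of Siena
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: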
Abstract:Monitoring aerial objects is crucial for security, wildlife conservation, and environmental studies. Traditional RGB-based approaches struggle with challenges such as scale variations, motion blur, and high-speed object movements, especially for small flying entities like insects and drones. In this work, we explore the potential of event-based vision for detecting and recognizing flying objects, in particular animals that may not follow short and long-term predictable patterns. Event cameras offer high temporal resolution, low latency, and robustness to motion blur, making them well-suited for this task. We introduce EV-Flying, an event-based dataset of flying objects, comprising manually annotated birds, insects and drones with spatio-temporal bounding boxes and track identities. To effectively process the asynchronous event streams, we employ a point-based approach leveraging lightweight architectures inspired by PointNet. Our study investigates the classification of flying objects using point cloud-based event representations. The proposed dataset and methodology pave the way for more efficient and reliable aerial object recognition in real-world scenarios.
[CV-22] Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
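Quick read: This paper addresses the lack of interpretability and trustworthiness in object referring. Existing methods usually treat referring as direct bounding box prediction, so the model cannot explain its predictions and struggles to reject expressions that match no object in the image. The key to the solution is to formulate object referring as an explicit Chain-of-Thought (CoT) reasoning task: the model reasons step by step over each candidate object to assess whether it matches the given description, which makes predictions more verifiable and trustworthy. To support this, the authors build a large-scale CoT-style referring dataset, HumanRef-CoT, and train in two stages: a cold-start supervised fine-tuning phase to learn structured reasoning, followed by GRPO-based reinforcement learning to further improve performance.
Link: https://arxiv.org/abs/2506.04034
Authors: Qing Jiang,Xingyu Chen,Zhaoyang Zeng,Junzhi Yu,Lei Zhang
Affiliations: International Digital Economy Academy (IDEA); South China University of Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: homepage: this https URL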
Abstract:Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based RL learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
[CV-23] Vocabulary-free few-shot learning for Vision-Language Models CVPR
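Quick read: This paper addresses the dependence of few-shot adaptation for Vision-Language Models (VLMs) on predefined class names, which limits applicability in practice, especially when class names are unavailable or hard to specify. The key to the solution is a vocabulary-free few-shot learning approach, Similarity Mapping (SiM), which classifies target instances solely from their similarity scores with a set of generic prompts (textual or visual), removing the need for handcrafted prompts.
Link: https://arxiv.org/abs/2506.04005
Authors: Maxime Zanella,Clément Fuchs,Ismail Ben Ayed,Christophe De Vleeschouwer
Affiliations: UCLouvain; UMons; ÉTS Montreal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR Workshops 2025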
Abstract:Recent advances in few-shot adaptation for Vision-Language Models (VLMs) have greatly expanded their ability to generalize across tasks using only a few labeled examples. However, existing approaches primarily build upon the strong zero-shot priors of these models by leveraging carefully designed, task-specific prompts. This dependence on predefined class names can restrict their applicability, especially in scenarios where exact class names are unavailable or difficult to specify. To address this limitation, we introduce vocabulary-free few-shot learning for VLMs, a setting where target class instances - that is, images - are available but their corresponding names are not. We propose Similarity Mapping (SiM), a simple yet effective baseline that classifies target instances solely based on similarity scores with a set of generic prompts (textual or visual), eliminating the need for carefully handcrafted prompts. Although conceptually straightforward, SiM demonstrates strong performance, operates with high computational efficiency (learning the mapping typically takes less than one second), and provides interpretability by linking target classes to generic prompts. We believe that our approach could serve as an important baseline for future research in vocabulary-free few-shot learning. Code is available at this https URL.
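The idea of classifying from similarity scores alone can be sketched as a linear map from score vectors to class indicators. The abstract does not specify how the mapping is learned, so the ridge-regularized least-squares fit below is an assumption, and `solve`, `fit_mapping`, and `predict` are hypothetical names. Each score vector stands for the similarities between one image and a fixed set of generic prompts.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_mapping(scores, labels, num_classes, lam=0.01):
    """Ridge regression from similarity scores to one-hot labels.

    Solves (S^T S + lam * I) W = S^T Y, where S stacks the score vectors.
    """
    d = len(scores[0])
    A = [[sum(s[i] * s[j] for s in scores) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    B = [[sum(s[i] for s, y in zip(scores, labels) if y == c)
          for c in range(num_classes)] for i in range(d)]
    return solve(A, B)  # shape: d x num_classes


def predict(W, s):
    """Return the class whose mapped score is largest."""
    logits = [sum(s[i] * W[i][c] for i in range(len(s))) for c in range(len(W[0]))]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy few-shot set: similarities to two generic prompts, two unnamed classes.
shots = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
shot_labels = [0, 0, 1, 1]
W = fit_mapping(shots, shot_labels, num_classes=2)
```

The rows of `W` link each generic prompt to the target classes, which is one way to read the interpretability claim in the abstract.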
[CV-24] RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors NEURIPS2025
Quick read: This paper addresses the insufficient adversarial robustness of AI-generated image detectors: existing detection methods perform well under idealized conditions but may fail under adversarial attack in practice. The key to the solution is a simpler way to assess detector robustness, the RAID (Robust evaluation of AI-generated image Detectors) dataset. It contains 72k diverse and highly transferable adversarial examples, generated by attacking multiple state-of-the-art detectors on images from several text-to-image models, and can be used to estimate a detector's adversarial robustness in unseen settings.
Link: https://arxiv.org/abs/2506.03988
Authors: Hicham Eddoubi,Jonas Ricker,Federico Cocchi,Lorenzo Baraldi,Angelo Sotgiu,Maura Pintor,Marcella Cornia,Lorenzo Baraldi,Asja Fischer,Rita Cucchiara,Battista Biggio
Affiliations: University of Cagliari, Italy; Ruhr University Bochum, Germany; University of Modena and Reggio Emilia, Italy; University of Pisa, Italy; Sapienza University of Rome, Italy; CINI, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under review for NeurIPS 2025 Datasets and Benchmarks Track
Abstract:AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector’s adversarial robustness. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at this https URL and evaluation code at this https URL.
[CV-25] Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach
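Quick read: This paper addresses the reliance of existing diffusion model (DM) based posterior sampling methods on heuristic approximations in Bayesian Inverse Problems (BIPs). The key to the solution is an ensemble-based algorithm: by examining the diffusion process encoded by the pre-trained score function, the authors derive a modified partial differential equation (PDE) that governs the evolution of the posterior distribution. The PDE contains a modified diffusion term and a reweighting term and can be simulated with stochastic weighted particle methods, which avoids heuristic approximations.
Link: https://arxiv.org/abs/2506.03979
Authors: Haoxuan Chen,Yinuo Ren,Martin Renqiang Min,Lexing Ying,Zachary Izzo
Affiliations: Stanford University; NEC Labs America
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 45 pages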
Abstract:Diffusion models (DMs) have proven to be effective in modeling high-dimensional distributions, leading to their widespread adoption for representing complex priors in Bayesian inverse problems (BIPs). However, current DM-based posterior sampling methods proposed for solving common BIPs rely on heuristic approximations to the generative process. To exploit the generative capability of DMs and avoid the usage of such approximations, we propose an ensemble-based algorithm that performs posterior sampling without the use of heuristic approximations. Our algorithm is motivated by existing works that combine DM-based methods with the sequential Monte Carlo (SMC) method. By examining how the prior evolves through the diffusion process encoded by the pre-trained score function, we derive a modified partial differential equation (PDE) governing the evolution of the corresponding posterior distribution. This PDE includes a modified diffusion term and a reweighting term, which can be simulated via stochastic weighted particle methods. Theoretically, we prove that the error between the true posterior distribution can be bounded in terms of the training error of the pre-trained score function and the number of particles in the ensemble. Empirically, we validate our algorithm on several inverse problems in imaging to show that our method gives more accurate reconstructions compared to existing DM-based methods.
[CV-26] MS-YOLO: A Multi-Scale Model for Accurate and Efficient Blood Cell Detection
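Quick read: This paper addresses the time inefficiency and diagnostic inaccuracy of conventional manual microscopy for blood cell detection, as well as the high deployment cost and suboptimal accuracy of existing automated methods. The key to the solution is the multi-scale YOLO (MS-YOLO) model, based on the YOLOv11 framework, with three core architectural innovations: the multi-scale dilated residual module (MS-DRM) to improve multi-scale discriminability, the dynamic cross-path feature enhancement module (DCFEM) to fuse hierarchical features from the backbone with features from the neck, and the light adaptive-weight downsampling module (LADS) to improve feature downsampling while reducing computational complexity. These changes improve the detection of overlapping cells and multi-scale targets such as platelets.
Link: https://arxiv.org/abs/2506.03972
Authors: Guohua Wu,Shengqi Chen,Pengchao Deng,Wenting Yu
Affiliations: Beijing University of Posts and Telecommunications; Shandong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: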
Abstract:Complete blood cell detection holds significant value in clinical diagnostics. Conventional manual microscopy methods suffer from time inefficiency and diagnostic inaccuracies. Existing automated detection approaches remain constrained by high deployment costs and suboptimal accuracy. While deep learning has introduced powerful paradigms to this field, persistent challenges in detecting overlapping cells and multi-scale objects hinder practical deployment. This study proposes the multi-scale YOLO (MS-YOLO), a blood cell detection model based on the YOLOv11 framework, incorporating three key architectural innovations to enhance detection performance. Specifically, the multi-scale dilated residual module (MS-DRM) replaces the original C3K2 modules to improve multi-scale discriminability; the dynamic cross-path feature enhancement module (DCFEM) enables the fusion of hierarchical features from the backbone with aggregated features from the neck to enhance feature representations; and the light adaptive-weight downsampling module (LADS) improves feature downsampling through adaptive spatial weighting while reducing computational complexity. Experimental results on the CBC benchmark demonstrate that MS-YOLO achieves precise detection of overlapping cells and multi-scale objects, particularly small targets such as platelets, achieving an mAP@50 of 97.4% that outperforms existing models. Further validation on the supplementary WBCDD dataset confirms its robust generalization capability. Additionally, with a lightweight architecture and real-time inference efficiency, MS-YOLO meets clinical deployment requirements, providing reliable technical support for standardized blood pathology assessment.
[CV-27] Adapt before Continual Learning
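Quick read: This paper addresses the stability-plasticity trade-off that pre-trained models (PTMs) face on incremental tasks in Continual Learning (CL). Existing methods freeze the PTM backbone to preserve stability, which limits plasticity, while sequentially finetuning the entire PTM risks catastrophic forgetting of generalizable knowledge. The key to the solution is a framework named ACL (Adapting PTMs before the core CL process), which adds a plug-and-play adaptation phase before the core CL process. It adjusts embeddings to align with their original class prototypes while moving away from other classes, which is shown theoretically and empirically to balance stability and plasticity.
Link: https://arxiv.org/abs/2506.03956
Authors: Aojun Lu,Tao Feng,Hangjie Yuan,Chunhui Ding,Yanan Sun
Affiliations: Sichuan University; Tsinghua University; Zhejiang University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: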
Abstract:Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). While pre-trained models (PTMs) have become pivotal in CL, prevailing approaches freeze the PTM backbone to preserve stability, limiting their plasticity, particularly when encountering significant domain gaps in incremental tasks. Conversely, sequentially finetuning the entire PTM risks catastrophic forgetting of generalizable knowledge, exposing a critical stability-plasticity trade-off. To address this challenge, we propose Adapting PTMs before the core CL process (ACL), a novel framework that refines the PTM backbone through a plug-and-play adaptation phase before learning each new task with existing CL approaches (e.g., prompt tuning). ACL enhances plasticity by aligning embeddings with their original class prototypes while distancing them from others, theoretically and empirically shown to balance stability and plasticity. Extensive experiments demonstrate that ACL significantly improves CL performance across benchmarks and integrated methods, offering a versatile solution for PTM-based CL.
[CV-28] Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective
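Quick read: This paper addresses the stability-plasticity trade-off in Continual Learning (CL), that is, how to learn new knowledge effectively while preserving existing knowledge. The key to the solution works at the level of network architecture: a framework named Dual-Arch combines two independent, purpose-built networks, one dedicated to plasticity and the other to stability, and exploits their complementary architectural strengths to improve the performance of CL methods while reducing the parameter count.
Link: https://arxiv.org/abs/2506.03951
Authors: Aojun Lu,Hangjie Yuan,Tao Feng,Yanan Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: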
Abstract:The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. While numerous CL methods aim to achieve this trade-off, they often overlook the impact of network architecture on stability and plasticity, restricting the trade-off to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Arch, which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments demonstrate that Dual-Arch enhances the performance of existing CL methods while being up to 87% more compact in terms of parameters.
[CV-29] Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
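Quick read: This paper addresses the overconfidence of deep neural networks in medical image segmentation, which compromises reliability and clinical utility. The key to the solution is a differentiable marginal L1 Average Calibration Error (mL1-ACE) used as an auxiliary loss. It can be computed on a per-image basis and comes in hard-binned and soft-binned variants that directly improve pixel-wise calibration. Experiments show that adding mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). The soft-binned variant gives the largest calibration gains but often at some cost in segmentation performance, whereas the hard-binned variant maintains segmentation performance with a weaker calibration improvement.
Link: https://arxiv.org/abs/2506.03942
Authors: Theodore Barfoot,Luis C. Garcia-Peraza-Herrera,Samet Akcay,Ben Glocker,Tom Vercauteren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures, IEEE TMI submission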
Abstract:Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: this https URL
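The two calibration metrics named above can be made concrete with a small hard-binned computation for a single foreground class. This is a sketch of the standard metric definitions only, not the authors' differentiable loss: ACE is taken as the unweighted mean, over non-empty bins, of the absolute gap between mean predicted probability and observed foreground frequency, and MCE as the largest such gap. The function name and the choice of 10 equal-width bins are assumptions.

```python
def calibration_errors(probs, labels, num_bins=10):
    """Return (ACE, MCE) for per-pixel foreground probabilities and 0/1 labels."""
    # Each bin accumulates [sum of probabilities, sum of labels, pixel count].
    bins = [[0.0, 0.0, 0] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes to the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    gaps = [abs(conf / n - freq / n) for conf, freq, n in bins if n > 0]
    if not gaps:
        raise ValueError("no predictions given")
    return sum(gaps) / len(gaps), max(gaps)


# Six pixels: four predicted at 0.95 (three truly foreground), two at 0.15 (none foreground).
ace, mce = calibration_errors([0.95] * 4 + [0.15] * 2, [1, 1, 1, 0, 0, 0])
```

In this toy case both bins are overconfident: the high bin predicts 0.95 against an observed frequency of 0.75, and the low bin predicts 0.15 against 0.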
[CV-30] DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models
Quick read: This paper addresses the reliability of Vision Language Models (VLMs) under adversarial perturbations, which are usually imperceptible to humans yet can drastically change model outputs and lead to erroneous interpretations and decisions. The proposed solution, DiffCAP, cumulatively injects random Gaussian noise into the adversarially perturbed input until the embeddings of two consecutive noisy images reach a predefined similarity threshold, which neutralizes the adversarial effect. A pretrained diffusion model then denoises the stabilized image to recover a clean representation suitable for the VLM.
Link: https://arxiv.org/abs/2506.03933
Authors: Jia Fu,Yongtao Wu,Yihang Chen,Kunyu Peng,Xiao Zhang,Volkan Cevher,Sepideh Pashami,Anders Holst
Affiliations: KTH Royal Institute of Technology; RISE Research Institutes of Sweden; Swiss Federal Technology Institute of Lausanne; University of California, Los Angeles; Karlsruhe Institute of Technology; CISPA Helmholtz Center for Information Security
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We observe that adding minimal noise to an adversarially corrupted image significantly alters its latent embedding with respect to VLMs. Building on this insight, DiffCAP cumulatively injects random Gaussian noise into adversarially perturbed input data. This process continues until the embeddings of two consecutive noisy images reach a predefined similarity threshold, indicating a potential approach to neutralize the adversarial effect. Subsequently, a pretrained diffusion model is employed to denoise the stabilized image, recovering a clean representation suitable for the VLMs to produce an output. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP consistently outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with strong theoretical and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments.
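The stopping rule described above can be sketched as a loop that keeps adding Gaussian noise and compares the embeddings of consecutive noisy images. Everything specific here is an assumption for illustration: the image is a flat list of floats, `embed` is a caller-supplied stand-in for a VLM image encoder, cosine similarity is used as the similarity measure, and the values of `threshold`, `sigma`, and `max_steps` are arbitrary. The diffusion-based denoising that follows in the paper is not implemented.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cumulative_noise(image, embed, threshold=0.95, sigma=0.05, max_steps=100, seed=0):
    """Add Gaussian noise step by step until consecutive embeddings agree.

    Returns the noisy image and the number of noise steps applied.
    """
    rng = random.Random(seed)
    current = list(image)
    prev_emb = None
    for step in range(1, max_steps + 1):
        current = [x + rng.gauss(0.0, sigma) for x in current]
        emb = embed(current)
        if prev_emb is not None and cosine(prev_emb, emb) >= threshold:
            return current, step
        prev_emb = emb
    return current, max_steps

# A pretrained diffusion model would then denoise the returned image
# (hypothetical next step, outside the scope of this sketch).
```

With a real encoder, `embed` would wrap the VLM's vision tower; the cap `max_steps` only guards against a threshold that is never reached.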
[CV-31] Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample
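Quick read: This paper studies efficient Multimodal Large Language Models and addresses two problems: redundant vision tokens consume excessive computational resources, and compressing them loses visual information. Existing methods reduce the number of vision tokens by compressing them in the vision projector, but such simple compression can discard visual information, which especially hurts tasks that rely on fine-grained spatial relationships. The key to the solution is Vision Remember, a module inserted between the LLM decoder layers that lets vision tokens re-memorize vision features. Multi-level vision features are retained and resampled with the vision tokens that have interacted with the text tokens. Each vision token attends only to a local region of the vision features, a mechanism called saliency-enhancing local attention, which improves computational efficiency and captures finer-grained contextual and spatial relationships.
Link: https://arxiv.org/abs/2506.03928
Authors: Ze Feng,Jiang-Jiang Liu,Sen Yang,Lingyu Xiao,Xiaofan Li,Wankou Yang,Jingdong Wang
Affiliations: Southeast University; Baidu VIS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: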
Abstract:In this work, we study the Efficient Multimodal Large Language Model. Redundant vision tokens consume a significant amount of computational memory and resources. Therefore, many previous works compress them in the Vision Projector to reduce the number of vision tokens. However, simply compressing in the Vision Projector can lead to the loss of visual information, especially for tasks that rely on fine-grained spatial relationships, such as OCR and Chart/Table Understanding. To address this problem, we propose Vision Remember, which is inserted between the LLM decoder layers to allow vision tokens to re-memorize vision features. Specifically, we retain multi-level vision features and resample them with the vision tokens that have interacted with the text token. During the resampling process, each vision token only attends to a local region in vision features, which is referred to as saliency-enhancing local attention. Saliency-enhancing local attention not only improves computational efficiency but also captures more fine-grained contextual information and spatial relationships within the region. Comprehensive experiments on multiple visual understanding benchmarks validate the effectiveness of our method when combined with various Efficient Vision Projectors, showing performance gains without sacrificing efficiency. Based on Vision Remember, LLaVA-VR with only 2B parameters is also superior to previous representative MLLMs such as Tokenpacker-HD-7B and DeepSeek-VL-7B.
[CV-32] Multiple Stochastic Prompt Tuning for Practical Cross-Domain Few Shot Learning
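Quick read: This paper addresses cross-domain few-shot learning (CDFSL) under extreme domain shift, where all unseen classes must be classified simultaneously using only a few labeled samples per class. The key to the solution is a framework named MIST (MultIple STochastic Prompt tuning), which uses multiple stochastic prompts to handle significant domain and semantic shifts. The weights of the prompts are modeled as learnable Gaussian distributions, which enables efficient exploration of the prompt parameter space and mitigates the overfitting caused by the small number of labeled samples.
Link: https://arxiv.org/abs/2506.03926
Authors: Debarshi Brahma,Soma Biswas
Affiliations: Indian Institute of Science, Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: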
Abstract:In this work, we propose a practical cross-domain few-shot learning (pCDFSL) task, where a large-scale pre-trained model like CLIP can be easily deployed on a target dataset. The goal is to simultaneously classify all unseen classes under extreme domain shifts, by utilizing only a few labeled samples per class. The pCDFSL paradigm is source-free and moves beyond artificially created episodic training and testing regimes followed by existing CDFSL frameworks, making it more challenging and relevant to real-world applications. Towards that goal, we propose a novel framework, termed MIST (MultIple STochastic Prompt tuning), where multiple stochastic prompts are utilized to handle significant domain and semantic shifts. Specifically, multiple prompts are learnt for each class, effectively capturing multiple peaks in the input data. Furthermore, instead of representing the weights of the multiple prompts as point-estimates, we model them as learnable Gaussian distributions with two different strategies, encouraging an efficient exploration of the prompt parameter space, which mitigate overfitting due to the few labeled training samples. Extensive experiments and comparison with the state-of-the-art methods on four CDFSL benchmarks adapted to this setting, show the effectiveness of the proposed framework.
[CV-33] Learning from Noise: Enhancing DNNs for Event-Based Vision through Controlled Noise Injection
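Quick read: This paper addresses the heavy noise in event data produced by event cameras under rapid motion or challenging lighting, which degrades the performance and robustness of deep learning models. Traditional approaches remove noise with filtering algorithms, but these may also discard useful information. The key to the solution is a noise-injection training methodology: controlled noise is introduced into the training data so that neural networks learn noise-resilient representations, improving stability and classification accuracy across noise levels.
Link: https://arxiv.org/abs/2506.03918
Authors: Marcin Kowalczyk,Kamil Jeziorek,Tomasz Kryjak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: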
Abstract:Event-based sensors offer significant advantages over traditional frame-based cameras, especially in scenarios involving rapid motion or challenging lighting conditions. However, event data frequently suffers from considerable noise, negatively impacting the performance and robustness of deep learning models. Traditionally, this problem has been addressed by applying filtering algorithms to the event stream, but this may also remove some of the relevant data. In this paper, we propose a novel noise-injection training methodology designed to enhance neural networks' robustness against varying levels of event noise. Our approach introduces controlled noise directly into the training data, enabling models to learn noise-resilient representations. We have conducted extensive evaluations of the proposed method using multiple benchmark datasets (N-Caltech101, N-Cars, and Mini N-ImageNet) and various network architectures, including Convolutional Neural Networks, Vision Transformers, Spiking Neural Networks, and Graph Convolutional Networks. Experimental results show that our noise-injection training strategy achieves stable performance over a range of noise intensities, consistently outperforms event-filtering techniques, and achieves the highest average classification accuracy, making it a viable alternative to traditional event-data filtering methods in an object classification system. Code: this https URL
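Injecting controlled noise into event training data can be sketched as mixing synthetic events into each recording. The abstract does not describe the noise model, so this sketch assumes noise events that are uniform in space and time with random polarity; the `(x, y, t, polarity)` tuple layout, the `noise_ratio` parameter, and the function name `inject_noise` are all hypothetical.

```python
import random


def inject_noise(events, width, height, noise_ratio, seed=None):
    """Return events plus uniformly distributed noise events, sorted by time.

    events: list of (x, y, t, polarity) tuples with polarity in {-1, 1}.
    noise_ratio: number of noise events as a fraction of the real events.
    """
    if not events:
        return []
    rng = random.Random(seed)
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    n_noise = int(len(events) * noise_ratio)
    noise = [
        (rng.randrange(width), rng.randrange(height),
         rng.uniform(t_min, t_max), rng.choice((-1, 1)))
        for _ in range(n_noise)
    ]
    return sorted(events + noise, key=lambda e: e[2])


# Four real events on a 10x10 sensor, augmented with 50% extra noise events.
events = [(1, 2, 0.0, 1), (3, 4, 0.5, -1), (5, 6, 1.0, 1), (7, 8, 2.0, -1)]
noisy = inject_noise(events, width=10, height=10, noise_ratio=0.5, seed=0)
```

During training, `noise_ratio` would typically be varied per sample so that the network sees a range of noise intensities.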
[CV-34] Joint Video Enhancement with Deblurring Super-Resolution and Frame Interpolation Network
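Quick read: This paper addresses severe video quality degradation caused by multiple factors acting together. Applying enhancement techniques sequentially is inefficient and sub-optimal, because most video enhancement methods are designed without considering joint degradation. The key to the solution is a joint video enhancement method that mitigates multiple degradation factors simultaneously by solving one integrated enhancement problem. The proposed network, DSFN, combines a joint deblurring and super-resolution (JDSR) module with a triple-frame-based frame interpolation (TFBFI) module to directly produce a high-resolution, high-frame-rate, clear video, outperforming sequential methods with a smaller network and faster processing.
Link: https://arxiv.org/abs/2506.03892
Authors: Giyong Choi,HyunWook Park
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: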
Abstract:Video quality is often severely degraded by multiple factors rather than a single factor. These low-quality videos can be restored to high-quality videos by sequentially performing appropriate video enhancement techniques. However, the sequential approach was inefficient and sub-optimal because most video enhancement approaches were designed without taking into account that multiple factors together degrade video quality. In this paper, we propose a new joint video enhancement method that mitigates multiple degradation factors simultaneously by resolving an integrated enhancement problem. Our proposed network, named DSFN, directly produces a high-resolution, high-frame-rate, and clear video from a low-resolution, low-frame-rate, and blurry video. In the DSFN, low-resolution and blurry input frames are enhanced by a joint deblurring and super-resolution (JDSR) module. Meanwhile, intermediate frames between input adjacent frames are interpolated by a triple-frame-based frame interpolation (TFBFI) module. The proper combination of the proposed modules of DSFN can achieve superior performance on the joint video enhancement task. Experimental results show that the proposed method outperforms other sequential state-of-the-art techniques on public datasets with a smaller network size and faster processing time.
[CV-35] Video How Do Your Tokens Merge? CVPR2025
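[Quick read]: This paper addresses the large compute cost of video transformers, which stems from the spatio-temporal scaling of the input. The key to the solution is training-free token merging for video: merging tokens reduces computation while maintaining accuracy, can be plugged into any vision transformer, requires no re-training, and propagates information that would otherwise be dropped.
Link: https://arxiv.org/abs/2506.03885
Authors: Sam Pollard, Michael Wray
Affiliations: University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at eLVM workshop at CVPR 2025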
Abstract:Video transformer models require huge amounts of compute resources due to the spatio-temporal scaling of the input. Tackling this, recent methods have proposed to drop or merge tokens for image models, whether randomly or via learned methods. Merging tokens has many benefits: it can be plugged into any vision transformer, does not require model re-training, and it propagates information that would otherwise be dropped through the model. Before now, video token merging has not been evaluated on temporally complex datasets for video understanding. In this work, we explore training-free token merging for video to provide comprehensive experiments and find best practices across four video transformers on three datasets that exhibit coarse and fine-grained action recognition. Our results showcase the benefits of video token merging with a speedup of around 2.5 X while maintaining accuracy (avg. -0.55% for ViViT). Code available at this https URL.
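To make "training-free token merging" concrete, here is a minimal sketch of one common merging scheme, a simplified bipartite matching in which tokens are split into two sets and the `r` most redundant tokens of one set are averaged into their nearest neighbour in the other. The abstract does not say which merging rules the paper evaluates, so this particular scheme, the function name `merge_tokens`, and the alternating split are assumptions for illustration only.

```python
import numpy as np

def merge_tokens(tokens, r):
    """Reduce the token count by r using a simplified bipartite matching.

    tokens: float array of shape (N, D). Requires 0 <= r <= number of even-indexed tokens.
    Returns an array of shape (N - r, D); token order is not preserved.
    """
    a, b = tokens[0::2], tokens[1::2]              # alternate tokens into sets A and B
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                              # cosine similarity, shape (|A|, |B|)
    best_b = sim.argmax(axis=1)                    # closest B token for every A token
    order = np.argsort(-sim.max(axis=1))           # A tokens, most similar first
    merged_idx, kept_idx = order[:r], np.sort(order[r:])
    sums, counts = b.copy(), np.ones(len(b))
    for i in merged_idx:                           # fold each merged A token into its match
        sums[best_b[i]] += a[i]
        counts[best_b[i]] += 1
    b_out = sums / counts[:, None]                 # average of each B token and its merged partners
    return np.vstack([a[kept_idx], b_out])

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(8, 4))             # e.g. 8 spatio-temporal tokens of width 4
reduced = merge_tokens(video_tokens, r=3)
```

Because merged tokens are averaged rather than discarded, their information still reaches later layers, which is the property the abstract contrasts with token dropping.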
[CV-36] JointSplat: Probabilistic Joint Flow-Depth Optimization for Sparse-View Gaussian Splatting
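[Quick read]: This paper addresses geometric errors and inconsistency in 3D scene reconstruction from sparse views: feed-forward multi-view depth estimation suffers from mislocation and artifacts in low-texture or repetitive regions, while flow-depth joint estimation is prone to local noise and global inconsistency when ground-truth flow supervision is unavailable. The key to the solution is JointSplat, a unified framework that exploits the complementarity between optical flow and depth through a probabilistic optimization mechanism. The mechanism scales the fusion of depth and flow information per pixel according to the optical-flow matching probability, and a multi-view depth-consistency loss leverages reliable supervision while suppressing misleading gradients in uncertain areas.
Link: https://arxiv.org/abs/2506.03872
Authors: Yang Xiao, Guoan Xu, Qiang Wu, Wenjing Jia
Affiliations: University of Technology Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: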
Abstract:Reconstructing 3D scenes from sparse viewpoints is a long-standing challenge with wide applications. Recent advances in feed-forward 3D Gaussian sparse-view reconstruction methods provide an efficient solution for real-time novel view synthesis by leveraging geometric priors learned from large-scale multi-view datasets and computing 3D Gaussian centers via back-projection. Despite offering strong geometric cues, both feed-forward multi-view depth estimation and flow-depth joint estimation face key limitations: the former suffers from mislocation and artifact issues in low-texture or repetitive regions, while the latter is prone to local noise and global inconsistency due to unreliable matches when ground-truth flow supervision is unavailable. To overcome this, we propose JointSplat, a unified framework that leverages the complementarity between optical flow and depth via a novel probabilistic optimization mechanism. Specifically, this pixel-level mechanism scales the information fusion between depth and flow based on the matching probability of optical flow during training. Building upon the above mechanism, we further propose a novel multi-view depth-consistency loss to leverage the reliability of supervision while suppressing misleading gradients in uncertain areas. Evaluated on RealEstate10K and ACID, JointSplat consistently outperforms state-of-the-art (SOTA) methods, demonstrating the effectiveness and robustness of our proposed probabilistic joint flow-depth optimization approach for high-fidelity sparse-view 3D reconstruction.
[CV-37] Animal Pose Labeling Using General-Purpose Point Trackers
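[Quick read]: This paper addresses automatic animal pose estimation from video, which matters for studying animal behavior. Existing methods are unreliable because their training datasets are not comprehensive enough to capture all necessary behaviors, and such datasets are hard to collect given the large variation in animal morphology. The key to the solution is test-time optimization: a lightweight appearance embedding inside a pre-trained general-purpose point tracker is fine-tuned on a sparse set of annotated frames, and the fine-tuned model then labels the remaining frames automatically, reaching state-of-the-art performance at a reasonable annotation cost.
Link: https://arxiv.org/abs/2506.03868
Authors: Zhuoyang Pan, Boxiao Pan, Guandao Yang, Adam W. Harley, Leonidas Guibas
Affiliations: Stanford University; ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: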
Abstract:Automatically estimating animal poses from videos is important for studying animal behaviors. Existing methods do not perform reliably since they are trained on datasets that are not comprehensive enough to capture all necessary animal behaviors. However, it is very challenging to collect such datasets due to the large variations in animal morphology. In this paper, we propose an animal pose labeling pipeline that follows a different strategy, i.e. test time optimization. Given a video, we fine-tune a lightweight appearance embedding inside a pre-trained general-purpose point tracker on a sparse set of annotated frames. These annotations can be obtained from human labelers or off-the-shelf pose detectors. The fine-tuned model is then applied to the rest of the frames for automatic labeling. Our method achieves state-of-the-art performance at a reasonable annotation cost. We believe our pipeline offers a valuable tool for the automatic quantification of animal behavior. Visit our project webpage at this https URL.
[CV-38] ConText: Driving In-context Learning for Text Removal and Segmentation ICML2025
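[Quick read]: This paper adapts the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically text removal and segmentation. Existing approaches use a plain image-label compositor as the prompt, which confines the model to a difficult single-step reasoning process. The key to the solution is a task-chaining compositor of the form image-removal-segmentation, which yields an enhanced prompt with enriched intermediates, together with context-aware aggregation that integrates the chained prompt pattern into the latent query representation to strengthen in-context reasoning.
Link: https://arxiv.org/abs/2506.03799
Authors: Fei Zhang, Pei Zhang, Baosong Yang, Fei Huang, Yanfeng Wang, Ya Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures, Accepted at ICML 2025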
Abstract:This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they turn to using a straightforward image-label compositor as the prompt and query input, and then masking the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model’s in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model’s in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at this https URL.
[CV-39] CoLa: Chinese Character Decomposition with Compositional Latent Components
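[Quick read]: This paper addresses the zero-shot problem in Chinese character recognition (CCR), which results from the long-tail distribution of Chinese character datasets. Existing methods model compositionality through predefined radical or stroke decomposition but often ignore the learning-to-learn capability, which limits generalization beyond human-defined schemes. The key to the solution is a deep latent variable model, Compositional Latent components of Chinese characters (CoLa), which learns compositional latent components without human-defined decomposition schemes and performs zero-shot recognition by comparing those components in the latent space.
Link: https://arxiv.org/abs/2506.03798
Authors: Fan Shi, Haiyang Yu, Bin Li, Xiangyang Xue
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: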
Abstract: Humans can decompose Chinese characters into compositional components and recombine them to recognize unseen characters. This reflects two cognitive principles: Compositionality, the idea that complex concepts are built on simpler parts; and Learning-to-learn, the ability to learn strategies for decomposing and recombining components to form new concepts. These principles provide inductive biases that support efficient generalization. They are critical to Chinese character recognition (CCR) in solving the zero-shot problem, which results from the common long-tail distribution of Chinese character datasets. Existing methods have made substantial progress in modeling compositionality via predefined radical or stroke decomposition. However, they often ignore the learning-to-learn capability, limiting their ability to generalize beyond human-defined schemes. Inspired by these principles, we propose a deep latent variable model that learns Compositional Latent components of Chinese characters (CoLa) without relying on human-defined decomposition schemes. Recognition and matching can be performed by comparing compositional latent components in the latent space, enabling zero-shot character recognition. The experiments illustrate that CoLa outperforms previous methods in both character and radical zero-shot CCR. Visualization indicates that the learned components can reflect the structure of characters in an interpretable way. Moreover, despite being trained on historical documents, CoLa can analyze components of oracle bone characters, highlighting its cross-dataset generalization ability.
[CV-40] HUMOF: Human Motion Forecasting in Interactive Social Scenes
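[Quick read]: This paper addresses human behavior prediction in complex scenes, where abundant interaction information (human-human and human-environment) complicates analysis and raises the uncertainty of motion forecasting. The key to the solution is a hierarchical interaction feature representation, in which high-level features capture the overall context of interactions and low-level features capture fine-grained details, combined with a coarse-to-fine interaction reasoning module that uses both spatial and frequency perspectives to exploit the hierarchical features efficiently.
Link: https://arxiv.org/abs/2506.03753
Authors: Caiyi Sun, Yujing Sun, Xiao Han, Zemin Yang, Jiawei Liu, Xinge Zhu, Siu Ming Yiu, Yuexin Ma
Affiliations: The University of Hong Kong; ShanghaiTech University; Sun Yat-sen University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: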
Abstract: Complex scenes present significant challenges for predicting human behaviour due to the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behaviour, thereby increasing the uncertainty in forecasting human motions. Existing motion prediction methods thus struggle in these complex scenarios. In this paper, we propose an effective method for human motion forecasting in interactive scenes. To achieve a comprehensive representation of interactions, we design a hierarchical interaction feature representation so that high-level features capture the overall context of the interactions, while low-level features focus on fine-grained details. Besides, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to efficiently utilize hierarchical features, thereby enhancing the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. Code will be released when this paper is published.
[CV-41] SAAT: Synergistic Alternating Aggregation Transformer for Image Super-Resolution
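[Quick read]: This paper addresses a limitation of current Transformer-based single image super-resolution models: self-attention is computed in non-overlapping windows and only its result is used, which neglects cross-channel information and the rich spatial structure produced in the intermediate process. The key to the solution is the Synergistic Alternating Aggregation Transformer (SAAT), which introduces the Efficient Channel Window Synergistic Attention Group (CWSAG) and the Spatial Window Synergistic Attention Group (SWSAG) so that channel and spatial attention work together and the latent information in features is used more fully.
Link: https://arxiv.org/abs/2506.03740
Authors: Jianfeng Wu, Nannan Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: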
Abstract: Single image super-resolution is a well-known downstream task which aims to restore low-resolution images into high-resolution images. At present, models based on Transformers have shone brightly in the field of super-resolution due to their ability to capture long-term dependencies in information. However, current methods typically compute self-attention in non-overlapping windows to save computational costs, and the standard self-attention computation only focuses on its results, thereby neglecting the useful information across channels and the rich spatial structural information generated in the intermediate process. Channel attention and spatial attention have, respectively, brought significant improvements to various downstream visual tasks in terms of extracting feature dependency and spatial structure relationships, but the synergistic relationship between channel and spatial attention has not been fully explored. To address these issues, we propose a novel model, Synergistic Alternating Aggregation Transformer (SAAT), which can better utilize the potential information of features. In SAAT, we introduce the Efficient Channel Window Synergistic Attention Group (CWSAG) and the Spatial Window Synergistic Attention Group (SWSAG). On the one hand, CWSAG combines efficient channel attention with shifted window attention, enhancing non-local feature fusion, and producing more visually appealing results. On the other hand, SWSAG leverages spatial attention to capture rich structured feature information, thereby enabling SAAT to more effectively extract structural this http URL experimental results and ablation studies demonstrate the effectiveness of SAAT in the field of super-resolution. SAAT achieves performance comparable to that of the state-of-the-art (SOTA) under the same quantity of parameters.
[CV-42] ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
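[Quick read]: This paper addresses the limited robustness and flexibility of traditional position encodings, and the fact that Rotary Positional Encoding (RoPE) relies on manually defined rotation matrices with a limited transformation space, which constrains model capacity. The key to the solution is ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. The paper shows that pairwise commutativity of these matrices is essential for scalability and positional robustness, and it formally defines the RoPE Equation, a condition that ensures consistent performance under position offsets.
Link: https://arxiv.org/abs/2506.03737
Authors: Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, Huaisong Zhang, Chun Yuan
Affiliations: Tsinghua University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: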
Abstract:The Transformer architecture has revolutionized various regions since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to lack of robustness and flexibility of position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model’s capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at this https URL
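The role of commutativity can be checked numerically in the simplest case. The sketch below is not the paper's parameterization: it assumes block-diagonal angle matrices built from 2x2 rotation generators (the form standard RoPE uses), which commute trivially, and the names `rotation_blocks`, `rope_matrix`, and `freqs` are hypothetical. In a trainable variant, `freqs` would be a learned parameter. The property shown is that the attention score depends only on the relative offset between positions.

```python
import numpy as np

def rotation_blocks(angles):
    """Block-diagonal matrix with one 2x2 rotation per angle."""
    d = 2 * len(angles)
    R = np.zeros((d, d))
    for k, theta in enumerate(angles):
        c, s = np.cos(theta), np.sin(theta)
        R[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[c, -s], [s, c]]
    return R

def rope_matrix(pos, freqs):
    """Rotation for a multi-axis position.

    pos: (n_axes,) coordinates; freqs: (n_axes, d/2) angular frequencies per axis.
    Each axis contributes a block-diagonal skew-symmetric angle matrix; these commute,
    so the matrix exponential of their weighted sum is a rotation by the summed angles.
    """
    return rotation_blocks(pos @ freqs)

freqs = np.array([[1.0, 0.5, 0.25], [0.3, 0.7, 0.2]])  # 2 axes, embedding width 6
p, q = np.array([2.0, 5.0]), np.array([7.0, 1.0])
shift = np.array([3.0, -4.0])

rng = np.random.default_rng(0)
query, key = rng.normal(size=6), rng.normal(size=6)

# Attention logit before and after shifting both positions by the same offset.
score = (rope_matrix(p, freqs) @ query) @ (rope_matrix(q, freqs) @ key)
score_shifted = (rope_matrix(p + shift, freqs) @ query) @ (rope_matrix(q + shift, freqs) @ key)
```

Here `score` and `score_shifted` agree up to floating-point error because `rope_matrix(p).T @ rope_matrix(q)` equals `rope_matrix(q - p)`, an identity that relies on the angle matrices commuting.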
[CV-43] FSHNet: Fully Sparse Hybrid Network for 3D Object Detection CVPR2025
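[Quick read]: This paper addresses weaknesses of fully sparse 3D detectors, which extract features only from non-empty voxels; this impairs long-range interactions and causes center features to be missing. The key to the solution is the Fully Sparse Hybrid Network (FSHNet). Its SlotFormer block partitions sparse voxels by slots, which gives a larger receptive field than window partition and strengthens long-range feature extraction; a dynamic sparse label assignment strategy supplies more high-quality positive samples for optimization; and a sparse upsampling module preserves the fine-grained details needed to detect small objects.
Link: https://arxiv.org/abs/2506.03714
Authors: Shuai Liu, Mingyue Cui, Boyang Li, Quanmin Liang, Tinghe Hong, Kai Huang, Yunxiao Shan, Kai Huang
Affiliations: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025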
Abstract:Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and causes the center feature missing. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code is available at this https URL.
[CV-44] PlückeRF: A Line-based 3D Representation for Few-view Reconstruction
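[Quick read]: This paper aims to improve single-view and few-view 3D reconstruction methods by making better use of multi-view information when it is available. The key to the solution is PlückeRF, a 3D representation defined as a set of structured, feature-augmented lines. It connects the 3D representation with pixel rays from the input views, so that information is shared preferentially between nearby 3D locations and between 3D locations and nearby pixel rays.
Link: https://arxiv.org/abs/2506.03713
Authors: Sam Bahrami, Dylan Campbell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: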
Abstract:Feed-forward 3D reconstruction methods aim to predict the 3D structure of a scene directly from input images, providing a faster alternative to per-scene optimization approaches. Significant progress has been made in single-view and few-view reconstruction using learned priors that infer object shape and appearance, even for unobserved regions. However, there is substantial potential to enhance these methods by better leveraging information from multiple views when available. To address this, we propose a few-view reconstruction model that more effectively harnesses multi-view information. Our approach introduces a simple mechanism that connects the 3D representation with pixel rays from the input views, allowing for preferential sharing of information between nearby 3D locations and between 3D locations and nearby pixel rays. We achieve this by defining the 3D representation as a set of structured, feature-augmented lines; the PlückeRF representation. Using this representation, we demonstrate improvements in reconstruction quality over the equivalent triplane representation and state-of-the-art feedforward reconstruction methods.
[CV-45] OSGNet @ Ego4D Episodic Memory Challenge 2025 CVPR
[Quick read]: This paper addresses precise localization of a temporal interval within an untrimmed egocentric video, as required by the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge. Previous unified video localization approaches often rely on late fusion strategies, which tend to yield suboptimal results. The key to the solution is an early fusion-based video localization model applied to all three tasks; the method took first place in the Natural Language Queries, Goal Step, and Moment Queries tracks.
Link: https://arxiv.org/abs/2506.03710
Authors: Yisen Feng, Haoyu Zhang, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The champion solutions for the three egocentric video localization tracks (Natural Language Queries, Goal Step, and Moment Queries tracks) of the Ego4D Episodic Memory Challenge at CVPR EgoVis Workshop 2025
Abstract:In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. All tracks require precise localization of the interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early fusion-based video localization model to tackle all three tasks, aiming to enhance localization accuracy. Ultimately, our method achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code can be found at this https URL.
[CV-46] AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives CVPR2025
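[Quick read]: This paper addresses the weak cross-domain generalization of open-vocabulary semantic segmentation (OVSS), which limits its effectiveness in real-world applications. The key to the solution is AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives that supports extensive evaluation across viewing angles and sensor modalities, together with an analysis of the key factors that affect the performance of zero-shot transfer models.
Link: https://arxiv.org/abs/2506.03709
Authors: Aniruddh Sikdar, Aditya Gandhamal, Suresh Sundaram
Affiliations: Indian Institute of Science, Bengaluru, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Workshop on Foundation Models Meet Embodied Agents at CVPR 2025 (Non-archival Track)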
Abstract: Open-vocabulary semantic segmentation (OVSS) involves assigning labels to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, OVSS models encounter significant challenges in cross-domain generalization, hindering their practical efficacy in real-world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities, and in this study, we present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.
[CV-47] OV-COAST: Cost Aggregation with Optimal Transport for Open-Vocabulary Semantic Segmentation CVPR2025
[Quick read]: This paper addresses the limited out-of-domain generalization of open-vocabulary semantic segmentation (OVSS) models. The key to the solution is Cost Aggregation with Optimal Transport (OV-COAST), which uses the cost volume to construct a cost matrix quantifying the distance between two distributions and follows a two-stage optimization strategy: the first stage solves the optimal transport problem via Sinkhorn distance to obtain an alignment solution, and the second stage uses that solution to guide the training of the CAT-Seg model.
Link: https://arxiv.org/abs/2506.03706
Authors: Aditya Gandhamal, Aniruddh Sikdar, Suresh Sundaram
Affiliations: Kotak IISc AI-ML Centre, Indian Institute of Science, Bengaluru, India; Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bengaluru, India; Department of Aerospace Engineering, Indian Institute of Science, Bengaluru, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025 Workshop on Transformers for Vision (Non-archival track)
Abstract: Open-vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out-of-domain generalization, we propose Cost Aggregation with Optimal Transport (OV-COAST) for open-vocabulary semantic segmentation. To align visual-language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two-stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT-Seg model. We evaluate state-of-the-art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost-aggregation model CAT-Seg with ViT-B backbone, achieving superior results, surpassing CAT-Seg by 1.72% and SAN-B by 4.9% mIoU. The code is available at this https URL.
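The first stage described in this abstract, solving an optimal transport problem with Sinkhorn iterations over a cost matrix, can be sketched generically. This is the textbook entropy-regularized Sinkhorn algorithm, not the authors' implementation: the cost here is assumed to be one minus cosine similarity between a few random stand-in "visual" and "text" feature vectors, both marginals are assumed uniform, and `eps` and `n_iters` are illustrative values.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Entropy-regularized optimal transport plan between histograms a and b.

    cost: (n, m) cost matrix; a: (n,) and b: (m,) non-negative weights summing to 1.
    Returns an (n, m) plan whose rows sum to a and whose columns sum (approximately) to b.
    """
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # rescale columns to match b
        u = a / (K @ v)               # rescale rows to match a
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))      # stand-in visual features (assumption)
text = rng.normal(size=(3, 8))        # stand-in text features (assumption)
visual /= np.linalg.norm(visual, axis=1, keepdims=True)
text /= np.linalg.norm(text, axis=1, keepdims=True)

cost = 1.0 - visual @ text.T          # cosine distance in [0, 2]
a = np.full(4, 1 / 4)
b = np.full(3, 1 / 3)
plan = sinkhorn(cost, a, b)
```

The resulting `plan` assigns more mass to visual-text pairs with low cost; how such an alignment is then used to guide CAT-Seg training is specific to the paper and is not reproduced here.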
[CV-48] Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research
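[Quick read]: This review examines how artificial intelligence (AI) can improve diagnostic accuracy and workflow efficiency in cardiovascular medicine. Deep learning architectures such as convolutional neural networks and generative adversarial networks enable automated analysis of medical imaging and physiological signals, which the review describes as surpassing human capabilities on both counts. It also identifies the inability to validate input data accuracy as a critical open challenge and stresses the need for robust validation protocols to ensure clinical reliability.
Link: https://arxiv.org/abs/2506.03698
Authors: Yuanlin Mo, Haishan Huang, Bocheng Liang, Weibo Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: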
Abstract:Recent advancements in artificial intelligence (AI) have revolutionized cardiovascular medicine, particularly through integration with computed tomography (CT), magnetic resonance imaging (MRI), electrocardiography (ECG) and ultrasound (US). Deep learning architectures, including convolutional neural networks and generative adversarial networks, enable automated analysis of medical imaging and physiological signals, surpassing human capabilities in diagnostic accuracy and workflow efficiency. However, critical challenges persist, including the inability to validate input data accuracy, which may propagate diagnostic errors. This review highlights AI’s transformative potential in precision diagnostics while underscoring the need for robust validation protocols to ensure clinical reliability. Future directions emphasize hybrid models integrating multimodal data and adaptive algorithms to refine personalized cardiovascular care.
[CV-49] DSSAU-Net:U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation MICCAI
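[Quick read]: This paper addresses the subjectivity and inaccuracy of traditional invasive vaginal examinations during childbirth. Ultrasound offers an objective way to assess fetal head position, but it depends on accurate segmentation of the fetal head (FH) and pubic symphysis (PS). The key to the solution is DSSAU-Net, a sparse self-attention architecture with good performance and high computational efficiency. It stacks varying numbers of Dual Sparse Selection Attention (DSSA) blocks at each stage to form a symmetric U-shaped encoder-decoder, and adds skip connections with convolutions and multiscale feature fusion.
Link: https://arxiv.org/abs/2506.03684
Authors: Zunhui Xia, Hongxing Li, Libin Lan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 3 figures, 5 tables. Accepted by MICCAI Workshop on IUGC 2024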
Abstract: In the childbirth process, traditional methods involve invasive vaginal examinations, but research has shown that these methods are both subjective and inaccurate. Ultrasound-assisted diagnosis offers an objective yet effective way to assess fetal head position via two key parameters: Angle of Progression (AoP) and Head-Symphysis Distance (HSD), calculated by segmenting the fetal head (FH) and pubic symphysis (PS), which aids clinicians in ensuring a smooth delivery process. Therefore, accurate segmentation of FH and PS is crucial. In this work, we propose a sparse self-attention network architecture with good performance and high computational efficiency, named DSSAU-Net, for the segmentation of FH and PS. Specifically, we stack varying numbers of Dual Sparse Selection Attention (DSSA) blocks at each stage to form a symmetric U-shaped encoder-decoder network architecture. For a given query, DSSA is designed to explicitly perform one sparse token selection at both the region and pixel levels, respectively, which is beneficial for further reducing computational complexity while extracting the most relevant features. To compensate for the information loss during the upsampling process, skip connections with convolutions are designed. Additionally, multiscale feature fusion is employed to enrich the model's global and local information. The performance of DSSAU-Net has been validated using the Intrapartum Ultrasound Grand Challenge (IUGC) 2024 test set provided by the organizer in the MICCAI IUGC 2024 competition, where we won fourth place on the tasks of classification and segmentation, demonstrating its effectiveness. The codes will be available at this https URL.
[CV-50] PRJ: Perception-Retrieval-Judgement for Generated Images
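[Quick read]: This paper addresses limitations of current image safety systems for AI-generated visual content: they rely on rigid category filters and binary outputs, cannot interpret context or reason about nuanced, adversarially induced harm, and are evaluated with standard metrics that miss the semantic severity and dynamic progression of toxicity. The key to the solution is Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as structured reasoning in three stages: transform the image into descriptive language (perception), retrieve external knowledge about harm categories and traits (retrieval), and evaluate toxicity against legal or normative rules (judgement). This enables detection of explicit and implicit harms with better interpretability and categorical granularity.
Link: https://arxiv.org/abs/2506.03683
Authors: Qiang Fu, Zonglei Jing, Zonghao Ying, Xiaoqian Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: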
Abstract:The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.
[CV-51] How PARTs assemble into wholes: Learning the relative composition of images
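[Quick read]: This paper addresses the failure of grid-based self-supervised learning methods to capture the fluid, continuous nature of real-world object compositions; existing pretext tasks predict the absolute position index of patches within a fixed grid. The key to the solution is PART, which leverages continuous relative transformations between off-grid patches to learn the relative composition of images, an off-grid structural relative positioning process that generalizes beyond occlusions and deformations and improves performance on tasks requiring precise spatial understanding.
Link: https://arxiv.org/abs/2506.03682
Authors: Melika Ayoughi, Samira Abnar, Chen Huang, Chris Sandino, Sayeri Lala, Eeshan Gunesh Dhekane, Dan Busbridge, Shuangfei Zhai, Vimal Thilak, Josh Susskind, Pascal Mettes, Paul Groth, Hanlin Goh
Affiliations: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: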
Abstract:The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning process that generalizes beyond occlusions and deformations. In tasks requiring precise spatial understanding such as object detection and time series prediction, PART outperforms strong grid-based methods like MAE and DropPos, while also maintaining competitive performance on global classification tasks with minimal hyperparameter tuning. By breaking free from grid constraints, PART opens up an exciting new trajectory for universal self-supervised pretraining across diverse datatypes, from natural images to EEG signals, with promising potential in video, medical imaging, and audio.
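To make the idea of a continuous relative transformation between off-grid patches more concrete, here is a minimal sketch. The abstract does not give the exact parameterization, so the patch format `(x, y, w, h)`, the helper names `sample_patch` and `relative_transform`, and the choice of normalized translation plus log-scale are all assumptions made for illustration, not the authors' formulation.

```python
import math
import random


def sample_patch(image_w, image_h, min_size=16, max_size=64, rng=random):
    """Sample an off-grid patch as (x, y, w, h) with continuous coordinates."""
    w = rng.uniform(min_size, max_size)
    h = rng.uniform(min_size, max_size)
    x = rng.uniform(0, image_w - w)
    y = rng.uniform(0, image_h - h)
    return (x, y, w, h)


def relative_transform(ref, other):
    """Translation and log-scale of `other` expressed in the frame of `ref`.

    Hypothetical target for a relative-position pretext task: the offset is
    normalized by the reference patch size, so it does not depend on any grid.
    """
    rx, ry, rw, rh = ref
    ox, oy, ow, oh = other
    dx = (ox - rx) / rw
    dy = (oy - ry) / rh
    return (dx, dy, math.log(ow / rw), math.log(oh / rh))
```

A model trained on such a task would see the pixel content of two patches and regress this tuple, instead of classifying an absolute grid index.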
[CV-52] BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation
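Quick read: This paper addresses the problem in multi-modal semantic segmentation that existing methods, by fusing features or distilling knowledge, restrict each modality from fully exploiting its strengths. The key to the solution is to reformulate multi-modal semantic segmentation as a mask-level classification task and to propose BiXFormer, which uses Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. UMM comprises Modality Agnostic Matching (MAM) and Complementary Matching (CM), which exploit each modality's strengths and compensate for missing modalities, while CMA strengthens the representation of the weaker queries assigned in CM, improving overall performance.
Link: https://arxiv.org/abs/2506.03675
Authors: Jialei Chen,Xu Zheng,Danda Pani Paudel,Luc Van Gool,Hiroshi Murase,Daisuke Deguchi
Affiliations: Nagoya University; The Hong Kong University of Science and Technology, Guangzhou Campus (HKUST-GZ); INSAIT, Sofia University, St. Kliment Ohridski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: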
Abstract:Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality’s ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modalities, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality’s strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over prior art.
[CV-53] Accelerating SfM-based Pose Estimation with Dominating Set
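Quick read: This paper aims to address the insufficient processing speed of Structure-from-Motion (SfM) based pose estimation in real-time applications such as augmented reality, virtual reality, and robotics. The key to the solution is to preprocess SfM models using the graph-theoretic concept of a dominating set, which substantially speeds up pose estimation without a significant loss of accuracy.
Link: https://arxiv.org/abs/2506.03667
Authors: Joji Joseph,Bharadwaj Amrutur,Shalabh Bhatnagar
Affiliations: Indian Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: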
Abstract:This paper introduces a preprocessing technique to speed up Structure-from-Motion (SfM) based pose estimation, which is critical for real-time applications like augmented reality (AR), virtual reality (VR), and robotics. Our method leverages the concept of a dominating set from graph theory to preprocess SfM models, significantly enhancing the speed of the pose estimation process without losing significant accuracy. Using the OnePose dataset, we evaluated our method across various SfM-based pose estimation techniques. The results demonstrate substantial improvements in processing speed, ranging from 1.5 to 14.48 times, and a reduction in reference images and point cloud size by factors of 17-23 and 2.27-4, respectively. This work offers a promising solution for efficient and accurate 3D pose estimation, balancing speed and accuracy in real-time applications.
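For readers unfamiliar with the term, a dominating set of a graph is a subset of nodes such that every node is either in the subset or adjacent to a member of it. The sketch below shows a standard greedy heuristic on a small hypothetical covisibility graph, where nodes are reference images and edges link images that share 3D points. The graph, the function names, and the use of the greedy heuristic are assumptions for illustration; the abstract does not state how the authors construct the graph or compute the set.

```python
def greedy_dominating_set(adjacency):
    """Return a dominating set of an undirected graph via a greedy heuristic.

    adjacency: dict mapping each node to the set of its neighbours.
    """
    undominated = set(adjacency)
    chosen = set()
    while undominated:
        # Pick the node that newly dominates the most nodes (itself + neighbours).
        best = max(
            sorted(adjacency),
            key=lambda n: len(({n} | adjacency[n]) & undominated),
        )
        chosen.add(best)
        undominated -= {best} | adjacency[best]
    return chosen


def is_dominating(adjacency, subset):
    """Check that every node is in subset or adjacent to a member of it."""
    return all(n in subset or adjacency[n] & subset for n in adjacency)


# Hypothetical covisibility graph: images 0..5, edges link images sharing 3D points.
graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3},
    5: {3},
}
kept = greedy_dominating_set(graph)
```

In this toy graph the heuristic keeps two of the six images, which mirrors the kind of reduction in reference images that the abstract reports, although the real reduction factors depend on the dataset.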
[CV-54] Intersectional Bias in Pre-Trained Image Recognition Models
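Quick read: This paper addresses the encoded biases that pre-trained deep learning models may inherit and amplify in image classification, in particular intersectional biases across the sensitive variables age, race, and gender in representations of facial images. The key to the approach is to use linear classifier probes and topographic-map visualizations of activations to assess representational bias in ImageNet classifiers with respect to these variables, revealing how well the models distinguish demographic attributes and the unfairness this may imply.
Link: https://arxiv.org/abs/2506.03664
Authors: Valerie Krug,Sebastian Stober
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Summary paper accepted at the 3rd TRR 318 Conference: Contextualizing Explanations 2025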
Abstract:Deep Learning models have achieved remarkable success. Training them is often accelerated by building on top of pre-trained models which poses the risk of perpetuating encoded biases. Here, we investigate biases in the representations of commonly used ImageNet classifiers for facial images while considering intersections of sensitive variables age, race and gender. To assess the biases, we use linear classifier probes and visualize activations as topographic maps. We find that representations in ImageNet classifiers particularly allow differentiation between ages. Less strongly pronounced, the models appear to associate certain ethnicities and distinguish genders in middle-aged groups.
[CV-55] Zero-Shot Temporal Interaction Localization for Egocentric Videos
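Quick read: This paper aims to temporally localize human-object interaction (HOI) actions in egocentric videos. Conventional methods rely on annotated action and object categories, which leads to domain bias and low deployment efficiency, while existing zero-shot temporal action localization (ZS-TAL) methods are limited for temporal interaction localization (TIL) by coarse-grained estimates and open-loop pipelines. The key to the solution is a new method called EgoLoc, whose core is a self-adaptive sampling strategy that generates reasonable visual prompts for vision-language model (VLM) reasoning. It combines 2D and 3D observations to directly sample high-quality initial guesses, and uses visual and dynamic cues to generate closed-loop feedback that further refines the localization results.
Link: https://arxiv.org/abs/2506.03662
Authors: Erhang Zhang,Junyi Ma,Yin-Dong Zheng,Yixuan Zhou,Hesheng Wang
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: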
Abstract:Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at this https URL.
[CV-56] INP-Former: Advancing Universal Anomaly Detection via Intrinsic Normal Prototypes and Residual Learning
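Quick read: This paper aims to address the alignment difficulty in anomaly detection (AD) caused by differences in appearance and position between test images and normal reference images from the training set, which limits detection accuracy. The key to the solution is INP-Former, which extracts Intrinsic Normal Prototypes (INPs) directly from the test image rather than relying on external normal samples from the training set. INP-Former uses an INP Extractor that linearly combines normal tokens to represent INPs, and introduces an INP Coherence Loss to ensure that the INPs faithfully represent the normality of the test image. An INP-guided Decoder then reconstructs only normal tokens, with the reconstruction error serving as the anomaly score. The method also introduces a Soft Mining Loss to prioritize hard samples, and significantly improves performance on single-class, multi-class, few-shot, and zero-shot AD tasks.
Link: https://arxiv.org/abs/2506.03660
Authors: Wei Luo,Haiming Yao,Yunkang Cao,Qiyu Chen,Ang Gao,Weiming Shen,Weihang Zhang,Wenyong Yu
Affiliations: Tsinghua University; Huazhong University of Science and Technology; Hunan University School of Robotics; Beijing Institute of Technology; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 11 figures, 13 tables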
Abstract:Anomaly detection (AD) is essential for industrial inspection and medical diagnosis, yet existing methods typically rely on “comparing” test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Furthermore, we propose a soft version of the INP Coherence Loss and enhance INP-Former by incorporating residual learning, leading to the development of INP-Former++. The proposed method significantly improves detection performance across single-class, multi-class, semi-supervised, few-shot, and zero-shot settings.
[CV-57] MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection
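Quick read: This paper aims to balance accuracy and efficiency in real-time object detection under limited computational resources. Its key solution is the MambaNeXt-YOLO framework, built on three core contributions: (1) the MambaNeXt Block, which combines convolutional neural networks (CNNs) with Mamba to capture both local features and long-range dependencies; (2) the Multi-branch Asymmetric Fusion Pyramid Network (MAFPN), which improves detection of objects at different scales; and (3) an efficiency-oriented design for edge devices, with which the model reaches 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.
Link: https://arxiv.org/abs/2506.03654
Authors: Xiaochun Lei,Siqi Wu,Weilin Wu,Zetao Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: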
Abstract:Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.
[CV-58] EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation
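Quick read: This paper addresses the difficulty of generating emotionally expressive and abstract artistic images, which stems mainly from the lack of large-scale, fine-grained emotion-annotated datasets. The key to the solution is the EmoArt dataset, one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles, with structured annotations covering objective scene descriptions, visual attributes, emotion categories, and potential art therapy effects, providing the data and benchmarks needed for emotion-driven image synthesis.
Link: https://arxiv.org/abs/2506.03652
Authors: Cheng Zhang,Hongxia Xie,Bin Wen,Songhan Zuo,Ruoxuan Zhang,Wen-huang Cheng
Affiliations: Jilin University; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: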
Abstract:With the rapid advancement of diffusion models, text-to-image generation has achieved significant progress in image resolution, detail fidelity, and semantic alignment, particularly with models like Stable Diffusion 3.5, Stable Diffusion XL, and FLUX 1. However, generating emotionally expressive and abstract artistic images remains a major challenge, largely due to the lack of large-scale, fine-grained emotional datasets. To address this gap, we present the EmoArt Dataset – one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles (e.g., Impressionism, Expressionism, Abstract Art), offering rich stylistic and cultural diversity. Each image includes structured annotations: objective scene descriptions, five key visual attributes (brushwork, composition, color, line, light), binary arousal-valence labels, twelve emotion categories, and potential art therapy effects. Using EmoArt, we systematically evaluate popular text-to-image diffusion models for their ability to generate emotionally aligned images from text. Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. The dataset and more details can be accessed via our project website.
[CV-59] YOND: Practical Blind Raw Image Denoising Free from Camera-Specific Data Dependency
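Quick read: This paper aims to address the performance drop in blind raw image denoising caused by camera-specific data dependency. Existing learning-based methods perform poorly on data from unknown cameras, mainly because camera characteristics differ between their training data and the target data. To address this, the authors propose YOND, whose core is three key modules that enable robust generalization to data from unknown cameras: coarse-to-fine noise estimation (CNE), an expectation-matched variance-stabilizing transform (EM-VST), and an SNR-guided denoiser (SNR-Net). Together, these modules let the model adapt to the noise characteristics of different cameras and provide controllable denoising.
Link: https://arxiv.org/abs/2506.03645
Authors: Hansen Feng,Lizhi Wang,Yiqi Huang,Tong Li,Lin Zhu,Hua Huang
Affiliations: Beijing Institute of Technology; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 17 pages, 19 figures, TPAMI under review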
Abstract:The rapid advancement of photography has created a growing demand for a practical blind raw image denoising method. Recently, learning-based methods have become mainstream due to their excellent performance. However, most existing learning-based methods suffer from camera-specific data dependency, resulting in performance drops when applied to data from unknown cameras. To address this challenge, we introduce a novel blind raw image denoising method named YOND, which represents You Only Need a Denoiser. Trained solely on synthetic data, YOND can generalize robustly to noisy raw images captured by diverse unknown cameras. Specifically, we propose three key modules to guarantee the practicality of YOND: coarse-to-fine noise estimation (CNE), expectation-matched variance-stabilizing transform (EM-VST), and SNR-guided denoiser (SNR-Net). Firstly, we propose CNE to identify the camera noise characteristic, refining the estimated noise parameters based on the coarse denoised image. Secondly, we propose EM-VST to eliminate camera-specific data dependency, correcting the bias expectation of VST according to the noisy image. Finally, we propose SNR-Net to offer controllable raw image denoising, supporting adaptive adjustments and manual fine-tuning. Extensive experiments on unknown cameras, along with flexible solutions for challenging cases, demonstrate the superior practicality of our method. The source code will be publicly available at the project homepage.
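As background on what a variance-stabilizing transform does, the sketch below implements the classical generalized Anscombe transform for Poisson-Gaussian noise, which maps signal-dependent noise to approximately unit variance. This is the textbook transform, not the expectation-matched variant (EM-VST) proposed in the paper, whose details are not given in the abstract. The function name and the example noise parameters are assumptions for illustration.

```python
import math


def generalized_anscombe(x, gain, sigma):
    """Classical generalized Anscombe transform for Poisson-Gaussian noise.

    Assumes the model x = gain * p + n, with p Poisson-distributed and
    n zero-mean Gaussian with standard deviation sigma.
    """
    inside = gain * x + 0.375 * gain ** 2 + sigma ** 2
    # Clamp at zero so negative readings caused by read noise stay valid.
    return (2.0 / gain) * math.sqrt(max(inside, 0.0))


# Hypothetical noise parameters, e.g. as produced by a noise-estimation step.
stabilized = [generalized_anscombe(v, gain=2.0, sigma=1.5) for v in (0.0, 10.0, 100.0)]
```

Because the transform depends only on the estimated `gain` and `sigma`, a denoiser that operates in the stabilized domain does not need to be retrained for each camera, which is the general motivation for pairing noise estimation with a VST.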
[CV-60] Images are Worth Variable Length of Representations
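Quick read: This paper addresses the inefficiency of conventional vision encoders that map every image to a fixed-length token sequence, even though images carry different amounts of information and complex images should receive more tokens to preserve reconstruction quality. The key to the solution is DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) according to the information content of each image. It significantly reduces the average number of tokens while maintaining high reconstruction quality, and outperforms existing autoencoder-based tokenization methods on several linear probing and downstream multimodal tasks.
Link: https://arxiv.org/abs/2506.03643
Authors: Lingjun Mao,Rodolfo Corona,Xin Liang,Wenhao Yan,Zineng Tang
Affiliations: University of California, San Diego; University of California, Berkeley; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: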
Abstract:Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at this https URL.
[CV-61] Spatial Understanding from Videos: Structured Prompts Meet Simulation Data
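Quick read: This paper aims to address the limited 3D spatial reasoning capability of pre-trained vision-language models (VLMs), in particular the problems caused by spatial uncertainty and data scarcity. The key to the solution is a unified framework that combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a large-scale question-answering dataset generated from diverse 3D simulation scenes through an automated construction process. This improves the models' 3D spatial reasoning without modifying their architecture.
Link: https://arxiv.org/abs/2506.03642
Authors: Haoyu Zhang,Meng Liu,Zaijing Li,Haokun Wen,Weili Guan,Yaowei Wang,Liqiang Nie
Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: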
Abstract:Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.
[CV-62] FingerVeinSyn-5M: A Million-Scale Dataset and Benchmark for Finger Vein Recognition
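Quick read: This paper addresses the lack of large-scale public datasets for finger vein recognition. Existing datasets contain few identities and too few samples per finger, which limits the development of deep learning-based methods. The key to the solution is FVeinSyn, a synthetic generator that produces diverse finger vein patterns with rich intra-class variation, which is used to create FingerVeinSyn-5M, the largest finger vein dataset to date. It contains 5 million samples from 50,000 unique fingers, each with 100 variations such as shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. The dataset is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field.
Link: https://arxiv.org/abs/2506.03635
Authors: Yinfan Wang,Jie Gui,Baosheng Yu,Qi Li,Zhenan Sun,Juho Kannala,Guoying Zhao
Affiliations: Southeast University; Nanyang Technological University; Chinese Academy of Sciences; Aalto University; University of Oulu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: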
Abstract:A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created FingerVeinSyn-5M – the largest available finger vein dataset – containing 5 million samples from 50,000 unique fingers, each with 100 variations including shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. FingerVeinSyn-5M is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field. Models pretrained on FingerVeinSyn-5M and fine-tuned with minimal real data achieve an average 53.91% performance gain across multiple benchmarks. The dataset is publicly available at: this https URL.
[CV-63] Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation
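Quick read: This paper addresses insufficient subject fidelity in zero-shot subject-driven generation, where a generative model without subject-specific supervision struggles to preserve the features of the input subject. The key to the solution is Subject Fidelity Optimization (SFO), a novel comparative learning framework that introduces synthetic negative samples and uses pairwise comparison to explicitly guide the model to favor positives over negatives, improving subject fidelity. SFO also reweights the diffusion timesteps to focus on the intermediate steps where subject details emerge, further improving generation quality.
Link: https://arxiv.org/abs/2506.03621
Authors: Chaehun Shin,Jooyoung Choi,Johan Barthelemy,Jungbeom Lee,Sungroh Yoon
Affiliations: Seoul National University; NVIDIA; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: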
Abstract:We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Beyond supervised fine-tuning methods that rely only on positive targets and use the diffusion loss as in the pre-training stage, SFO introduces synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically generates distinctive and informative negatives by intentionally degrading visual and textual cues without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus finetuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: this https URL
[CV-64] Isharah: A Large-Scale Multi-Scene Dataset for Continuous Sign Language Recognition
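Quick read: This paper addresses several problems in current sign language recognition (SLR) research, in particular the scarcity of datasets for continuous sign language recognition (CSLR) and their limitation to controlled environments. Existing CSLR datasets are mostly collected under controlled conditions, which limits their applicability to real-world scenarios. To address this, the paper presents Isharah, a large-scale multi-scene CSLR dataset. It is the first large dataset collected in an unconstrained environment using signers' smartphone cameras, with wide variation in recording settings, camera distances, angles, and resolutions, which helps models adapt to complex real-world scenes. The dataset contains 30,000 video clips performed by 18 deaf and professional signers and provides gloss-level annotations for every video, making it suitable for developing CSLR and sign language translation (SLT) systems.
Link: https://arxiv.org/abs/2506.03615
Authors: Sarah Alyami,Hamzah Luqman,Sadam Al-Azani,Maad Alowaifeer,Yazeed Alharbi,Yaser Alonaizan
Affiliations: King Fahd University of Petroleum & Minerals; SDAIA-KFUPM Joint Research Center for Artificial Intelligence; National Center for Artificial Intelligence, SDAIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: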
Abstract:Current benchmarks for sign language recognition (SLR) focus mainly on isolated SLR, while there are limited datasets for continuous SLR (CSLR), which recognizes sequences of signs in a video. Additionally, existing CSLR datasets are collected in controlled settings, which restricts their effectiveness in building robust real-world CSLR systems. To address these limitations, we present Isharah, a large multi-scene dataset for CSLR. It is the first dataset of its type and size that has been collected in an unconstrained environment using signers’ smartphone cameras. This setup resulted in high variations of recording settings, camera distances, angles, and resolutions. This variation helps with developing sign language understanding models capable of handling the variability and complexity of real-world scenarios. The dataset consists of 30,000 video clips performed by 18 deaf and professional signers. Additionally, the dataset is linguistically rich as it provides a gloss-level annotation for all dataset’s videos, making it useful for developing CSLR and sign language translation (SLT) systems. This paper also introduces multiple sign language understanding benchmarks, including signer-independent and unseen-sentence CSLR, along with gloss-based and gloss-free SLT. The Isharah dataset is available on this https URL.
[CV-65] PDSE: A Multiple Lesion Detector for CT Images using PANet and Deformable Squeeze-and-Excitation Block
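Quick read: This paper aims to address the challenging problem of detecting lesions in computed tomography (CT) images, given the diversity of lesion types, sizes, and locations. The key to the solution is PDSE, a one-stage lesion detection framework that redesigns RetinaNet to improve the accuracy and efficiency of lesion detection in multimodal CT images. The method enhances the path aggregation flow by incorporating a low-level feature map, and improves model representation with an adaptive Squeeze-and-Excitation (SE) block and channel feature map attention, achieving state-of-the-art detection performance.
Link: https://arxiv.org/abs/2506.03608
Authors: Di Fan,Heng Yu,Zhiyuan Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MIUA 2024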
Abstract:Detecting lesions in Computed Tomography (CT) scans is a challenging task in medical image processing due to the diverse types, sizes, and locations of lesions. Recently, various one-stage and two-stage framework networks have been developed to focus on lesion localization. We introduce a one-stage lesion detection framework, PDSE, by redesigning RetinaNet to achieve higher accuracy and efficiency for detecting lesions in multimodal CT images. Specifically, we enhance the path aggregation flow by incorporating a low-level feature map. Additionally, to improve model representation, we utilize the adaptive Squeeze-and-Excitation (SE) block and integrate channel feature map attention. This approach achieves new state-of-the-art performance. Our method significantly improves the detection of small and multi-scale objects. When evaluated against other advanced algorithms on the public DeepLesion benchmark, our algorithm achieved an mAP of over 0.20.
[CV-66] Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI
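Quick read: This paper aims to resolve the tension between limited computational resources and real-time response requirements when deploying Transformer-based image captioning models on resource-constrained edge devices. The key to the solution is to evaluate lightweight Transformer models and apply knowledge distillation, accelerating inference while maintaining model performance and thereby enabling efficient, real-time edge AI inference.
Link: https://arxiv.org/abs/2506.03607
Authors: Wing Man Casca Kwok,Yip Chiu Tung,Kunal Bhagchandani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: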
Abstract:Edge computing decentralizes processing power to the network edge, enabling real-time AI-driven decision-making in IoT applications. In industrial automation such as robotics and rugged edge AI, real-time perception and intelligence are critical for autonomous operations. Deploying transformer-based image captioning models at the edge can enhance machine perception, improve scene understanding for autonomous robots, and aid in industrial inspection. However, these edge or IoT devices are often constrained in computational resources for physical agility, yet they have strict response time requirements. Traditional deep learning models can be too large and computationally demanding for these devices. In this research, we present findings of transformer-based models for image captioning that operate effectively on edge devices. By evaluating resource-effective transformer models and applying knowledge distillation techniques, we demonstrate inference can be accelerated on resource-constrained devices while maintaining model performance using these techniques.
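For context on the knowledge distillation mentioned above, the following sketch computes the widely used soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is the generic formulation and is only an assumption about what such a pipeline might use; the abstract does not say which distillation objective the authors apply, and the function names here are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In a captioning setting this loss would typically be evaluated per output token over the vocabulary and combined with the usual cross-entropy on ground-truth captions.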
[CV-67] Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision CVPR2025
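Quick read: This paper addresses the problem of learning to use tools or objects in common scenes, in particular manipulating them in various ways as instructed, which is a key challenge for developing interactive robots. The key to the solution is to use the large-scale egocentric and exocentric video dataset Exo-Ego4D to extract diverse manipulation trajectories at scale. Combining these trajectories with their associated textual action descriptions, the authors develop trajectory generation models based on visual and point cloud-based language models and validate them on the HOT3D dataset, establishing a training dataset and baseline models for the new task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision.
Link: https://arxiv.org/abs/2506.03605
Authors: Tomoya Yoshida,Shuhei Kurita,Taichi Nishimura,Shinsuke Mori
Affiliations: Kyoto University; National Institute of Informatics; Sony Interactive Entertainment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025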
Abstract:Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for various objects, which is nearly unfeasible to gather at scale. In this paper, we propose a framework that leverages large-scale ego- and exo-centric video datasets – constructed globally with substantial effort – of Exo-Ego4D to extract diverse manipulation trajectories at scale. From these extracted trajectories with the associated textual action description, we develop trajectory generation models based on visual and point cloud-based language models. In the recently proposed egocentric vision-based in-a-quality trajectory dataset of HOT3D, we confirmed that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision.
[CV-68] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
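[Quick Read]: This paper aims to bridge the semantic gap between input text prompts and target images in controllable image generation, especially when the prompts are semantically sparse and existing methods over-rely on low-level control signals to infer regional details. The key to the solution is ControlThinker, a framework that follows a "comprehend-then-generate" paradigm: it incentivizes the visual reasoning capability of a multimodal large language model (MLLM) to mine latent semantics from control images and enrich the text prompts, improving semantic consistency and visual quality without complex additional modifications. To handle the uncertainty caused by ambiguous control images, the method encourages broader exploration of reasoning trajectories and selects the best one with a metric-based output reward model (ORM).
Link: https://arxiv.org/abs/2506.03596
Authors: Feng Han, Yang Jiao, Shaoxiang Chen, Junhao Xu, Jingjing Chen, Yu-Gang Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: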
Abstract:The field of controllable image generation has seen significant advancements, with various architectures improving generation layout consistency with control signals. However, contemporary methods still face challenges in bridging the semantic gap between input text prompts with sparse semantics and the target images, often over-relying on low-level control signals to infer regional details. To address this challenge, we propose ControlThinker, a novel framework that employs a “comprehend-then-generate” paradigm. Firstly, by incentivizing the visual reasoning capability of a MLLM, latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications. To further tackle the uncertainty arising from the ambiguity of control images, we encourage broader exploration of reasoning trajectories and select the optimal one using a metric-based output reward model (ORM). Extensive experimental results demonstrate that ControlThinker effectively mitigates the semantic gap between raw text prompts and target images, resulting in improved visual quality and semantic consistency across a wide range of benchmarks. The code and models are available at this https URL.
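The "explore several reasoning trajectories, then pick the best with a reward model" step can be pictured as best-of-N selection. The sketch below is only a toy: `keyword_reward` is a hypothetical stand-in for the paper's metric-based ORM, and the candidate prompts and keywords are invented for illustration.

```python
def select_best_trajectory(candidates, reward_fn):
    """Return the candidate with the highest reward (best-of-N selection)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)

def keyword_reward(prompt, keywords=("wooden", "table", "sunlight")):
    """Toy stand-in for a reward model: counts hypothetical keywords present."""
    words = set(prompt.lower().split())
    return sum(1 for k in keywords if k in words)

# Hypothetical enriched prompts produced by different reasoning trajectories.
candidates = [
    "a table",
    "a wooden table near a window",
    "a wooden table in warm sunlight",
]
best = select_best_trajectory(candidates, keyword_reward)
print(best)
```

In the paper's setting the reward would come from image-level metrics rather than keyword counts; only the selection pattern carries over.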
[CV-69] SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
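[Quick Read]: This paper aims to reconstruct the articulated objects that are common in everyday environments, where existing methods are limited in scalability (requiring 3D supervision or costly annotations), robustness (being prone to local optima), and rendering (lacking speed or photorealism). The key to the solution is SplArt, a framework that augments each Gaussian with a differentiable mobility parameter and uses a multi-stage optimization strategy to achieve refined part segmentation and kinematics inference. Through geometric self-supervision it handles challenging scenarios without 3D annotations or category-specific priors, enabling real-time photorealistic rendering of novel viewpoints and articulation states.
Link: https://arxiv.org/abs/2506.03594
Authors: Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter
Affiliations: Toyota Technological Institute at Chicago; Toyota Research Institute
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
Comments: this https URL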
Abstract:Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt’s state-of-the-art performance and real-world practicality. Code is publicly available at this https URL.
[CV-70] Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts
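[Quick Read]: This paper aims to resolve the suboptimal trade-offs and task interference in unified multimodal large language models (MLLMs) caused by the intrinsic conflict between the high-level semantic abstraction needed for understanding and the fine-grained detail preservation needed for generation. The key to the solution is UTAMoE, a framework that decouples the internal autoregressive (AR) modules through a Task-Aware Mixture-of-Experts (MoE) layer, creating task-specific optimization subpaths that enhance task differentiation while maintaining overall coordination.
Link: https://arxiv.org/abs/2506.03591
Authors: Jiaxing Zhang, Xinyi Zeng, Hao Tang
Affiliations: Sichuan University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: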
Abstract:Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of AR to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.
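To make the idea of task-specific subpaths concrete, here is a minimal sketch of a layer that sends hidden states to a different expert depending on the task. Hard routing by task name, the class name `TaskAwareMoELayer`, and the scaling experts are all assumptions; the abstract does not describe how UTAMoE's layer actually routes or what its experts compute.

```python
class TaskAwareMoELayer:
    """Routes a hidden vector to the expert registered for the given task."""

    def __init__(self, experts):
        # experts: mapping from task name to a callable on a list of floats
        self.experts = dict(experts)

    def __call__(self, hidden, task):
        if task not in self.experts:
            raise KeyError(f"no expert registered for task {task!r}")
        return self.experts[task](hidden)

def scale_expert(factor):
    """Hypothetical expert: scales every hidden unit by a constant."""
    def expert(hidden):
        return [factor * h for h in hidden]
    return expert

layer = TaskAwareMoELayer({
    "understanding": scale_expert(0.5),
    "generation": scale_expert(2.0),
})
hidden = [1.0, -2.0, 4.0]
print(layer(hidden, "understanding"))
print(layer(hidden, "generation"))
```

Because each task only ever touches its own expert, gradients from understanding and generation would not overwrite each other inside this layer, which is the intuition behind decoupling.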
[CV-71] A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark
[Quick Read]: This paper addresses the shortcomings of existing datasets for Referring Remote Sensing Image Segmentation (RRSIS) in resolution, scene diversity, and category coverage, which limit the generalization and real-world applicability of referring segmentation models. The key to the solution is the NWPU-Refer dataset together with the Multi-scale Referring Segmentation Network (MRSNet). MRSNet introduces an Intra-scale Feature Interaction Module (IFIM) and a Hierarchical Feature Interaction Module (HFIM) to capture fine-grained features and fuse features across scales, improving segmentation performance.
Link: https://arxiv.org/abs/2506.03583
Authors: Zhigang Yang, Huiguang Yao, Linmao Tian, Xuezhi Zhao, Qiang Li, Qi Wang
Affiliations: Northwestern Polytechnical University; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University; School of Computer Science, Northwestern Polytechnical University; Department of Software, Northwest Institute of Nuclear Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Referring Remote Sensing Image Segmentation is a complex and challenging task that integrates the paradigms of computer vision and natural language processing. Existing datasets for RRSIS suffer from critical limitations in resolution, scene diversity, and category coverage, which hinder the generalization and real-world applicability of referring segmentation models. To facilitate the development of this field, we introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets supporting single-object, multi-object, and non-object segmentation scenarios. Additionally, we propose the Multi-scale Referring Segmentation Network (MRSNet), a novel framework tailored for the unique demands of RRSIS. MRSNet introduces two key innovations: (1) an Intra-scale Feature Interaction Module (IFIM) that captures fine-grained details within each encoder stage, and (2) a Hierarchical Feature Interaction Module (HFIM) to enable seamless cross-scale feature fusion, preserving spatial integrity while enhancing discriminative power. Extensive experiments conducted on the proposed NWPU-Refer dataset demonstrate that MRSNet achieves state-of-the-art performance across multiple evaluation metrics, validating its effectiveness. The dataset and code are publicly available at this https URL.
[CV-72] ViTSGMM: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels
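[Quick Read]: This paper addresses the limited generalization of image recognition models when only a very small amount of labeled data is available (semi-supervised learning). Existing methods typically rely on complex training techniques and architectures, yet their performance with extremely limited labels still leaves room for improvement. The key to the solution is a hierarchical mixture density classification decision mechanism that optimizes the mutual information between feature representations and target classes, compressing redundant information while retaining the crucial discriminative components, which improves performance with few labeled samples.
Link: https://arxiv.org/abs/2506.03582
Authors: Rui Yann, Xianglei Xing
Affiliations: Harbin Engineering University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: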
Abstract:We present ViTSGMM, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, while their generalization ability when dealing with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification decision mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on STL-10 and CIFAR-10/100 datasets when using negligible labeled samples. Notably, this paper also reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning tasks and removes duplicates to ensure the reliability of experimental results. Code available at this https URL.
[CV-73] DiagNet: Detecting Objects using Diagonal Constraints on Adjacency Matrix of Graph Neural Network
[Quick Read]: This paper addresses the design complexity and performance limitations that come from relying on predefined anchor boxes in conventional object detection. The key to the solution is DiagNet, which imposes diagonal constraints on the adjacency matrix of a graph convolutional network (GCN). Using two diagonalization algorithms based on hard and soft constraints, together with loss functions that combine a diagonal constraint and a complementary constraint, it detects objects without predefined anchor boxes.
Link: https://arxiv.org/abs/2506.03571
Authors: Chong Hyun Lee, Kibae Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose DiagNet, a new approach to object detection with which we can detect an object bounding box using diagonal constraints on the adjacency matrix of a graph convolutional network (GCN). We propose two diagonalization algorithms based on hard and soft constraints on the adjacency matrix and two loss functions using a diagonal constraint and a complementary constraint. DiagNet eliminates the need for designing the set of anchor boxes commonly used. To prove the feasibility of our novel detector, we adopt the detection head in YOLO models. Experiments show that DiagNet achieves 7.5% higher mAP50 on Pascal VOC than YOLOv1. DiagNet also shows 5.1% higher mAP on MS COCO than YOLOv3u, 3.7% higher mAP than YOLOv5u, and 2.9% higher mAP than YOLOv8.
[CV-74] WIFE-Fusion: Wavelet-aware Intra-inter Frequency Enhancement for Multi-model Image Fusion
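[Quick Read]: This paper addresses the neglect of frequency-domain feature exploration and cross-modal interactive relationships in multimodal image fusion. The key to the solution is Wavelet-aware Intra-inter Frequency Enhancement Fusion (WIFE-Fusion), a multimodal image fusion framework built on interactions between frequency-domain components. By introducing cross-modal frequency-domain feature extraction and interaction, it achieves more precise source feature extraction and unified modeling of feature extraction and aggregation. Specifically, Intra-Frequency Self-Attention (IFSA) uses an interactive self-attention mechanism to exploit cross-modal correlations and complementarity and extract enriched frequency-domain features, while Inter-Frequency Interaction (IFI) enhances features and filters latent features through combinatorial interactions between heterogeneous frequency-domain components.
Link: https://arxiv.org/abs/2506.03555
Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: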
Abstract:Multimodal image fusion effectively aggregates information from diverse modalities, with fused images playing a crucial role in vision systems. However, existing methods often neglect frequency-domain feature exploration and interactive relationships. In this paper, we propose wavelet-aware Intra-inter Frequency Enhancement Fusion (WIFE-Fusion), a multimodal image fusion framework based on frequency-domain components interactions. Its core innovations include: Intra-Frequency Self-Attention (IFSA) that leverages inherent cross-modal correlations and complementarity through interactive self-attention mechanisms to extract enriched frequency-domain features, and Inter-Frequency Interaction (IFI) that enhances enriched features and filters latent features via combinatorial interactions between heterogeneous frequency-domain components across modalities. These processes achieve precise source feature extraction and unified modeling of feature extraction-aggregation. Extensive experiments on five datasets across three multimodal fusion tasks demonstrate WIFE-Fusion’s superiority over current specialized and unified fusion methods. Our code is available at this https URL.
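For readers unfamiliar with wavelet-based frequency components, the sketch below splits a 1D signal into low- and high-frequency parts with a one-level Haar transform and reconstructs it. The choice of the Haar wavelet and the 1D setting are assumptions for illustration; the abstract does not state which wavelet WIFE-Fusion uses, and images would need the 2D version.

```python
import math

def haar_dwt(signal):
    """One-level Haar transform: returns (low, high) frequency components."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: interleaves reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out

row = [4.0, 2.0, 5.0, 5.0]  # hypothetical pixel row
low, high = haar_dwt(row)
print(low, high)
print(haar_idwt(low, high))
```

The low band holds local averages (smooth structure) and the high band holds local differences (edges and texture), which is the kind of separation a frequency-aware fusion model can treat differently per band.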
[CV-75] Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting
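[Quick Read]: This paper addresses the challenges of 3D reconstruction from in-the-wild images, in particular the low-quality training data caused by inconsistent lighting and transient distractors. Existing methods handle these issues with heuristic strategies but often fail to produce stable, consistent reconstructions and tend to introduce visual artifacts. The key to the solution is the Asymmetric Dual 3DGS framework, which trains two 3D Gaussian Splatting (3DGS) models in parallel under a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To keep the two models from falling into similar failure modes, a divergent masking strategy combines a multi-cue adaptive mask with a self-supervised soft mask, making training asymmetric and reducing shared error modes. A lightweight variant, Dynamic EMA Proxy, is also introduced to improve training efficiency.
Link: https://arxiv.org/abs/2506.03538
Authors: Chengqi Li, Zhihao Shi, Yangdi Lu, Wenbo He, Xiangyu Xu
Affiliations: McMaster University; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: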
Abstract:3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts. In this work, we propose Asymmetric Dual 3DGS, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. Codes and trained models will be released.
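The Exponential Moving Average proxy that replaces the second model can be sketched with the standard EMA update rule. The fixed decay value and the flat list of parameters are assumptions; the abstract calls the proxy "dynamically updated" without giving the schedule.

```python
def ema_update(proxy_params, model_params, decay=0.99):
    """Standard EMA step: proxy <- decay * proxy + (1 - decay) * model."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return [decay * p + (1.0 - decay) * m
            for p, m in zip(proxy_params, model_params)]

# Hypothetical parameters: the proxy slowly tracks the trained model.
proxy = [0.0, 0.0]
model = [1.0, -2.0]
for _ in range(3):
    proxy = ema_update(proxy, model, decay=0.9)
print(proxy)
```

Because the proxy is a smoothed copy of the trained model, it costs no extra gradient computation, which is consistent with the efficiency motivation stated in the abstract.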
[CV-76] How Far Are We from Predicting Missing Modalities with Foundation Models?
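[Quick Read]: This paper addresses the poor performance of multimodal foundation models on missing modality prediction, in particular their clear weaknesses in fine-grained semantic extraction and in robust validation of generated modalities, which lead to suboptimal or even misaligned predictions. The key to the solution is an agentic framework that dynamically formulates modality-aware mining strategies based on the input context to extract richer and more discriminative semantic features, together with a self-refinement mechanism that iteratively verifies and improves the quality of generated modalities through internal feedback.
Link: https://arxiv.org/abs/2506.03530
Authors: Guanzhou Ke, Yi Xie, Xiaoli Wang, Guoqing Chao, Bo Wang, Shengfeng He
Affiliations: Unknown
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: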
Abstract:Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality prediction remains underexplored. To investigate this, we categorize existing approaches into three representative paradigms, encompassing a total of 42 model variants, and conduct a comprehensive evaluation in terms of prediction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned predictions. To address these challenges, we propose an agentic framework tailored for missing modality prediction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.
[CV-77] Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation AAAI2025
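[Quick Read]: This paper addresses Universal Domain Adaptation (UniDA), that is, how to effectively transfer source-domain knowledge to the target domain under both domain shift and unknown category shift. The core challenge is identifying common-class samples and aligning them. The key to the solution is to use vision-language models to search for semantic centers in a semantically meaningful, discrete text representation space, yielding centers with little domain bias and appropriate semantic granularity and enabling a simple and robust adaptation algorithm. Specifically, the paper proposes TArget Semantics Clustering (TASC) via text representations, which uses information maximization as a unified objective and a two-stage strategy to search for target semantics and then refine the encoders.
Link: https://arxiv.org/abs/2506.03521
Authors: Weinan He, Zilei Wang, Yixin Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Camera-ready version for AAAI 2025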
Abstract:Universal Domain Adaptation (UniDA) focuses on transferring source domain knowledge to the target domain under both domain shift and unknown category shift. Its main challenge lies in identifying common class samples and aligning them. Current methods typically obtain target domain semantics centers from an unconstrained continuous image representation space. Due to domain shift and the unknown number of clusters, these centers often result in complex and less robust alignment algorithms. In this paper, based on vision-language models, we search for semantic centers in a semantically meaningful and discrete text representation space. The constrained space ensures almost no domain bias and appropriate semantic granularity for these centers, enabling a simple and robust adaptation algorithm. Specifically, we propose TArget Semantics Clustering (TASC) via Text Representations, which leverages information maximization as a unified objective and involves two stages. First, with the frozen encoders, a greedy search-based framework is used to search for an optimal set of text embeddings to represent target semantics. Second, with the search results fixed, encoders are refined based on gradient descent, simultaneously achieving robust domain alignment and private class clustering. Additionally, we propose Universal Maximum Similarity (UniMS), a scoring function tailored for detecting open-set samples in UniDA. Experimentally, we evaluate the universality of UniDA algorithms under four category shift scenarios. Extensive experiments on four benchmarks demonstrate the effectiveness and robustness of our method, which has achieved state-of-the-art performance.
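The information maximization objective named in this abstract is commonly written as the entropy of the average prediction minus the average per-sample entropy. The sketch below computes that common form; whether TASC uses exactly this formulation is an assumption, and the probability tables are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_maximization(predictions):
    """H(mean prediction) - mean H(prediction); higher is better."""
    n = len(predictions)
    k = len(predictions[0])
    marginal = [sum(row[j] for row in predictions) / n for j in range(k)]
    conditional = sum(entropy(row) for row in predictions) / n
    return entropy(marginal) - conditional

# Hypothetical soft assignments of two target samples to two text centers.
confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(information_maximization(confident_balanced))
print(information_maximization(uncertain))
```

The objective is highest when each sample is assigned confidently and the assignments are spread evenly over the centers, which discourages both ambiguous predictions and collapse onto a single center.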
[CV-78] DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
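[Quick Read]: This paper addresses the limitations of Direct Preference Optimization (DPO) for post-training text-to-video diffusion models, in particular the motion bias that arises because human annotators tend to prefer low-motion clips. The key to the solution is DenseDPO, which builds each video pair by denoising corrupted copies of a ground-truth video, so the pairs are aligned and share similar motion structure while differing in local details, which removes the motion bias. It then exploits this temporal alignment to label preferences on short segments, yielding a denser and more precise learning signal. DenseDPO also enables automatic preference annotation with off-the-shelf Vision Language Models (VLMs), substantially reducing the reliance on human labeling.
Link: https://arxiv.org/abs/2506.03517
Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
Affiliations: Snap Research; University of Toronto; Vector Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL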
Abstract:Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
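Segment-level preference labels can be turned into a training signal with a per-segment logistic (Bradley-Terry style) loss, as in the sketch below. This is only an assumed simplification: `margins` stands for a hypothetical per-segment implicit reward difference between the two videos, labels use +1, -1, and 0 (no preference), and the paper's actual diffusion-based objective is not given in the abstract.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dense_preference_loss(margins, labels, beta=1.0):
    """Mean of -log sigmoid(beta * label * margin) over labeled segments.

    margins: per-segment reward difference (video A minus video B).
    labels:  +1 if A is preferred, -1 if B is preferred, 0 to skip a segment.
    """
    terms = [-log_sigmoid(beta * y * m)
             for m, y in zip(margins, labels) if y != 0]
    return sum(terms) / len(terms) if terms else 0.0

# Hypothetical clip split into four segments; the third has no preference.
margins = [0.8, -0.3, 0.1, 1.2]
labels = [1, -1, 0, 1]
print(dense_preference_loss(margins, labels))
```

Compared with one label per clip, every labeled segment contributes its own term, which is one way to read the abstract's "denser and more precise learning signal".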
[CV-79] EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation
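[Quick Read]: This paper addresses the redundant computation and limited scalability to higher resolutions that existing event-based optical flow methods suffer from when they use cost volumes for pixel matching. The key to the solution is to exploit the complementarity between temporally dense feature differences of adjacent event frames and the cost volume. The proposed lightweight event-based optical flow network (EDCFlow) uses an attention-based multi-scale temporal feature difference layer to capture diverse motion patterns efficiently at high resolution, and adaptively fuses high-resolution difference motion features with low-resolution correlation motion features to improve motion representation and generalization.
Link: https://arxiv.org/abs/2506.03512
Authors: Daikun Liu, Lei Cheng, Teng Wang, Changyin Sun
Affiliations: Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures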
Abstract:Recent learning-based methods for event-based optical flow estimation utilize cost volumes for pixel matching but suffer from redundant computations and limited scalability to higher resolutions for flow refinement. In this work, we take advantage of the complementarity between temporally dense feature differences of adjacent event frames and cost volume and present a lightweight event-based optical flow network (EDCFlow) to achieve high-quality flow estimation at a higher resolution. Specifically, an attention-based multi-scale temporal feature difference layer is developed to capture diverse motion patterns at high resolution in a computation-efficient manner. An adaptive fusion of high-resolution difference motion features and low-resolution correlation motion features is performed to enhance motion representation and model generalization. Notably, EDCFlow can serve as a plug-and-play refinement module for RAFT-like event-based methods to enhance flow details. Extensive experiments demonstrate that EDCFlow achieves better performance with lower complexity compared to existing methods, offering superior generalization.
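The basic ingredient, differences between features of adjacent event frames, can be sketched in a few lines. The flat feature vectors below are hypothetical; the actual model works on learned feature maps at multiple scales with attention, none of which is reproduced here.

```python
def temporal_differences(frames):
    """Element-wise difference between each frame and its predecessor."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]

# Three hypothetical event-frame feature vectors (one per time bin).
frames = [
    [0, 1],
    [2, 1],
    [2, 4],
]
diffs = temporal_differences(frames)
print(diffs)
```

A sequence of T frames yields T - 1 difference maps, and each costs only a subtraction per element, in contrast to a cost volume that compares every pixel against many candidate matches.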
[CV-80] CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model
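[Quick Read]: This paper addresses the challenges time series diffusion models face in multi-scale feature alignment and in generative capability across different entities and long time scales. The key to the solution is the CHIME framework, which captures decomposed features of time series through multi-scale decomposition and adaptive integration, aligning the in-domain distributions of generated and original samples. It also introduces a feature hallucination module in the conditional denoising process, which transfers temporal features by training category-independent transformation layers.
Link: https://arxiv.org/abs/2506.03502
Authors: Yuxuan Chen, Haipeng Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: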
Abstract:The denoising diffusion probabilistic model has become a mainstream generative model, achieving significant success in various computer vision tasks. Recently, there has been initial exploration of applying diffusion models to time series tasks. However, existing studies still face challenges in multi-scale feature alignment and generative capabilities across different entities and long-time scales. In this paper, we propose CHIME, a conditional hallucination and integrated multi-scale enhancement framework for time series diffusion models. By employing multi-scale decomposition and adaptive integration, CHIME captures the decomposed features of time series, achieving in-domain distribution alignment between generated and original samples. In addition, we introduce a feature hallucination module in the conditional denoising process, enabling the transfer of temporal features through the training of category-independent transformation layers. Experimental results on publicly available real-world datasets demonstrate that CHIME achieves state-of-the-art performance and exhibits excellent generative generalization capabilities in few-shot scenarios.
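One simple way to picture multi-scale decomposition of a time series is to peel off detail at successively coarser moving-average scales, as sketched below. The moving-average scheme and the window sizes are assumptions; the abstract does not specify CHIME's decomposition, and its adaptive integration step is not modeled.

```python
def moving_average(series, window):
    """Centered moving average with a shrinking window at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def multiscale_decompose(series, windows=(3, 5)):
    """Returns detail components per scale plus the final coarse trend."""
    components = []
    current = list(series)
    for window in windows:
        smooth = moving_average(current, window)
        components.append([c - s for c, s in zip(current, smooth)])
        current = smooth
    components.append(current)  # coarsest trend
    return components

series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]  # hypothetical readings
parts = multiscale_decompose(series)
rebuilt = [sum(vals) for vals in zip(*parts)]
print(rebuilt)
```

The components sum back to the original series, so each scale can be modeled or aligned separately without losing information.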
[CV-81] Heterogeneous Skeleton-Based Action Representation Learning CVPR2025
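[Quick Read]: This paper addresses the challenge posed by heterogeneous skeleton data in skeleton-based human action recognition: skeletons from different sources differ in joint dimensions and topological structure, while prior work overlooks this heterogeneity and builds models only for homogeneous skeletons. The key to the solution is a framework with two main components, heterogeneous skeleton processing and unified representation learning. The former converts 2D skeleton data into 3D skeletons via an auxiliary network and builds a unified skeleton using skeleton-specific prompts; the latter learns a unified action representation for different heterogeneous skeletons with a shared backbone network.
Link: https://arxiv.org/abs/2506.03481
Authors: Hongsong Wang, Xiaoyan Ma, Jidong Kuang, Jie Gui
Affiliations: Southeast University; Ministry of Education; Purple Mountain Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in CVPR 2025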
Abstract:Skeleton-based human action recognition has received widespread attention in recent years due to its diverse range of application scenarios. Due to the different sources of human skeletons, skeleton data naturally exhibit heterogeneity. The previous works, however, overlook the heterogeneity of human skeletons and solely construct models tailored for homogeneous skeletons. This work addresses the challenge of heterogeneous skeleton-based action representation learning, specifically focusing on processing skeleton data that varies in joint dimensions and topological structures. The proposed framework comprises two primary components: heterogeneous skeleton processing and unified representation learning. The former first converts two-dimensional skeleton data into three-dimensional skeleton via an auxiliary network, and then constructs a prompted unified skeleton using skeleton-specific prompts. We also design an additional modality named semantic motion encoding to harness the semantic information within skeletons. The latter module learns a unified action representation using a shared backbone network that processes different heterogeneous skeletons. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD II datasets demonstrate the effectiveness of our method in various tasks of action understanding. Our approach can be applied to action recognition in robots with different humanoid structures.
[CV-82] Facial Appearance Capture at Home with Patch-Level Reflectance Prior SIGGRAPH
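[Quick Read]: This paper addresses the quality of facial appearance reconstructed from smartphone-recorded video, which still lags far behind reconstructions based on studio recordings. The key to the solution is a capture setup with a co-located smartphone and flashlight in a dim room, combined with solving for facial reflectance maps within the data distribution of studio scans. Specifically, a diffusion prior is first learned on Light Stage scans and then steered to produce the reflectance map that best matches the captured images. To improve generalization and training stability, the diffusion prior is trained at the patch level, and a patch-level posterior sampling technique is proposed to sample seamless full-resolution reflectance maps from this model.
Link: https://arxiv.org/abs/2506.03478
Authors: Yuxuan Han, Junfeng Lyu, Kuan Sheng, Minghao Que, Qixuan Zhang, Lan Xu, Feng Xu
Affiliations: Tsinghua University; ShanghaiTech University; Deemos Technology Co., Ltd.
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Transactions on Graphics (Proc. of SIGGRAPH), 2025. Code: this https URL Project Page: this https URL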
Abstract:Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at this https URL.
[CV-83] MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval
[Quick Read]: This paper addresses information redundancy in Partially Relevant Video Retrieval (PRVR) by improving the understanding of long-sequence video content. The key to the solution is MamFusion, a multi-Mamba module with temporal fusion framework. It leverages the Mamba module's strong long-term state space modeling capability and linear scalability, combined with a temporal fusion mechanism, to capture state-relatedness in video content and integrate it into text-video relevance understanding, improving retrieval performance.
Link: https://arxiv.org/abs/2506.03473
Authors: Xinru Ying, Jiaqi Mo, Jingyang Lin, Canghong Jin, Fangfang Wang, Lina Wei
Affiliations: Hangzhou City University; University of Wisconsin–Madison; Hangzhou Normal University; Zhejiang Provincial Engineering Research Center for Real-Time SmartTech in Urban Security Governance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Partially Relevant Video Retrieval (PRVR) is a challenging task in the domain of multimedia retrieval. It is designed to identify and retrieve untrimmed videos that are partially relevant to the provided query. In this work, we investigate long-sequence video content understanding to address information redundancy issues. Leveraging the outstanding long-term state space modeling capability and linear scalability of the Mamba module, we introduce a multi-Mamba module with temporal fusion framework (MamFusion) tailored for PRVR task. This framework effectively captures the state-relatedness in long-term video content and seamlessly integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments conducted on large-scale datasets demonstrate that MamFusion achieves state-of-the-art performance in retrieval effectiveness. Code is available at the link: this https URL.
[CV-84] RoNFA: Robust Neural Field-based Approach for Few-Shot Image Classification with Noisy Labels
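[Quick Read]: This paper addresses the significant impact of label errors on classification accuracy in few-shot learning (FSL), where labeled samples are scarce. To improve robustness under label errors, it proposes a robust neural field-based image approach (RoNFA). The key to RoNFA is its two neural fields: a field for feature representation (FFR) and a field for category representation (FCR). Each neuron in the FCR has a receptive field on the FFR centered at the representative neuron for its category, which is generated by soft clustering. In the prediction stage, the range of the receptive fields adapts according to neuronal activation in the FCR to ensure prediction accuracy. This mechanism gives the model strong few-shot learning capability and robustness to label noise.
Link: https://arxiv.org/abs/2506.03461
Authors: Nan Xiang, Lifeng Xing, Dequan Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure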
Abstract:In few-shot learning (FSL), the labeled samples are scarce. Thus, label errors can significantly reduce classification accuracy. Since label errors are inevitable in realistic learning tasks, improving the robustness of the model in the presence of label errors is critical. This paper proposes a new robust neural field-based image approach (RoNFA) for few-shot image classification with noisy labels. RoNFA consists of two neural fields for feature and category representation. They correspond to the feature space and category set. Each neuron in the field for category representation (FCR) has a receptive field (RF) on the field for feature representation (FFR) centered at the representative neuron for its category generated by soft clustering. In the prediction stage, the range of these receptive fields adapts according to the neuronal activation in FCR to ensure prediction accuracy. These learning strategies provide the proposed model with excellent few-shot learning capability and strong robustness against label noises. The experimental results on real-world FSL datasets with three different types of label noise demonstrate that the proposed method significantly outperforms state-of-the-art FSL methods. Its accuracy obtained in the presence of noisy labels even surpasses the results obtained by state-of-the-art FSL methods trained on clean support sets, indicating its strong robustness against noisy labels.
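[CV-85] The effects of using created synthetic images in computer vision training
[Quick read]: This paper addresses data scarcity and high data-collection cost when training deep computer vision (CV) models, in both image-abundant and image-limited use cases. The key is to generate synthetic images with a rendering engine such as Unreal Engine 4 to supplement real image datasets, giving a nearly unlimited, reproducible, agile, and cheap source of training data. Mixing synthetic images into a real dataset narrows the gap between test and training accuracy without a conclusive effect on test accuracy, while adding a small share of real images to a synthetic-only training set halves the classification error rate.
Link: https://arxiv.org/abs/2506.03449
Authors: John W. Smutny
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Nine pages long. Main content in pages one through eight. References start at page nine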
Abstract:This paper investigates how rendering engines, like Unreal Engine 4 (UE), can be used to create synthetic images to supplement datasets for deep computer vision (CV) models in image-abundant and image-limited use cases. Using rendered synthetic images from UE can provide developers and businesses with a method of accessing nearly unlimited, reproducible, agile, and cheap training sets for their customers and applications without the threat of poisoned images from the internet or the cost of collecting them. The validity of these generated images is examined by testing the change in model test accuracy in two different-sized CV models across two binary classification cases (Cat vs Dog and Weld Defect Detection). In addition, this paper provides an implementation of how to measure the quality of synthetic images by using pre-trained CV models as auditors. Results imply that for large (VGG16) and small (MobileNetV3-small) parameter deep CV models, adding 60% additional synthetic images to a real image dataset during model training can narrow the test-training accuracy gap to ~1-2% without a conclusive effect on test accuracy compared to using real world images alone. Likewise, adding 10% additional real training images to synthetic-only training sets cut the classification error rate in half, and the error rate decreased further as more real training images were added. For these cases tested, using synthetic images from rendering engines allows researchers to use only 10% of their real images during training, compared to the traditional 50-70%. This research serves as an example of how to create synthetic images, guidelines on how to use the images, potential restrictions and possible performance improvements for data-scarce projects.
[CV-86] RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
[Quick read]: This paper addresses the poor performance of existing inversion and instruction-based image editing methods on complex scenes containing multiple entities. The key is RefEdit, an instruction-based editing model trained with a scalable synthetic data generation pipeline. Although trained on only 20,000 editing triplets, RefEdit outperforms Flux/SD3 model-based baselines across multiple benchmarks.
Link: https://arxiv.org/abs/2506.03448
Authors: Bimsara Pathiraja,Maitreya Patel,Shivam Singh,Yezhou Yang,Chitta Baral
Affiliations: Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this http URL
Abstract:Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit – an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our RefEdit, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of data. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data & checkpoint for reproducibility.
[CV-87] Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos
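[Quick read]: This paper addresses human-object interaction (HOI) recognition in videos, in particular how to fuse visual and geometric features effectively in dynamic scenes to improve interaction understanding. The key is a bottom-up approach: build entity-specific representations that preserve the distinct characteristics of each modality, then use dual-attention feature fusion with interdependent entity graph learning to move progressively from entity-level representations to high-level interaction understanding.
Link: https://arxiv.org/abs/2506.03440
Authors: Tanqiu Qiao,Ruochen Li,Frederick W. B. Li,Yoshiki Kubotani,Shigeo Morishima,Hubert P. H. Shum
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Expert Systems with Applications (ESWA)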
Abstract:Human-Object Interaction (HOI) recognition in videos requires understanding both visual patterns and geometric relationships as they evolve over time. Visual and geometric features offer complementary strengths. Visual features capture appearance context, while geometric features provide structural patterns. Effectively fusing these multimodal features without compromising their unique characteristics remains challenging. We observe that establishing robust, entity-specific representations before modeling interactions helps preserve the strengths of each modality. Therefore, we hypothesize that a bottom-up approach is crucial for effective multimodal fusion. Following this insight, we propose the Geometric Visual Fusion Graph Neural Network (GeoVis-GNN), which uses dual-attention feature fusion combined with interdependent entity graph learning. It progressively builds from entity-specific representations toward high-level interaction understanding. To advance HOI recognition to real-world scenarios, we introduce the Concurrent Partial Interaction Dataset (MPHOI-120). It captures dynamic multi-person interactions involving concurrent actions and partial engagement. This dataset helps address challenges like complex human-object dynamics and mutual occlusions. Extensive experiments demonstrate the effectiveness of our method across various HOI scenarios. These scenarios include two-person interactions, single-person activities, bimanual manipulations, and complex concurrent partial interactions. Our method achieves state-of-the-art performance.
[CV-88] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads
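[Quick read]: This paper addresses two inefficiencies in existing adapters for vision foundation models (VFMs): the interaction between the convolutional neural network (CNN) and the VFM backbone triggers gradient backpropagation into early layers, and existing methods require tuning all components, which adds complexity and may cause overfitting. The key is ViT-Split, which builds on the observation that VFM layers can be divided into an extractor that learns low-level features and an adapter that learns task-specific features. It removes the CNN branch and adds a task head and a prior head to the frozen VFM, mitigating the gradient propagation issue and reducing both the number of tuned parameters and the risk of overfitting.
Link: https://arxiv.org/abs/2506.03433
Authors: Yifan Li,Xin Li,Tianqin Li,Wenbin He,Yu Kong,Liu Ren
Affiliations: Michigan State University; Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI); Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project is available: this https URL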
Abstract:Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between convolutional neural network (CNN) and VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, task head and prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time up to 4× while achieving comparable or even better results on ADE20K, compared to other VFM adapters.
[CV-89] Multi-Spectral Gaussian Splatting with Neural Color Representation
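[Quick read]: This paper addresses multi-view consistent novel view synthesis in a multi-spectral 3D Gaussian Splatting (3DGS) framework, specifically high-quality rendering from images captured by independent cameras in different spectral domains. Previous approaches require cross-modal camera calibration, and existing 3DGS-based frameworks treat each modality separately, so they fail to exploit spectral and spatial correlations. The key is a neural color representation that encodes multi-spectral information into a learned, compact, per-splat feature embedding, which a shallow multi-layer perceptron (MLP) decodes into spectral color values. All bands are learned jointly within a unified representation, improving both multi-spectral and per-spectrum rendering quality.
Link: https://arxiv.org/abs/2506.03407
Authors: Lukas Meyer,Josef Grün,Maximilian Weiherer,Bernhard Egger,Marc Stamminger,Linus Franke
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg-Fürth
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: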
Abstract:We present MS-Splatting – a multi-spectral 3D Gaussian Splatting (3DGS) framework that is able to generate multi-view consistent novel views from images of multiple, independent cameras with different spectral domains. In contrast to previous approaches, our method does not require cross-modal camera calibration and is versatile enough to model a variety of different spectra, including thermal and near-infrared, without any algorithmic changes. Unlike existing 3DGS-based frameworks that treat each modality separately (by optimizing per-channel spherical harmonics) and therefore fail to exploit the underlying spectral and spatial correlations, our method leverages a novel neural color representation that encodes multi-spectral information into a learned, compact, per-splat feature embedding. A shallow multi-layer perceptron (MLP) then decodes this embedding to obtain spectral color values, enabling joint learning of all bands within a unified representation. Our experiments show that this simple yet effective strategy is able to improve multi-spectral rendering quality, while also leading to improved per-spectra rendering quality over state-of-the-art methods. We demonstrate the effectiveness of this new technique in agricultural applications to render vegetation indices, such as normalized difference vegetation index (NDVI).
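As a small illustration of the vegetation-index use case mentioned in the abstract: NDVI is computed per pixel from the near-infrared and red bands as (NIR - Red) / (NIR + Red). The sketch below is plain Python over made-up reflectance values. The function name `ndvi`, the `eps` guard, and the sample pixels are assumptions for illustration and are not taken from MS-Splatting.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized difference vegetation index for one pixel.

    nir, red: reflectance values in the near-infrared and red bands.
    eps guards against division by zero when both bands are zero.
    """
    return (nir - red) / (nir + red + eps)


# Hypothetical rendered band values for three pixels, as (NIR, red) pairs.
pixels = [(0.60, 0.10), (0.30, 0.25), (0.05, 0.05)]
indices = [ndvi(nir, red) for nir, red in pixels]
```

Dense, healthy vegetation reflects strongly in the near-infrared and absorbs red light, so the first pixel gets a high index while the last one sits at zero.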
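[CV-90] Temporal Vegetation Index-Based Unsupervised Crop Stress Detection via Eigenvector-Guided Contrastive Learning
[Quick read]: This paper addresses early detection of crop stress. Traditional approaches based on NDRE (Normalized Difference Red Edge) often detect stress only after visible symptoms appear or depend on labeled datasets, which limits scalability. The proposed solution, EigenCL, is guided by temporal NDRE dynamics and a biologically grounded eigendecomposition: it builds a five-point NDRE time series per patch, computes an RBF similarity matrix, and extracts the principal eigenvector, which explains 76% of the variance and correlates strongly with raw NDRE values (r = 0.95). This eigenvector defines a stress-aware similarity for contrastive embedding learning. Unlike methods that rely on visual augmentations, EigenCL pulls together embeddings with biologically similar stress trajectories and pushes apart divergent ones, producing physiologically meaningful clusters and a 76% early stress detection rate without supervision.
Link: https://arxiv.org/abs/2506.03394
Authors: Shafqaat Ahmad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: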
Abstract:Early detection of crop stress is vital for minimizing yield loss and enabling timely intervention in precision agriculture. Traditional approaches using NDRE often detect stress only after visible symptoms appear or require labeled datasets, limiting scalability. This study introduces EigenCL, a novel unsupervised contrastive learning framework guided by temporal NDRE dynamics and biologically grounded eigen decomposition. Using over 10,000 Sentinel-2 NDRE image patches from drought-affected Iowa cornfields, we constructed five-point NDRE time series per patch and derived an RBF similarity matrix. The principal eigenvector explaining 76% of the variance and strongly correlated (r = 0.95) with raw NDRE values was used to define stress-aware similarity for contrastive embedding learning. Unlike existing methods that rely on visual augmentations, EigenCL pulls embeddings together based on biologically similar stress trajectories and pushes apart divergent ones. The learned embeddings formed physiologically meaningful clusters, achieving superior clustering metrics (Silhouette: 0.748, DBI: 0.35) and enabling 76% early stress detection up to 12 days before conventional NDRE thresholds. Downstream classification yielded 95% k-NN and 91% logistic regression accuracy. Validation on an independent 2023 Nebraska dataset confirmed generalizability without retraining. EigenCL offers a label-free, scalable approach for early stress detection that aligns with underlying plant physiology and is suitable for real-world deployment in data-scarce agricultural environments.
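The first stage described in the abstract (five-point NDRE series, RBF similarity matrix, principal eigenvector) can be sketched in a few lines. This is a minimal standard-library sketch, not the authors' code: the `gamma` value, the toy series, the use of power iteration, and the reading of "explained variance" as the top eigenvalue divided by the matrix trace are all assumptions made here for illustration.

```python
import math


def rbf_similarity(series, gamma=1.0):
    """Pairwise RBF similarity exp(-gamma * ||a - b||^2) between time series."""
    n = len(series)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(series[i], series[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


def principal_eigenvector(matrix, iters=200):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * m for v, m in zip(vec, mv))
    return vec, eigenvalue


# Hypothetical five-point NDRE series for four patches: two stable, two declining.
ndre_series = [
    [0.42, 0.43, 0.44, 0.44, 0.45],
    [0.41, 0.42, 0.44, 0.45, 0.45],
    [0.40, 0.36, 0.31, 0.27, 0.22],
    [0.41, 0.37, 0.30, 0.26, 0.21],
]
similarity = rbf_similarity(ndre_series, gamma=10.0)
vector, value = principal_eigenvector(similarity)
# Share of the trace (sum of all eigenvalues) carried by the top eigenvalue.
explained = value / sum(similarity[i][i] for i in range(len(similarity)))
```

In a contrastive setup, patches whose entries in `vector` are close could be treated as positives and distant ones as negatives; how EigenCL actually turns the eigenvector into pairs is not specified in the abstract.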
[CV-91] Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery
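[Quick read]: This paper addresses the largely untapped potential of environmental soundscapes in large-scale geographic analysis, specifically how well urban sounds correspond to visual scenes. The key is a multimodal approach that combines geo-referenced sound recordings with street-level and remote sensing imagery. It uses the AST model for audio, CLIP and RemoteCLIP for imagery, and CLIPSeg and Seg-Earth OV for semantic segmentation to extract embeddings and class-level features for evaluating cross-modal similarity.
Link: https://arxiv.org/abs/2506.03388
Authors: Pengyu Chen,Xiao Huang,Teng Fei,Sicheng Wang
Affiliations: University of South Carolina; Emory University; University of Canterbury; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: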
Abstract:Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony–Geophony–Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.
[CV-92] A Foundation Model for Spatial Proteomics
[Quick read]: This paper addresses the analysis challenges in spatial proteomics that arise from limited data and high-dimensional, multi-channel, heterogeneous imaging, especially effective task modeling when annotated data are scarce. The key is KRONOS, a pretrained foundation model built for spatial proteomics. Trained with self-supervision on over 47 million image patches, it learns biologically meaningful representations across scales, from cells to tissues, and introduces architectural adaptations for multi-channel, heterogeneous data, enabling efficient downstream tasks and cross-institutional comparisons.
Link: https://arxiv.org/abs/2506.03373
Authors: Muhammad Shaban,Yuzhou Chang,Huaying Qiu,Yao Yu Yeo,Andrew H. Song,Guillaume Jaume,Yuchen Wang,Luca L. Weishaupt,Tong Ding,Anurag Vaidya,Abdallah Lamane,Daniel Shao,Mohammed Zidane,Yunhao Bai,Paige McCallum,Shuli Luo,Wenrui Wu,Yang Wang,Precious Cramer,Chi Ngai Chan,Pierre Stephan,Johanna Schaffenrath,Jia Le Lee,Hendrik A. Michel,Caiwei Tian,Cristina Almagro-Perez,Sophia J. Wagner,Sharifa Sahai,Ming Y. Lu,Richard J. Chen,Andrew Zhang,Mark Edward M. Gonzales,Ahmad Makky,Jia-Ying Joey Lee,Hao Cheng,Nourhan El Ahmar,Sayed Matar,Maximilian Haist,Darci Phillips,Yuqi Tan,Garry P. Nolan,W. Richard Burack,Jacob D. Estes,Jonathan T.C. Liu,Toni K Choueiri,Neeraj Agarwal,Marc Barry,Scott J. Rodig,Long Phi Le,Georg Gerber,Christian M. Schürch,Fabian J. Theis,Youn H Kim,Joe Yeong,Sabina Signoretti,Brooke E. Howitt,Lit-Hsin Loo,Qin Ma,Sizun Jiang,Faisal Mahmood
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at this https URL.
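[CV-93] Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure Bias and Inference in Korean Street Views
[Quick read]: This paper addresses the shortcomings of existing benchmarks for image-based geolocation with vision-language models (VLMs): they are coarse-grained, linguistically biased, and lack multimodal and privacy-aware evaluation. The key is KoreaGEO Bench, the first fine-grained, multimodal geolocation benchmark for Korean street views. It contains 1,080 high-resolution images across four urban clusters and nine place types, with multi-contextual annotations and two styles of Korean captions that simulate real-world privacy exposure. The study also introduces a three-path evaluation protocol to assess the accuracy, spatial bias, and reasoning behavior of ten mainstream VLMs under different input modalities.
Link: https://arxiv.org/abs/2506.03371
Authors: Xiaonan Wang,Bo Shao,Hansaem Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: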
Abstract:Recent advances in vision-language models (VLMs) have enabled accurate image-based geolocation, raising serious concerns about location privacy risks in everyday social media posts. However, current benchmarks remain coarse-grained, linguistically biased, and lack multimodal and privacy-aware evaluations. To address these gaps, we present KoreaGEO Bench, the first fine-grained, multimodal geolocation benchmark for Korean street views. Our dataset comprises 1,080 high-resolution images sampled across four urban clusters and nine place types, enriched with multi-contextual annotations and two styles of Korean captions simulating real-world privacy exposure. We introduce a three-path evaluation protocol to assess ten mainstream VLMs under varying input modalities and analyze their accuracy, spatial bias, and reasoning behavior. Results reveal modality-driven shifts in localization precision and highlight structural prediction biases toward core cities.
[CV-94] Urban Visibility Hotspots: Quantifying Building Vertex Visibility from Connected Vehicle Trajectories using Spatial Indexing
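[Quick read]: This paper addresses effective placement of out-of-home advertising and street furniture, that is, how to accurately identify locations that offer maximum visual exposure to target audiences, particularly vehicular traffic. Traditional methods rely on static traffic counts or subjective assessments; this paper proposes a data-driven method that objectively quantifies location visibility by analyzing large-scale connected vehicle trajectory data (sourced from Compass IoT). It models a forward-projected visibility area for each vehicle position derived from interpolated trajectories and combines it with building vertex locations extracted from OpenStreetMap to compute the cumulative visual exposure ("visibility count") of thousands of potential points of interest near roadways. The core technical contribution is a BallTree spatial index over building vertices, which enables efficient (O(log N) complexity) radius queries and makes it far cheaper to evaluate the viewing areas of millions of trajectory points across many trips.
Link: https://arxiv.org/abs/2506.03365
Authors: Artur Grigorev,Adriana-Simona Mihaita
Affiliations: University of Technology Sydney
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
Comments: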
Abstract:Effective placement of Out-of-Home advertising and street furniture requires accurate identification of locations offering maximum visual exposure to target audiences, particularly vehicular traffic. Traditional site selection methods often rely on static traffic counts or subjective assessments. This research introduces a data-driven methodology to objectively quantify location visibility by analyzing large-scale connected vehicle trajectory data (sourced from Compass IoT) within urban environments. We model the dynamic driver field-of-view using a forward-projected visibility area for each vehicle position derived from interpolated trajectories. By integrating this with building vertex locations extracted from OpenStreetMap, we quantify the cumulative visual exposure, or "visibility count", for thousands of potential points of interest near roadways. The analysis reveals that visibility is highly concentrated, identifying specific "visual hotspots" that receive disproportionately high exposure compared to average locations. The core technical contribution involves the construction of a BallTree spatial index over building vertices. This enables highly efficient (O(log N) complexity) radius queries to determine which vertices fall within the viewing circles of millions of trajectory points across numerous trips, significantly outperforming brute-force geometric checks. Analysis reveals two key findings: 1) Visibility is highly concentrated, identifying distinct ‘visual hotspots’ receiving disproportionately high exposure compared to average locations. 2) The aggregated visibility counts across vertices conform to a Log-Normal distribution.
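The radius-query-and-count pattern behind the "visibility count" can be sketched with the standard library alone. To keep the example dependency-free, it swaps in a simple uniform-grid spatial hash where the paper uses a BallTree, so it shows the counting logic rather than the paper's index or its O(log N) behaviour. The class `GridIndex`, the function `visibility_counts`, the viewing circle centered on the vehicle position, and all coordinates are hypothetical.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform-grid spatial hash over 2D points (a stand-in for a BallTree)."""

    def __init__(self, points, cell_size):
        self.points = points
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for idx, (x, y) in enumerate(points):
            self.cells[self._cell(x, y)].append(idx)

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def query_radius(self, center, radius):
        """Return indices of points within `radius` of `center`."""
        cx, cy = center
        reach = math.ceil(radius / self.cell_size)
        gx, gy = self._cell(cx, cy)
        hits = []
        # Only cells that can intersect the query circle are visited.
        for ix in range(gx - reach, gx + reach + 1):
            for iy in range(gy - reach, gy + reach + 1):
                for idx in self.cells.get((ix, iy), ()):
                    px, py = self.points[idx]
                    if (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2:
                        hits.append(idx)
        return hits


def visibility_counts(vertices, trajectory_points, radius, cell_size=50.0):
    """Count, per vertex, how many trajectory points have it within `radius`."""
    index = GridIndex(vertices, cell_size)
    counts = [0] * len(vertices)
    for point in trajectory_points:
        for idx in index.query_radius(point, radius):
            counts[idx] += 1
    return counts


# Hypothetical building vertices and vehicle positions in metres (projected coordinates).
vertices = [(0.0, 0.0), (30.0, 40.0), (500.0, 500.0)]
trajectory = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0), (480.0, 490.0)]
counts = visibility_counts(vertices, trajectory, radius=60.0)
```

With a tree-based index from a library, the loop in `visibility_counts` stays the same and only the `query_radius` call changes.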
[CV-95] Robustness in Both Domains: CLIP Needs a Robust Text Encoder
[Quick read]: This paper addresses the robustness of CLIP text encoders to adversarial input attacks; prior work has studied the robustness of CLIP image encoders, but the robustness of text encoders remains unexplored. The key is LEAF, an efficient adversarial finetuning method for the text domain that scales to large CLIP models and significantly improves zero-shot adversarial accuracy in the text domain while maintaining the vision performance provided by robust image encoders.
Link: https://arxiv.org/abs/2506.03355
Authors: Elias Abad Rocamora,Christian Schlarmann,Naman Deep Singh,Yongtao Wu,Matthias Hein,Volkan Cevher
Affiliations: École Polytechnique Fédérale de Lausanne; Tübingen AI center; University of Tübingen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made toward making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
[CV-96] Semiconductor SEM Image Defect Classification Using Supervised and Semi-Supervised Learning with Vision Transformers
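[Quick read]: This paper addresses defect control in semiconductor manufacturing, specifically automatic defect classification (ADC) of nano-scale defects in scanning electron microscope (SEM) images. Manual classification is limited by time, labor cost, and human bias, so the paper applies vision transformer (ViT) neural networks to classify defects automatically. The key is to use transfer learning (from DinoV2) and semi-supervised learning to reach over 90% classification accuracy with fewer than 15 labeled images per defect class, providing an efficient and flexible basis for a platform-agnostic in-house classification tool.
Link: https://arxiv.org/abs/2506.03345
Authors: Chien-Fu (Frank) Huang,Katherine Sieg,Leonid Karlinksy,Nash Flores,Rebekah Sheraw,Xin Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at 36th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC) 2025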
Abstract:Controlling defects in semiconductor processes is important for maintaining yield, improving production cost, and preventing time-dependent critical component failures. Electron beam-based imaging has been used as a tool to survey wafers in the line and inspect for defects. However, manual classification of images for these nano-scale defects is limited by time, labor constraints, and human biases. In recent years, deep learning computer vision algorithms have been shown to be effective solutions for image-based inspection applications in industry. This work proposes application of vision transformer (ViT) neural networks for automatic defect classification (ADC) of scanning electron microscope (SEM) images of wafer defects. We evaluated our proposed methods on 300mm wafer semiconductor defect data from our fab in IBM Albany. We studied 11 defect types from over 7400 total images and investigated the potential of transfer learning of DinoV2 and semi-supervised learning for improved classification accuracy and efficient computation. We were able to achieve classification accuracies of over 90% with fewer than 15 images per defect class. Our work demonstrates the potential to apply the proposed framework for a platform-agnostic in-house classification tool with faster turnaround time and flexibility.
[CV-97] Seeing the Arrow of Time in Large Multimodal Models
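[Quick read]: This paper addresses the weak perception of the Arrow of Time (AoT) in modern large multimodal models (LMMs) for video understanding: current models struggle to perceive and use temporal directionality in video. The key is ArrowRL, a reinforcement learning (RL)-based training strategy whose core innovation is a reverse reward that encourages divergent video interpretations between forward and reversed visual frames, strengthening the model's awareness of temporal direction.
Link: https://arxiv.org/abs/2506.03340
Authors: Zihui Xue,Mi Luo,Kristen Grauman
Affiliations: The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL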
Abstract:The Arrow of Time (AoT), time’s irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10% respectively). This validates ArrowRL’s effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.
[CV-98] SportMamba: Adaptive Non-Linear Multi-Object Tracking with State Space Models for Team Sports CVPR
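[Quick read]: This paper addresses multi-object tracking (MOT) in team sports, where fast motion and frequent occlusions cause motion blur and identity switches. Conventional methods rely on object detection and appearance-based tracking and perform poorly in these scenes because appearance cues are ambiguous and motion is highly non-linear. The key is SportMamba, whose technical contributions are a mamba-attention mechanism that models non-linear motion by implicitly focusing on relevant embedding dependencies, and a height-adaptive spatial association metric that reduces identity switches from partial occlusions by accounting for scale variations due to depth changes. It also extends the detection search space with adaptive buffers to improve association in fast-motion scenarios.
Link: https://arxiv.org/abs/2506.03335
Authors: Dheeraj Khanna,Jerrin Bright,Yuhao Chen,John S. Zelek
Affiliations: University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted at CVSports, IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’25). The paper has 8 pages, including 6 figures and 5 tables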
Abstract:Multi-object tracking (MOT) in team sports is particularly challenging due to the fast-paced motion and frequent occlusions resulting in motion blur and identity switches, respectively. Predicting player positions in such scenarios is particularly difficult due to the observed highly non-linear motion patterns. Current methods are heavily reliant on object detection and appearance-based tracking, which struggle to perform in complex team sports scenarios, where appearance cues are ambiguous and motion patterns do not necessarily follow a linear pattern. To address these challenges, we introduce SportMamba, an adaptive hybrid MOT technique specifically designed for tracking in dynamic team sports. The technical contribution of SportMamba is twofold. First, we introduce a mamba-attention mechanism that models non-linear motion by implicitly focusing on relevant embedding dependencies. Second, we propose a height-adaptive spatial association metric to reduce ID switches caused by partial occlusions by accounting for scale variations due to depth changes. Additionally, we extend the detection search space with adaptive buffers to improve associations in fast-motion scenarios. Our proposed technique, SportMamba, demonstrates state-of-the-art performance on various metrics in the SportsMOT dataset, which is characterized by complex motion and severe occlusion. Furthermore, we demonstrate its generalization capability through zero-shot transfer to VIP-HTD, an ice hockey dataset.
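To make the idea of widening the association search space concrete, here is a generic buffered-IoU sketch: both boxes are expanded by a margin proportional to their height before the overlap is computed, so a fast-moving player whose raw boxes no longer overlap can still be matched. This is an illustration of the general technique only; the function names, the `ratio` value, and the way the margin scales with height are assumptions and should not be read as SportMamba's actual metric.

```python
def expand_box(box, ratio):
    """Grow an (x1, y1, x2, y2) box by `ratio` times its height on every side."""
    x1, y1, x2, y2 = box
    margin = ratio * (y2 - y1)
    return (x1 - margin, y1 - margin, x2 + margin, y2 + margin)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def buffered_iou(track_box, det_box, ratio=0.3):
    """IoU after expanding both boxes, so fast-moving targets can still match."""
    return iou(expand_box(track_box, ratio), expand_box(det_box, ratio))


# Hypothetical player boxes in consecutive frames with no raw overlap.
previous = (100.0, 100.0, 120.0, 160.0)
current = (125.0, 100.0, 145.0, 160.0)
plain = iou(previous, current)
buffered = buffered_iou(previous, current)
```

Tying the margin to box height means players close to the camera, who appear larger and move more pixels per frame, get a proportionally larger buffer.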
[CV-99] Learning Optical Flow Field via Neural Ordinary Differential Equation CVPR
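[Quick read]: This paper addresses the suboptimal performance of optical flow estimation methods that rely on a fixed number of refinement iterations. The key is a prediction method based on a continuous model, namely neural ordinary differential equations (ODE), which adjusts the number of compute steps dynamically to the input data and thereby predicts the flow field better.
Link: https://arxiv.org/abs/2506.03290
Authors: Leyla Mirvakhabova,Hong Cai,Jisoo Jeong,Hanno Ackermann,Farhad Zanjani,Fatih Porikli
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPRW 2025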
Abstract:Recent works on optical flow estimation use neural networks to predict the flow field that maps positions of one image to positions of the other. These networks consist of a feature extractor, a correlation volume, and finally several refinement steps. These refinement steps mimic the iterative refinements performed by classical optimization algorithms and are usually implemented by neural layers (e.g., GRU) which are recurrently executed for a fixed and pre-determined number of steps. However, relying on a fixed number of steps may result in suboptimal performance because it is not tailored to the input data. In this paper, we introduce a novel approach for predicting the derivative of the flow using a continuous model, namely neural ordinary differential equations (ODE). One key advantage of this approach is its capacity to model an equilibrium process, dynamically adjusting the number of compute steps based on the data at hand. By following a particular neural architecture, ODE solver, and associated hyperparameters, our proposed model can replicate the exact same updates as recurrent cells used in existing works, offering greater generality. Through extensive experimental analysis on optical flow benchmarks, we demonstrate that our approach achieves an impressive improvement over baseline and existing models, all while requiring only a single refinement step.
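The abstract's remark that an ODE formulation can reproduce recurrent updates follows from a simple identity: a recurrent step h ← g(h) is exactly one explicit Euler step of size 1 on dh/dt = g(h) - h. The toy sketch below shows this on a scalar and adds an equilibrium-based stopping rule to mimic a data-dependent number of steps. The update `toy_update`, the tolerance, and the fixed-step Euler solver are assumptions for illustration; the paper's network and solver are not described in enough detail here to reproduce.

```python
def refine_recurrent(update, state, steps):
    """Fixed number of recurrent refinement steps: h <- update(h)."""
    for _ in range(steps):
        state = update(state)
    return state


def refine_euler(update, state, step_size, max_steps, tol=None):
    """Explicit Euler integration of dh/dt = update(h) - h.

    With step_size=1 each Euler step equals one recurrent update.
    If tol is given, stop once the derivative is small (near equilibrium).
    Returns the final state and the number of steps taken.
    """
    for taken in range(max_steps):
        derivative = update(state) - state
        if tol is not None and abs(derivative) < tol:
            return state, taken
        state = state + step_size * derivative
    return state, max_steps


def toy_update(h):
    """Hypothetical scalar refinement that contracts toward the fixed point 2.0."""
    return 0.5 * h + 1.0


recurrent_result = refine_recurrent(toy_update, 0.0, steps=12)
euler_result, _ = refine_euler(toy_update, 0.0, step_size=1.0, max_steps=12)
adaptive_result, steps_used = refine_euler(
    toy_update, 0.0, step_size=1.0, max_steps=100, tol=1e-3
)
```

The first two results coincide, while the adaptive run stops as soon as the state is close to equilibrium, well before the step budget is spent.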
[CV-100] Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas
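[Quick read]: This paper addresses the high inference cost of Diffusion Transformers (DiTs) by reducing redundant computation across inference steps. The key is dynamic sparsity: recompute only the fastest-changing intermediate activations and cache the rest. A voxel-based reordering of input tokens introduces column-wise sparsity, sparse operations are optimized for better GPU utilization, and the computation of sparsity patterns and cache updates is overlapped with other computation to hide the extra latency, which yields substantially faster inference.
Link: https://arxiv.org/abs/2506.03275
Authors: Austin Silveria,Soham V. Govande,Daniel Y. Fu
Affiliations: University of California, San Diego; Together AI; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures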
Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art performance in high-quality image and video generation but incur substantial compute cost at inference. A common observation is that DiT latent noise vectors change slowly across inference steps, which suggests that the DiT compute may be redundant across steps. In this paper, we aim to speed up inference by reducing this redundancy, without additional training. We first study how activations change between steps in two state-of-the-art open-source DiTs. We find that just 5-25% of the values in attention and MLP explain 70-90% of the change in activations across steps. This finding motivates our approach, Chipmunk, which uses dynamic sparsity at inference time to recompute only the fastest-changing intermediate activations, while caching the rest. Dynamic sparsity introduces two systems challenges: (1) sparse attention and MLP operations tend to underutilize GPU tensor cores; and (2) computing dynamic sparsity patterns at runtime and caching activations both introduce overhead. To address these challenges, Chipmunk first uses a voxel-based reordering of input tokens to introduce column-wise sparsity. We implement column-sparse kernels utilizing efficient sparse gathers from global to shared GPU memory, achieving a 9.3x speedup at 93% sparsity compared to highly-optimized dense baselines. Second, Chipmunk overlaps the computation of sparsity patterns and cache updates with other parts of the computation (e.g., second layer of the MLP) to hide the extra latency. Chipmunk achieves up to 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev without compromising generation quality. Furthermore, we show that Chipmunk can be stacked on top of full step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.25x speedup on FLUX.1-dev with minimal quality impact.
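The cache-plus-sparse-delta arithmetic can be shown on a single linear layer. If the output is y = Σ_j a_j · W[j], then the output at the next step equals the cached output plus Σ_j (a_j_new - a_j_old) · W[j], and restricting that sum to the k largest changes gives an approximation. The sketch below is a toy in plain Python under several assumptions: the function names and data are hypothetical, and it picks the top-k by comparing the full new activations, which a real system would avoid. It is not Chipmunk's kernel or its sparsity-pattern selection.

```python
def dense_output(activations, weights):
    """y = sum_j activations[j] * weights[j], where weights[j] is a row vector."""
    dim = len(weights[0])
    out = [0.0] * dim
    for a, row in zip(activations, weights):
        for d in range(dim):
            out[d] += a * row[d]
    return out


def sparse_delta_update(cached_out, cached_act, new_act, weights, k):
    """Refresh only the k activations that changed most since the cached step.

    Returns the updated output and the updated activation cache. Entries that
    are not refreshed keep their cached (stale) values.
    """
    deltas = [n - c for n, c in zip(new_act, cached_act)]
    top = sorted(range(len(deltas)), key=lambda j: abs(deltas[j]), reverse=True)[:k]
    out = list(cached_out)
    act = list(cached_act)
    for j in top:
        for d in range(len(out)):
            out[d] += deltas[j] * weights[j][d]
        act[j] = new_act[j]
    return out, act


# Hypothetical intermediate activations at two consecutive inference steps.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
step_a = [1.0, 2.0, 3.0, 4.0]
step_b = [1.01, 2.0, 3.9, 4.02]

cached_out = dense_output(step_a, weights)
approx_out, cached_act = sparse_delta_update(cached_out, step_a, step_b, weights, k=1)
exact_out = dense_output(step_b, weights)
```

With `k` equal to the number of activations the update is exact; with a small `k` the result stays close to the dense output as long as most of the change is concentrated in a few entries, which is the property the abstract reports for attention and MLP activations.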
[CV-101] Pre-trained Vision-Language Models Assisted Noisy Partial Label Learning
[Quick read]: This paper addresses how to learn effectively from noisy labels generated by pre-trained vision-language models (VLMs) in the noisy partial label learning (NPLL) setting. The key to the solution is a collaborative consistency regularization (Co-Reg) method: two neural networks are trained simultaneously and jointly purify the training labels through a "Co-Pseudo-Labeling" mechanism, while consistency regularization is enforced in both the label space and the feature representation space, to cope with the instance-dependent noise produced by pre-trained models.
Link: https://arxiv.org/abs/2506.03229
Authors: Qian-Wei Wang,Yuqiu Xie,Letian Zhang,Zimo Liu,Shu-Tao Xia
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Institute of Perceptual Intelligence, Pengcheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVa and GPT-4V, the direction of using these models to replace time-consuming manual annotation workflows and achieve “manual-annotation-free” training for downstream tasks has become a highly promising research avenue. This paper focuses on learning from noisy partial labels annotated by pre-trained VLMs and proposes an innovative collaborative consistency regularization (Co-Reg) method. Unlike the symmetric noise primarily addressed in traditional noisy label learning, the noise generated by pre-trained models is instance-dependent, embodying the underlying patterns of the pre-trained models themselves, which significantly increases the learning difficulty for the model. To address this, we simultaneously train two neural networks that implement collaborative purification of training labels through a “Co-Pseudo-Labeling” mechanism, while enforcing consistency regularization constraints in both the label space and feature representation space. Our method can also leverage few-shot manually annotated valid labels to further enhance its performances. Comparative experiments with different denoising and disambiguation algorithms, annotation manners, and pre-trained model application schemes fully validate the effectiveness of the proposed method, while revealing the broad prospects of integrating weakly-supervised learning techniques into the knowledge distillation process of pre-trained models.
[CV-102] OpenCarbon: A Contrastive Learning-based Cross-Modality Neural Approach for High-Resolution Carbon Emission Prediction Using Open Data IJCAI2025
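[Quick read]: This paper addresses high-resolution carbon emission estimation, which matters for effective emission governance and mitigation planning. Conventional carbon accounting is limited by heavy data collection effort, while open data and advanced learning techniques offer a potential solution. The key to the solution is to use two modalities of open data, satellite images and point-of-interest (POI) data. A cross-modality information extraction and fusion module captures the complementary functionality information and its interactions, and a neighborhood-informed aggregation module captures the agglomeration effect arising from spatial contiguity, improving the accuracy and generalization of carbon emission prediction.
Link: https://arxiv.org/abs/2506.03224
Authors: Jinwei Zeng,Yu Liu,Guozhen Zhang,Jingtao Ding,Yuming Lin,Jian Yuan,Yong Li
Affiliations: Tsinghua University; TsingRoc
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
Comments: Accepted by IJCAI 2025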
Abstract:Accurately estimating high-resolution carbon emissions is crucial for effective emission governance and mitigation planning. While conventional methods for precise carbon accounting are hindered by substantial data collection efforts, the rise of open data and advanced learning techniques offers a promising solution. Once an open data-based prediction model is developed and trained, it can easily infer emissions for new areas based on available open data. To address this, we incorporate two modalities of open data, satellite images and point-of-interest (POI) data, to predict high-resolution urban carbon emissions, with satellite images providing macroscopic and static and POI data offering fine-grained and relatively dynamic functionality information. However, estimating high-resolution carbon emissions presents two significant challenges: the intertwined and implicit effects of various functionalities on carbon emissions, and the complex spatial contiguity correlations that give rise to the agglomeration effect. Our model, OpenCarbon, features two major designs that target the challenges: a cross-modality information extraction and fusion module to extract complementary functionality information from two modules and model their interactions, and a neighborhood-informed aggregation module to capture the spatial contiguity correlations. Extensive experiments demonstrate our model’s superiority, with a significant performance gain of 26.6% on R2. Further generalizability tests and case studies also show OpenCarbon’s capacity to capture the intrinsic relation between urban functionalities and carbon emissions, validating its potential to empower efficient carbon governance and targeted carbon mitigation planning. Codes and data are available: this https URL.
[CV-103] ConMamba: Contrastive Vision Mamba for Plant Disease Detection
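[Quick read]: This paper addresses several problems of deep learning methods for Plant Disease Detection (PDD): reliance on large amounts of annotated data, high computational cost, and difficulty capturing long-range dependencies in visual representations. The key to the solution is ConMamba, a new Self-supervised Learning (SSL) framework. It integrates a Vision Mamba Encoder (VME), which uses a bidirectional State Space Model (SSM) to capture long-range dependencies efficiently, and introduces a dual-level contrastive loss with dynamic weight adjustment to optimize local-global feature alignment.
Link: https://arxiv.org/abs/2506.03213
Authors: Abdullah Al Mamun,Miaohua Zhang,David Ahmedt-Aristizabal,Zeeshan Hayder,Mohammad Awrangjeb
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: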
Abstract:Plant Disease Detection (PDD) is a key aspect of precision agriculture. However, existing deep learning methods often rely on extensively annotated datasets, which are time-consuming and costly to generate. Self-supervised Learning (SSL) offers a promising alternative by exploiting the abundance of unlabeled data. However, most existing SSL approaches suffer from high computational costs due to convolutional neural networks or transformer-based architectures. Additionally, they struggle to capture long-range dependencies in visual representation and rely on static loss functions that fail to align local and global features effectively. To address these challenges, we propose ConMamba, a novel SSL framework specially designed for PDD. ConMamba integrates the Vision Mamba Encoder (VME), which employs a bidirectional State Space Model (SSM) to capture long-range dependencies efficiently. Furthermore, we introduce a dual-level contrastive loss with dynamic weight adjustment to optimize local-global feature alignment. Experimental results on three benchmark datasets demonstrate that ConMamba significantly outperforms state-of-the-art methods across multiple evaluation metrics. This provides an efficient and robust solution for PDD.
[CV-104] Channel-adaptive Cross-modal Generative Semantic Communication for Point Cloud Transmission
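[Quick read]: This paper addresses efficient transmission of point clouds (PCs) in applications such as autonomous driving and extended reality. The key to the solution is GenSeC-PC, a cross-modal generative communication framework based on semantic communication (SemCom). It fuses the semantic information of images and point clouds, uses generative priors to ensure reliable reconstruction from noisy or incomplete source point clouds, supports fully analog transmission to improve compression efficiency, and uses an improved diffusion model to accelerate decoding, enabling millisecond-level real-time communication.
Link: https://arxiv.org/abs/2506.03211
Authors: Wanting Yang,Zehui Xiong,Qianqian Yang,Ping Zhang,Merouane Debbah,Rahim Tafazolli
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments: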
Abstract:With the rapid development of autonomous driving and extended reality, efficient transmission of point clouds (PCs) has become increasingly important. In this context, we propose a novel channel-adaptive cross-modal generative semantic communication (SemCom) for PC transmission, called GenSeC-PC. GenSeC-PC employs a semantic encoder that fuses images and point clouds, where images serve as non-transmitted side information. Meanwhile, the decoder is built upon the backbone of PointDif. Such a cross-modal design not only ensures high compression efficiency but also delivers superior reconstruction performance compared to PointDif. Moreover, to ensure robust transmission and reduce system complexity, we design a streamlined and asymmetric channel-adaptive joint semantic-channel coding architecture, where only the encoder needs the feedback of average signal-to-noise ratio (SNR) and available bandwidth. In addition, rectified denoising diffusion implicit models is employed to accelerate the decoding process to the millisecond level, enabling real-time PC communication. Unlike existing methods, GenSeC-PC leverages generative priors to ensure reliable reconstruction even from noisy or incomplete source PCs. More importantly, it supports fully analog transmission, improving compression efficiency by eliminating the need for error-free side information transmission common in prior SemCom approaches. Simulation results confirm the effectiveness of cross-modal semantic extraction and dual-metric guided fine-tuning, highlighting the framework’s robustness across diverse conditions, including low SNR, bandwidth limitations, varying numbers of 2D images, and previously unseen objects.
[CV-105] FLEX: A Large-Scale Multi-Modal Multi-Action Dataset for Fitness Action Quality Assessment
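[Quick read]: This paper addresses the limitations of current Action Quality Assessment (AQA) methods and datasets for fitness actions: existing work targets single-view competitive sports scenarios, relies only on the RGB modality, and lacks professional assessment and guidance for fitness actions. The key to the solution is the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals. Using high-precision motion capture (MoCap), it collects 20 weight-loaded actions performed by 38 subjects across different skill levels, and contains RGB video, 3D pose, sEMG, and physiological information. It also uses knowledge graphs to construct annotation rules in the form of penalty functions, which helps improve model performance.
Link: https://arxiv.org/abs/2506.03198
Authors: Hao Yin,Lijun Gu,Paritosh Parmar,Lin Xu,Tianxiao Guo,Weiwei Fu,Yang Zhang,Tianyou Zheng
Affiliations: School of Biomedical Engineering (Suzhou), USTC; Suzhou Institute of Biomedical Engineering and Technology, CAS; Institute of High-Performance Computing, A*STAR, Singapore; School of Psychology, Beijing Sports University; School of Competitive Sports, Beijing Sports University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: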
Abstract:With the increasing awareness of health and the growing desire for aesthetic physique, fitness has become a prevailing trend. However, the potential risks associated with fitness training, especially with weight-loaded fitness actions, cannot be overlooked. Action Quality Assessment (AQA), a technology that quantifies the quality of human action and provides feedback, holds the potential to assist fitness enthusiasts of varying skill levels in achieving better training outcomes. Nevertheless, current AQA methodologies and datasets are limited to single-view competitive sports scenarios and RGB modality and lack professional assessment and guidance of fitness actions. To address this gap, we propose the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals into AQA. FLEX utilizes high-precision MoCap to collect 20 different weight-loaded actions performed by 38 subjects across 3 different skill levels for 10 repetitions each, containing 5 different views of the RGB video, 3D pose, sEMG, and physiological information. Additionally, FLEX incorporates knowledge graphs into AQA, constructing annotation rules in the form of penalty functions that map weight-loaded actions, action keysteps, error types, and feedback. We conducted various baseline methodologies on FLEX, demonstrating that multimodal data, multiview data, and fine-grained annotations significantly enhance model performance. FLEX not only advances AQA methodologies and datasets towards multi-modal and multi-action scenarios but also fosters the integration of artificial intelligence within the fitness domain. Dataset and code are available at this https URL.
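[CV-106] Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
[Quick read]: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on fine-grained image classification. Although MLLMs perform well on general zero-shot image classification, they still struggle to distinguish visually similar subcategories, which requires precise attention to subtle visual details. The key to the solution is AutoSEP, an iterative self-supervised prompt learning framework that enhances the fine-grained classification ability of MLLMs in an unsupervised manner. The core idea is to use unlabeled data to learn a description prompt that guides MLLMs to identify the key discriminative features in an image, thereby improving classification accuracy.
Link: https://arxiv.org/abs/2506.03195
Authors: Yunqi Hong,Sohyun An,Andrew Bai,Neil Y.C. Lin,Cho-Jui Hsieh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: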
Abstract:Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories–details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP on average improves 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: this https URL
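[CV-107] HueManity: Probing Fine-Grained Visual Perception in MLLMs
[Quick read]: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on fine-grained perceptual tasks, in particular the limits of their visual perception. The key to the solution is the HueManity benchmark dataset of 83,850 images in which two-character alphanumeric strings are embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Evaluating nine state-of-the-art MLLMs on this dataset reveals a significant performance deficit on visual perception tasks and provides a basis for future improvement.
Link: https://arxiv.org/abs/2506.03194
Authors: Rynaa Grover,Jayant Sravan Tamarapalli,Sahiti Yerramilli,Nilay Pande
Affiliations: Google; Waymo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: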
Abstract:Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric 'easy' task and a striking 3% on the alphanumeric 'hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs.
[CV-108] Human Fall Detection using Transfer Learning-based 3D CNN
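[Quick read]: This paper addresses accidental falls among older adults, a significant health issue, and proposes a vision-based automated fall detection monitoring system. The key to the solution is to use a pre-trained 3D convolutional neural network (3D CNN) to extract spatio-temporal features from video sequences and to classify activities with a support vector machine (SVM), distinguishing falls from activities of daily living (ADL). Because the weights of a 3D CNN pre-trained on the Sports1M dataset are reused and only the SVM classifier is trained, training time is significantly reduced.
Link: https://arxiv.org/abs/2506.03193
Authors: Ekram Alam,Abu Sufian,Paramartha Dutta,Marco Leo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: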
Abstract:Unintentional or accidental falls are one of the significant health issues in senior persons. The population of senior persons is increasing steadily. So, there is a need for an automated fall detection monitoring system. This paper introduces a vision-based fall detection system using a pre-trained 3D CNN. Unlike 2D CNN, 3D CNN extracts not only spatial but also temporal features. The proposed model leverages the original learned weights of a 3D CNN model pre-trained on the Sports1M dataset to extract the spatio-temporal features. Only the SVM classifier was trained, which saves the time required to train the 3D CNN. Stratified shuffle five split cross-validation has been used to split the dataset into training and testing data. Extracted features from the proposed 3D CNN model were fed to an SVM classifier to classify the activity as fall or ADL. Two datasets, GMDCSA and CAUCAFall, were utilized to conduct the experiment. The source code for this work can be accessed via the following link: this https URL.
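The evaluation protocol named in the abstract, a stratified shuffle split with five splits, can be sketched in plain Python. This is a hand-rolled illustration, not the authors' code: the function name, the 20% test fraction, and the seed are assumptions, and the 3D CNN feature extraction and SVM training are omitted. In practice scikit-learn's `StratifiedShuffleSplit` provides the same kind of split.

```python
import random
from collections import defaultdict


def stratified_shuffle_splits(labels, n_splits=5, test_fraction=0.2, seed=0):
    """Return n_splits (train_indices, test_indices) pairs that keep class ratios.

    test_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = []
    for _ in range(n_splits):
        train, test = [], []
        # Shuffle each class independently so every split keeps the class balance.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        splits.append((sorted(train), sorted(test)))
    return splits


# Toy labels standing in for per-video classes (fall vs. ADL).
labels = ["fall"] * 10 + ["adl"] * 20
splits = stratified_shuffle_splits(labels)
```

Each split would then be used to fit the classifier on the training indices and score it on the test indices, with results averaged over the five splits.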
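[CV-109] Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward
[Quick read]: This paper addresses how multimodal generative AI and autoregressive large language models (LLMs) can be used to understand and generate human motion. The core challenge is to precisely control the generation of complex, human-like motion sequences through textual descriptions. The key to the solution is to explore multiple generative approaches (autoregressive models, diffusion models, generative adversarial networks, variational autoencoders, and Transformer-based models) and to combine them with LLMs to improve the semantic alignment between instructions and motion, enhancing the coherence, contextual relevance, and realism of the generated motion.
Link: https://arxiv.org/abs/2506.03191
Authors: Muhammad Islam,Tao Huang,Euijoon Ahn,Usman Naseem
Affiliations: James Cook University; Macquarie University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: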
Abstract:This paper presents an in-depth survey on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation, offering insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. Focusing exclusively on text and motion modalities, this research investigates how textual descriptions can guide the generation of complex, human-like motion sequences. The paper explores various generative approaches, including autoregressive models, diffusion models, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and transformer-based models, by analyzing their strengths and limitations in terms of motion quality, computational efficiency, and adaptability. It highlights recent advances in text-conditioned motion generation, where textual inputs are used to control and refine motion outputs with greater precision. The integration of LLMs further enhances these models by enabling semantic alignment between instructions and motion, improving coherence and contextual relevance. This systematic survey underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies, while addressing ongoing challenges in generating efficient and realistic human motion.
[CV-110] MINT: Memory-Infused Prompt Tuning at Test-time for CLIP
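[Quick read]: This paper addresses the limited generalization of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts. Existing Test-Time Adaptation (TTA) methods do not fully exploit the model's internal knowledge, especially when dynamically adapting to complex and hierarchical visual semantic information. The proposed solution is Memory-Infused Prompt Tuning (MINT). Its key component is a Memory Prompt Bank (MPB) that stores learnable key-value prompt pairs, which act as a memory of previously seen samples. At test time, the hierarchical visual features of the test image are used to retrieve relevant prompt pairs from the MPB and dynamically assemble associative prompts, which are injected into the image encoder to provide fine-grained, customized visual contextual guidance. This enables fast, precise VLM adaptation without source data or retraining.
Link: https://arxiv.org/abs/2506.03190
Authors: Jiaming Yi,Ruirui Pan,Jishen Yang,Xiulong Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures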
Abstract:Improving the generalization ability of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts remains a critical challenge. The existing Test-Time Adaptation (TTA) methods fall short in fully leveraging the model’s internal knowledge, particularly in dynamically adapting to complex and hierarchical visual semantic information. In this paper, we propose Memory-Infused Prompt Tuning (MINT), a novel framework to address this issue. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank (MPB), which stores learnable key-value prompt pairs that work as a memory of previously seen samples. During the test time, relevant prompt pairs in the MPB are retrieved by the hierarchical visual features of test images to dynamically assemble Associative Prompts. The associative prompts are then injected into the image encoder for fine-grained, customized visual contextual guidance. MINT also utilizes learnable text prompts. MINT thus enables rapid, precise VLM adaptation at test time by leveraging this MPB-acquired memory, without source data or retraining. The code is available at this https URL.
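The retrieval step described above (query a bank of key-value prompt pairs with image features, then assemble a prompt from the best matches) can be sketched as follows. The abstract does not specify the similarity measure, the number of retrieved pairs, or how values are combined, so cosine similarity, `top_k`, and softmax weighting are all assumptions, and the names `cosine` and `assemble_prompt` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assemble_prompt(query, bank, top_k=2):
    """Blend the values of the top_k bank entries whose keys best match the query.

    bank is a list of (key_vector, value_vector) pairs. Softmax weighting over
    similarity scores is an illustrative assumption.
    """
    scored = sorted(
        ((cosine(query, key), value) for key, value in bank),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    top_score = max(score for score, _ in scored)
    weights = [math.exp(score - top_score) for score, _ in scored]
    total = sum(weights)
    dim = len(scored[0][1])
    return [
        sum(w / total * value[d] for w, (_, value) in zip(weights, scored))
        for d in range(dim)
    ]


bank = [([1, 0], [1, 0, 0]), ([0, 1], [0, 1, 0]), ([-1, 0], [0, 0, 1])]
prompt = assemble_prompt([1, 0], bank, top_k=2)
```

In a real system the keys and values would be learnable tensors and the assembled prompt would be injected into the image encoder; here plain lists keep the retrieval logic visible.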
[CV-111] Continual Learning in Vision-Language Models via Aligned Model Merging
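[Quick read]: This paper addresses catastrophic forgetting caused by sequential fine-tuning in continual learning: the process adapts to new tasks but favors plasticity at the expense of the stability needed to retain prior knowledge. The key to the solution is model merging, in which newly trained task parameters are merged with previously learned ones, improving stability while retaining plasticity. To make merging more effective, the authors propose a simple mechanism that encourages new weights to stay aligned with previous ones, avoiding interference during merging.
Link: https://arxiv.org/abs/2506.03189
Authors: Ghada Sokar,Gintare Karolina Dziugaite,Anurag Arnab,Ahmet Iscen,Pablo Samuel Castro,Cordelia Schmid
Affiliations: Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: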
Abstract:Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.
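The contrast between overwriting weights and merging them can be made concrete with a small sketch. The abstract does not state the merging rule or the alignment mechanism, so linear interpolation with a coefficient `alpha` is an assumption here, and `merge_parameters` is a hypothetical helper rather than the authors' method.

```python
def merge_parameters(previous, new, alpha=0.5):
    """Interpolate between previously learned and newly trained parameters.

    previous and new map parameter names to flat lists of floats.
    alpha=1.0 reproduces plain sequential fine-tuning (keep only the new weights);
    alpha=0.0 keeps the old weights unchanged. Linear interpolation is an
    illustrative assumption.
    """
    if previous.keys() != new.keys():
        raise ValueError("parameter sets must have the same names")
    return {
        name: [(1 - alpha) * p + alpha * n for p, n in zip(previous[name], new[name])]
        for name in previous
    }


previous = {"w": [0.0, 2.0]}
new_task = {"w": [1.0, 4.0]}
merged = merge_parameters(previous, new_task, alpha=0.5)
```

Choosing `alpha` below 1.0 is what trades some plasticity on the newest task for stability on earlier ones.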
[CV-112] Impact of Tuning Parameters in Deep Convolutional Neural Network Using a Crack Image Dataset
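[Quick read]: This paper examines how parameter tuning affects the performance of a deep convolutional neural network (DCNN) classifier. The key to the approach is an experimental analysis of how different tuning parameters (pooling method, activation function, and optimizer) affect classification performance. Using a crack image dataset with two classes, the study finds that max pooling gives the DCNN its best performance when combined with the Adam optimizer and the tanh activation function.
Link: https://arxiv.org/abs/2506.03184
Authors: Mahe Zabin,Ho-Jin Choi,Md. Monirul Islam,Jia Uddin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, published at Proceedings of the 15th KIPS International Conference on Ubiquitous Information Technologies and Applications (CUTE 2021), Jeju, Republic of Korea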
Abstract:The performance of a classifier depends on the tuning of its parameters. In this paper, we have experimented the impact of various tuning parameters on the performance of a deep convolutional neural network (DCNN). In the experimental evaluation, we have considered a DCNN classifier that consists of 2 convolutional layers (CL), 2 pooling layers (PL), 1 dropout, and a dense layer. To observe the impact of pooling, activation function, and optimizer tuning parameters, we utilized a crack image dataset having two classes: negative and positive. The experimental results demonstrate that with the maxpooling, the DCNN demonstrates its better performance for adam optimizer and tanh activation function.
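[CV-113] TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models
[Quick read]: This paper addresses the fact that insect species discovery relies on manual, slow methods constrained by taxonomic expertise, which hinders timely conservation action. The key to the solution is the TerraIncognita benchmark, which mixes expert-annotated images of known insect species with images of rare or poorly studied species. It evaluates advanced multimodal models on identifying unknown or potentially undescribed insect species, in particular their ability to perform hierarchical taxonomic classification, detect out-of-distribution samples, and generate explanations aligned with expert taxonomic knowledge.
Link: https://arxiv.org/abs/2506.03182
Authors: Shivani Chiranjeevi,Hossein Zaremehrjerdi,Zi K. Deng,Talukder Z. Jubery,Ari Grele,Arti Singh,Asheesh K Singh,Soumik Sarkar,Nirav Merchant,Harold F. Greeney,Baskar Ganapathysubramanian,Chinmay Hegde
Affiliations: Iowa State University; University of Arizona; University of Nevada, Reno; Yanayacu Biological Station and Center for Creative Studies; New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: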
Abstract:The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models’ proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90% F1 at the Order level on known species, but drop below 2% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order → Family → Genus → Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available at this https URL.
[CV-114] Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application
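[Quick read]: This paper addresses incomplete and insufficiently standardized metadata in digitized cultural heritage collections, which limits searchability and potential connections across collections. The key to the solution is to integrate computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs.
Link: https://arxiv.org/abs/2506.03180
Authors: Jan Ignatowicz,Krzysztof Kutt,Grzegorz J. Nalepa
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
Comments: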
Abstract:Digitizing cultural heritage collections has become crucial for preservation of historical artifacts and enhancing their availability to the wider public. Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. Those collections are often enriched with metadata describing items but not exactly their contents. The Jagiellonian Digital Library, standing as a good example of such an effort, offers datasets accessible through protocols like OAI-PMH. Despite these improvements, metadata completeness and standardization continue to pose substantial obstacles, limiting the searchability and potential connections between collections. To deal with these challenges, we explore an integrated methodology of computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
[CV-115] Vid-SME: Membership Inference Attacks against Large Video Understanding Models
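[Quick read]: This paper addresses the data privacy risk that sensitive video content may be included in the training data of Multimodal Large Language Models (MLLMs), and specifically how to identify videos that were improperly used in training. Existing Membership Inference Attacks (MIAs) for text and image data do not generalize well to video, mainly because they fail to capture the inherent temporal variations of video frames and do not account for how model behavior changes with the number of frames. The key to the proposed solution, Vid-SME, is to use the confidence of the model output together with adaptive parameterization to compute Sharma-Mittal entropy (SME), and to derive robust membership scores from the SME difference between natural and temporally-reversed video frames, which are then used to decide whether a given video belongs to the model's training set.
Link: https://arxiv.org/abs/2506.03179
Authors: Qi Li,Runpeng Yu,Xinchao Wang
Affiliations: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: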
Abstract:Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining improperly used videos during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of model output and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model’s training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
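As a rough illustration of the scoring idea, the sketch below computes the Sharma-Mittal entropy of a probability distribution, S_{q,r}(p) = ((Σ_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r), and compares its mean over natural-order and reversed-order inputs. The fixed values of `q` and `r`, the sign convention of the score, and the function names are assumptions; the paper's adaptive parameterization and decision threshold are not reproduced here.

```python
def sharma_mittal_entropy(probs, q, r):
    """Sharma-Mittal entropy of a discrete distribution for q != 1 and r != 1."""
    if q == 1 or r == 1:
        raise ValueError("this sketch covers q != 1 and r != 1 only")
    power_sum = sum(p ** q for p in probs if p > 0)
    return (power_sum ** ((1 - r) / (1 - q)) - 1) / (1 - r)


def membership_score(natural_dists, reversed_dists, q=2.0, r=0.5):
    """Mean SME on reversed-frame outputs minus mean SME on natural-frame outputs.

    Each argument is a list of per-token probability distributions produced by
    the model. The sign convention and fixed q, r are illustrative assumptions.
    """
    natural = sum(sharma_mittal_entropy(d, q, r) for d in natural_dists) / len(natural_dists)
    reverse = sum(sharma_mittal_entropy(d, q, r) for d in reversed_dists) / len(reversed_dists)
    return reverse - natural


# Toy example: a confident output on natural order, a uniform output on reversed order.
score = membership_score([[1.0, 0.0, 0.0, 0.0]], [[0.25, 0.25, 0.25, 0.25]])
```

A one-hot distribution has zero entropy under this definition, so a large gap between the two orderings indicates that the model is much more confident on the natural frame order.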
[CV-116] Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
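[Quick read]: This paper addresses the shortcomings of existing methods in comprehensively analyzing full-body human activity; in particular, models built on first-person data cannot provide a detailed, multidimensional understanding of human behavior. The key to the solution is AURA-MFM, a multimodal foundation model that integrates four modalities (third-person video, motion capture, inertial measurement unit (IMU) data, and text) and uses a Transformer-based IMU encoder to improve performance, enabling a more complete and precise understanding of human activity.
Link: https://arxiv.org/abs/2506.03174
Authors: Koki Matsuishi,Kosuke Ukita,Tsuyoshi Okita
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 8 figures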
Abstract:In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using IMU. While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis, in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short in providing a detailed analysis of full-body human activity. To address this limitation, we propose Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model’s overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in the zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.
[CV-117] FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution
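[Quick read]: This paper addresses how to build physical intelligence into next-generation world models, that is, anticipating and shaping the world from partial, multisensory observations. The key to the solution is FOLIAGE, a physics-informed multimodal world model. A unified context encoder maps images, mesh connectivity, and point clouds to a shared latent state, and a physics-aware predictor produces a Modality-Agnostic Growth Embedding (MAGE) aligned with the target latent state. FOLIAGE also uses an Accretive Graph Network (AGN) to capture dynamic connectivity, and Geometry-Correspondence Fusion and Cross-Patch Masking to enhance MAGE's expressiveness, enabling effective modeling and prediction of complex surface growth.
Link: https://arxiv.org/abs/2506.03173
Authors: Xiaoyi Liu,Hao Tang
Affiliations: Brown University; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: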
Abstract:Physical intelligence – anticipating and shaping the world from partial, multisensory observations – is critical for next-generation world models. We propose FOLIAGE, a physics-informed multimodal world model for unbounded accretive surface growth. In its Action-Perception loop, a unified context encoder maps images, mesh connectivity, and point clouds to a shared latent state. A physics-aware predictor, conditioned on physical control actions, advances this latent state in time to align with the target latent of the surface, yielding a Modality-Agnostic Growth Embedding (MAGE) that interfaces with critic heads for downstream objectives. FOLIAGE’s Accretive Graph Network (AGN) captures dynamic connectivity through Age Positional Encoding and Energy-Gated Message-Passing. Geometry-Correspondence Fusion and Cross-Patch Masking enhance MAGE’s expressiveness, while Hierarchical Pooling balances global context with local dynamics. We create SURF-GARDEN, a world model learning platform comprising a Counterfactual Physics Simulator, a Multimodal Correspondence Extractor, and Evolution Tracing, which generates 7,200 diverse surface-growth sequences. SURF-BENCH, our physical-intelligence evaluation suite, evaluates six core tasks – topology recognition, inverse material estimation, growth-stage classification, latent roll-out, cross-modal retrieval, and dense correspondence – and four stress tests – sensor dropout, zero-shot modality transfer, long-horizon prediction, and physics ablation – to probe resilience. FOLIAGE outperforms specialized baselines while remaining robust across dynamic environments, establishing a new world-model based, multimodal pathway to physical intelligence.
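[CV-118] EdgeVidSum: Real-Time Personalized Video Summarization at the Edge
[Quick read]: This paper tackles computational efficiency, personalization, and privacy when generating personalized fast-forward summaries of long videos in real time on edge devices. The key is to combine thumbnail-based techniques with efficient neural architectures: thumbnail containers substantially reduce computational complexity while preserving semantic relevance, and a lightweight 2D CNN identifies user-preferred content from the thumbnails and generates the timestamps from which the fast-forward summary is built.
Link: https://arxiv.org/abs/2506.03171
Authors: Ghulam Mujtaba, Eun-Seok Ryu
Affiliations: Regis University, Denver, CO, USA; Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: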
Abstract:EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The proposed approach enables real-time video summarization while safeguarding user privacy through local data processing using innovative thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system’s ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation occurs seamlessly on resource-constrained devices like Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.
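The abstract does not say how per-thumbnail predictions become timestamps, so the following is only a minimal sketch of one plausible post-processing step. It assumes a classifier has already produced one preference score per thumbnail; the function name `summary_segments`, the `threshold`, and the fixed `seconds_per_thumbnail` spacing are all hypothetical and not taken from the paper.

```python
def summary_segments(thumbnail_scores, seconds_per_thumbnail, threshold=0.5):
    """Merge consecutive above-threshold thumbnails into (start, end) segments in seconds.

    Hypothetical helper: scores are assumed to come from some per-thumbnail classifier.
    """
    segments = []
    start = None  # index of the first thumbnail in the currently open segment
    for index, score in enumerate(thumbnail_scores):
        if score >= threshold and start is None:
            start = index
        elif score < threshold and start is not None:
            segments.append((start * seconds_per_thumbnail, index * seconds_per_thumbnail))
            start = None
    if start is not None:
        # Close a segment that runs to the end of the video.
        segments.append((start * seconds_per_thumbnail,
                         len(thumbnail_scores) * seconds_per_thumbnail))
    return segments


scores = [0.1, 0.8, 0.9, 0.2, 0.7]
print(summary_segments(scores, seconds_per_thumbnail=10))  # [(10, 30), (40, 50)]
```

Working on thumbnail indices rather than decoded frames is what keeps a step like this cheap; the segment boundaries are only as precise as the thumbnail spacing.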
[CV-119] PALADIN: Robust Neural Fingerprinting for Text-to-Image Diffusion Models
[Quick read]: This paper addresses the risk of malicious use of text-to-image generative models, specifically the trade-off between attribution accuracy and generation quality when models are attributed via neural fingerprinting. Existing methods have not reached 100% attribution accuracy, which the authors argue makes them impractical to deploy. The key idea is to use cyclic error-correcting codes from coding theory to improve the accuracy of neural fingerprinting.
Link: https://arxiv.org/abs/2506.03170
Authors: Murthy L, Subarna Tripathi
Affiliations: Intel Corporation; Indian Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The risk of misusing text-to-image generative models for malicious purposes, especially given the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. A plethora of recent work aims to address neural fingerprinting, and the trade-off between attribution accuracy and generation quality of such models has been studied extensively. None of the existing methods has yet achieved 100% attribution accuracy. However, any model with less than perfect accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models leveraging the concepts of cyclic error correcting codes from the literature of coding theory.
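To illustrate why an error-correcting code can help attribution, here is a minimal sketch of a textbook cyclic code, the (7,4) Hamming code with generator polynomial g(x) = x^3 + x + 1. It is not the code construction used in the paper, which the abstract does not specify; the idea shown is only that a fingerprint encoded with redundancy can be recovered exactly even if one extracted bit is wrong.

```python
G = 0b1011   # generator polynomial g(x) = x^3 + x + 1, bits are GF(2) coefficients
N, K = 7, 4  # codeword length and message length


def poly_mod(value, divisor=G):
    """Remainder of polynomial division over GF(2)."""
    divisor_len = divisor.bit_length()
    while value.bit_length() >= divisor_len:
        value ^= divisor << (value.bit_length() - divisor_len)
    return value


def encode(message):
    """Systematic encoding: 4 message bits followed by 3 parity bits."""
    shifted = message << (N - K)
    return shifted | poly_mod(shifted)


# Each single-bit error pattern produces a distinct nonzero syndrome.
SYNDROME_TO_ERROR = {poly_mod(1 << i): 1 << i for i in range(N)}


def decode(received):
    """Correct up to one flipped bit, then return the 4 message bits."""
    syndrome = poly_mod(received)
    if syndrome:
        received ^= SYNDROME_TO_ERROR[syndrome]
    return received >> (N - K)


def cyclic_shift(codeword):
    """Rotate a 7-bit word left by one position."""
    return ((codeword << 1) & 0b1111111) | (codeword >> (N - 1))


fingerprint = 0b1010
noisy = encode(fingerprint) ^ (1 << 5)  # one extracted bit is wrong
print(decode(noisy) == fingerprint)     # True
```

The defining property of a cyclic code is that `cyclic_shift` of any codeword is again a codeword, which is what allows the compact polynomial-division encoder and decoder above.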
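[CV-120] Improvement of human health lifespan with hybrid group pose estimation methods
[Quick read]: This paper addresses the insufficient accuracy and real-time performance of human pose estimation in real-world applications, with the aim of improving human health monitoring. The key is a hybrid-ensemble-based group pose estimation method that fuses modified group pose estimation with modified real-time pose estimation to detect multi-person poses in real time. A pose transformation step identifies relevant features, and the customized pre-trained ensemble is trained on public benchmark datasets, which improves robustness and dense regression accuracy.
Link: https://arxiv.org/abs/2506.03169
Authors: Arindam Chaudhuri
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: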
Abstract:Human beings rely heavily on estimation of poses in order to access their body movements. Human pose estimation methods take advantage of computer vision advances in order to track human body movements in real life applications. This comes from videos which are recorded through available devices. These paradigms provide potential to make human movement measurement more accessible to users. The consumers of pose estimation movements believe that human poses content tend to supplement available videos. This has increased pose estimation software usage to estimate human poses. In order to address this problem, we develop hybrid-ensemble-based group pose estimation method to improve human health. This proposed hybrid-ensemble-based group pose estimation method aims to detect multi-person poses using modified group pose estimation and modified real time pose estimation. This ensemble allows fusion of performance of stated methods in real time. The input poses from images are fed into individual methods. The pose transformation method helps to identify relevant features for ensemble to perform training effectively. After this, customized pre-trained hybrid ensemble is trained on public benchmarked datasets which is being evaluated through test datasets. The effectiveness and viability of proposed method is established based on comparative analysis of group pose estimation methods and experiments conducted on benchmarked datasets. It provides best optimized results in real-time pose estimation. It makes pose estimation method more robust to occlusion and improves dense regression accuracy. These results have affirmed potential application of this method in several real-time situations with improvement in human health life span.
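[CV-121] Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs
[Quick read]: This paper addresses the problems that traditional agricultural Internet of Things (IoT) systems face amid global population growth and climate change: inefficiency, excessive reliance on agricultural expert knowledge, difficulty fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. The key is the Farm-LightSeek framework, which integrates large language models (LLMs) with edge computing, performs cross-modal reasoning and disease detection at edge nodes, and adopts a lightweight LLM deployment strategy that balances performance and efficiency, advancing intelligent real-time agricultural solutions.
Link: https://arxiv.org/abs/2506.03168
Authors: Dawen Jiang, Zhishu Shen, Qiushi Zheng, Tiehua Zhang, Wei Xiang, Jiong Jin
Affiliations: Wuhan University of Technology; Swinburne University of Technology; Tongji University; La Trobe University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE Internet of Things Magazine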
Abstract:Amid the challenges posed by global population growth and climate change, traditional agricultural Internet of Things (IoT) systems are currently undergoing a significant digital transformation to facilitate efficient big data processing. While smart agriculture utilizes artificial intelligence (AI) technologies to enable precise control, it still encounters significant challenges, including excessive reliance on agricultural expert knowledge, difficulties in fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. Large language models (LLMs), with their exceptional capabilities in knowledge acquisition and semantic understanding, provide a promising solution to address these challenges. To this end, we propose Farm-LightSeek, an edge-centric multimodal agricultural IoT data analytics framework that integrates LLMs with edge computing. This framework collects real-time farmland multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, conducts low-latency management decisions, and enables cloud collaboration for model updates. The main innovations of Farm-LightSeek include: (1) an agricultural “perception-decision-action” closed-loop architecture; (2) cross-modal adaptive monitoring; and (3) a lightweight LLM deployment strategy balancing performance and efficiency. Experiments conducted on two real-world datasets demonstrate that Farm-LightSeek consistently achieves reliable performance in mission-critical tasks, even under the limitations of edge computing resources. This work advances intelligent real-time agricultural solutions and highlights the potential for deeper integration of agricultural IoT with LLMs.
[CV-122] Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
[Quick read]: This paper addresses automated violence detection in video, where Convolutional Neural Networks (CNNs) and Transformers struggle with long-term dependencies and computational efficiency. The key is Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), which combines a dual-branch design with a state-space model (SSM) backbone: one branch captures spatial features, the other focuses on temporal dynamics, and a gating mechanism fuses them continuously, balancing accuracy and computational efficiency.
Link: https://arxiv.org/abs/2506.03162
Authors: Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics, with continuous fusion via a gating mechanism. We also present a new benchmark by merging RWF-2000, RLVS, and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
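The abstract names a gating mechanism that fuses the two branches but does not give its form. A common pattern, shown here purely as an assumption, is a per-dimension sigmoid gate computed from both class tokens that interpolates between them. The function `gated_fusion` and its parameters `gate_weights` and `gate_bias` are hypothetical and are not the paper's GCTF definition.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(spatial_token, temporal_token, gate_weights, gate_bias):
    """Fuse two class tokens with a learned per-dimension gate (illustrative form only).

    gate_weights[d] holds one weight per entry of the concatenated tokens.
    """
    concatenated = spatial_token + temporal_token
    fused = []
    for d, (s, t) in enumerate(zip(spatial_token, temporal_token)):
        logit = sum(w * x for w, x in zip(gate_weights[d], concatenated)) + gate_bias[d]
        gate = sigmoid(logit)
        # gate near 1 keeps the spatial feature, gate near 0 keeps the temporal one
        fused.append(gate * s + (1.0 - gate) * t)
    return fused


spatial = [1.0, 0.0]
temporal = [0.0, 1.0]
zero_weights = [[0.0] * 4, [0.0] * 4]
print(gated_fusion(spatial, temporal, zero_weights, [0.0, 0.0]))  # [0.5, 0.5]
```

With all gate parameters at zero the gate is 0.5 everywhere, so the output is the plain average of the two tokens; training would move the gate away from this neutral point.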
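[CV-123] DUAL: Dynamic Uncertainty-Aware Learning
[Quick read]: This paper addresses the feature uncertainty that deep learning models encounter across learning scenarios, which significantly degrades performance and reliability, especially in multi-modal settings where models must integrate information from different sources with inherent uncertainties. The proposed solution is Dynamic Uncertainty-Aware Learning (DUAL), built on three innovations: dynamic feature uncertainty modeling, which continuously refines uncertainty estimates by jointly considering feature characteristics and learning dynamics; adaptive distribution-aware modulation, which keeps feature distributions balanced by dynamically adjusting sample influence; and uncertainty-aware cross-modal relationship learning, which explicitly models uncertainty in cross-modal interactions.
Link: https://arxiv.org/abs/2506.03158
Authors: Jiahao Qin, Bei Peng, Feng Liu, Guangliang Cheng, Lu Zong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 3 figures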
Abstract:Deep learning models frequently encounter feature uncertainty in diverse learning scenarios, significantly impacting their performance and reliability. This challenge is particularly complex in multi-modal scenarios, where models must integrate information from different sources with inherent uncertainties. We propose Dynamic Uncertainty-Aware Learning (DUAL), a unified framework that effectively handles feature uncertainty in both single-modal and multi-modal scenarios. DUAL introduces three key innovations: Dynamic Feature Uncertainty Modeling, which continuously refines uncertainty estimates through joint consideration of feature characteristics and learning dynamics; Adaptive Distribution-Aware Modulation, which maintains balanced feature distributions through dynamic sample influence adjustment; and Uncertainty-aware Cross-Modal Relationship Learning, which explicitly models uncertainties in cross-modal interactions. Through extensive experiments, we demonstrate DUAL’s effectiveness across multiple domains: in computer vision tasks, it achieves substantial improvements of 7.1% accuracy on CIFAR-10, 6.5% accuracy on CIFAR-100, and 2.3% accuracy on Tiny-ImageNet; in multi-modal learning, it demonstrates consistent gains of 4.1% accuracy on CMU-MOSEI and 2.8% accuracy on CMU-MOSI for sentiment analysis, while achieving 1.4% accuracy improvements on MISR. The code will be available on GitHub soon.
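[CV-124] Hierarchical Relational Learning for Few-Shot Knowledge Graph Completion ICLR2023
[Quick read]: This paper addresses the incompleteness of knowledge graphs (KGs) and the long-tail distribution of their relations, in particular how to predict triplets involving novel relations when only a few training triplets are available. The key is a hierarchical relational learning method (HiRe) that jointly captures relational information at three levels (entity, triplet, and context) to learn and refine meta representations of few-shot relations, improving generalization to unseen relations.
Link: https://arxiv.org/abs/2209.01205
Authors: Han Wu, Jie Yin, Bala Rajaratnam, Jianyuan Guo
Affiliations: The University of Sydney; University of California, Davis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ICLR 2023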
Abstract:Knowledge graphs (KGs) are powerful in terms of their inference abilities, but are also notorious for their incompleteness and long-tail distribution of relations. To address these challenges and expand the coverage of KGs, few-shot KG completion aims to make predictions for triplets involving novel relations when only a few training triplets are provided as reference. Previous methods have focused on designing local neighbor aggregators to learn entity-level information and/or imposing a potentially invalid sequential dependency assumption at the triplet level to learn meta relation information. However, pairwise triplet-level interactions and context-level relational information have been largely overlooked for learning meta representations of few-shot relations. In this paper, we propose a hierarchical relational learning method (HiRe) for few-shot KG completion. By jointly capturing three levels of relational information (entity-level, triplet-level and context-level), HiRe can effectively learn and refine meta representations of few-shot relations, and thus generalize well to new unseen relations. Extensive experiments on benchmark datasets validate the superiority of HiRe over state-of-the-art methods. The code can be found in this https URL.
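[CV-125] Recent Advances in Medical Image Classification
[Quick read]: This paper reviews medical image classification for diagnosis and treatment, focusing on how artificial intelligence can improve classification accuracy and interpretability. The key is combining traditional methods with deep learning models such as Convolutional Neural Networks and Vision Transformers, together with recent Vision Language Models, to cope with limited labeled data, and using Explainable Artificial Intelligence to enhance and explain predictions.
Link: https://arxiv.org/abs/2506.04129
Authors: Loan Dao, Ngoc Quoc Ly
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: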
Abstract:Medical image classification is crucial for diagnosis and treatment, benefiting significantly from advancements in artificial intelligence. The paper reviews recent progress in the field, focusing on three levels of solutions: basic, specific, and applied. It highlights advances in traditional methods using deep learning models like Convolutional Neural Networks and Vision Transformers, as well as state-of-the-art approaches with Vision Language Models. These models tackle the issue of limited labeled data, and enhance and explain predictive results through Explainable Artificial Intelligence.
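[CV-126] A Comprehensive Study on Medical Image Segmentation using Deep Neural Networks
[Quick read]: This paper addresses the efficiency and transparency of Deep Neural Networks (DNNs) in Medical Image Segmentation (MIS), particularly for improving disease diagnosis and early detection. It surveys current MIS solutions at each of the Data, Information, Knowledge, Intelligence, and Wisdom (DIKIW) levels and stresses the role of Explainable Artificial Intelligence (XAI) in opening up the DNN "black box" to meet transparency and ethics requirements. The key is combining XAI with early prediction to move from "intelligence" toward "wisdom", improving the reliability and clinical usefulness of MIS systems.
Link: https://arxiv.org/abs/2506.04121
Authors: Loan Dao, Ngoc Quoc Ly
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: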
Abstract:Over the past decade, Medical Image Segmentation (MIS) using Deep Neural Networks (DNNs) has achieved significant performance improvements and holds great promise for future developments. This paper presents a comprehensive study on MIS based on DNNs. Intelligent Vision Systems are often evaluated based on their output levels, such as Data, Information, Knowledge, Intelligence, and Wisdom (DIKIW), and the state-of-the-art solutions in MIS at these levels are the focus of research. Additionally, Explainable Artificial Intelligence (XAI) has become an important research direction, as it aims to uncover the “black box” nature of previous DNN architectures to meet the requirements of transparency and ethics. The study emphasizes the importance of MIS in disease diagnosis and early detection, particularly for increasing the survival rate of cancer patients through timely diagnosis. XAI and early prediction are considered two important steps in the journey from “intelligence” to “wisdom.” Additionally, the paper addresses existing challenges and proposes potential solutions to enhance the efficiency of implementing DNN-based MIS.
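[CV-127] A Diffusion-Driven Temporal Super-Resolution and Spatial Consistency Enhancement Framework for 4D MRI imaging
[Quick read]: This paper addresses the spatial-temporal resolution trade-off in 4D MRI for dynamic 3D visualization, where prolonged scan times compromise temporal fidelity, especially under rapid, large-amplitude motion. Traditional methods generate intermediate frames by registration-based interpolation, which under large deformations leads to misregistration, artifacts, and reduced spatial consistency. The proposed TSSC-Net uses a diffusion-based temporal super-resolution network that takes the start and end frames as key references to generate intermediate frames, achieving 6x temporal super-resolution in a single inference step. It also introduces a tri-directional Mamba-based module that uses long-range context to resolve spatial inconsistencies caused by cross-slice misalignment, improving volumetric coherence and correcting cross-slice errors.
Link: https://arxiv.org/abs/2506.04116
Authors: Xuanru Zhou, Jiarun Liu, Shoujun Yu, Hao Yang, Cheng Li, Tao Tan, Shanshan Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: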
Abstract:In medical imaging, 4D MRI enables dynamic 3D visualization, yet the trade-off between spatial and temporal resolution requires prolonged scan time that can compromise temporal fidelity, especially during rapid, large-amplitude motion. Traditional approaches typically rely on registration-based interpolation to generate intermediate frames. However, these methods struggle with large deformations, resulting in misregistration, artifacts, and diminished spatial consistency. To address these challenges, we propose TSSC-Net, a novel framework that generates intermediate frames while preserving spatial consistency. To improve temporal fidelity under fast motion, our diffusion-based temporal super-resolution network generates intermediate frames using the start and end frames as key references, achieving 6x temporal super-resolution in a single inference step. Additionally, we introduce a novel tri-directional Mamba-based module that leverages long-range contextual information to effectively resolve spatial inconsistencies arising from cross-slice misalignment, thereby enhancing volumetric coherence and correcting cross-slice errors. Extensive experiments were performed on the public ACDC cardiac MRI dataset and a real-world dynamic 4D knee joint dataset. The results demonstrate that TSSC-Net can generate high-resolution dynamic MRI from fast-motion data while preserving structural fidelity and spatial consistency.
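[CV-128] Towards generating more interpretable counterfactuals via concept vectors: a preliminary study on chest X-rays
[Quick read]: This paper addresses alignment with clinical knowledge and interpretability when deploying medical imaging models. The key is mapping clinical concepts into the latent space of a generative model to identify Concept Activation Vectors (CAVs). Using a simple reconstruction autoencoder, the method links user-defined concepts to image-level features without explicit label training, which enables visual explanations of clinically relevant features and produces counterfactuals by traversing the latent space along concept directions.
Link: https://arxiv.org/abs/2506.04058
Authors: Bulat Maksudov, Kathleen Curran, Alessandra Mileo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: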
Abstract:An essential step in deploying medical imaging models is ensuring alignment with clinical knowledge and interpretability. We focus on mapping clinical concepts into the latent space of generative models to identify Concept Activation Vectors (CAVs). Using a simple reconstruction autoencoder, we link user-defined concepts to image-level features without explicit label training. The extracted concepts are stable across datasets, enabling visual explanations that highlight clinically relevant features. By traversing latent space along concept directions, we produce counterfactuals that exaggerate or reduce specific clinical features. Preliminary results on chest X-rays show promise for large pathologies like cardiomegaly, while smaller pathologies remain challenging due to reconstruction limits. Although not outperforming baselines, this approach offers a path toward interpretable, concept-based explanations aligned with clinical knowledge.
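The idea of traversing latent space along a concept direction can be sketched in a few lines. The version below makes a simplifying assumption: the concept direction is the normalized difference between the mean latent codes of images with and without the concept. The abstract does not state how the authors compute their CAVs, and the original CAV formulation fits a linear classifier instead, so `concept_direction` is a stand-in; the encoder and decoder that would surround it are omitted.

```python
import math


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def concept_direction(with_concept, without_concept):
    """Unit vector from the 'without' mean latent to the 'with' mean latent.

    Simplified stand-in for a CAV; assumes the two groups differ in latent space.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("concept groups have identical mean latents")
    return [x / norm for x in diff]


def traverse(latent, direction, alpha):
    """Move a latent code along the concept direction; alpha < 0 reduces the concept."""
    return [z + alpha * d for z, d in zip(latent, direction)]


direction = concept_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
print(traverse([1.0, 1.0], direction, alpha=2.0))  # [3.0, 1.0]
```

Decoding the traversed latent with the autoencoder's decoder would give the counterfactual image; positive `alpha` exaggerates the concept and negative `alpha` reduces it.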
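[CV-129] Conformal coronary calcification volume estimation with conditional coverage via histogram clustering
[Quick read]: This paper addresses the risk of over-reporting when coronary calcium scores are reported automatically after incidental detection and quantification of coronary calcification in CT scans, which could harm patient wellbeing and burden the medical system. The key is a cluster-based conditional conformal prediction framework that produces calibrated prediction intervals from trained segmentation networks without retraining, improving triage metrics and supporting more reliable risk-category prediction.
Link: https://arxiv.org/abs/2506.04030
Authors: Olivier Jaubert, Salman Mohammadi, Keith A. Goatman, Shadia S. Mikhael, Conor Bradley, Rebecca Hughes, Richard Good, John H. Hipwell, Sonia Dahdouh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE 22nd International Symposium on Biomedical Imaging (ISBI)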
Abstract:Incidental detection and quantification of coronary calcium in CT scans could lead to the early introduction of lifesaving clinical interventions. However, over-reporting could negatively affect patient wellbeing and unnecessarily burden the medical system. Therefore, careful considerations should be taken when automatically reporting coronary calcium scores. A cluster-based conditional conformal prediction framework is proposed to provide score intervals with calibrated coverage from trained segmentation networks without retraining. The proposed method was tuned and used to calibrate predictive intervals for 3D UNet models (deterministic, MCDropout and deep ensemble) reaching similar coverage with better triage metrics compared to conventional conformal prediction. Meaningful predictive intervals of calcium scores could help triage patients according to the confidence of their risk category prediction.
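A minimal sketch of the general technique, split conformal prediction calibrated separately per cluster, may help make "calibrated coverage without retraining" concrete. It assumes cluster labels are already available (the paper derives them by histogram clustering, which is not reproduced here), uses absolute residuals as the conformity score, and clips the lower bound at zero because a volume cannot be negative. All function names are hypothetical.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = max(1, math.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def calibrate_per_cluster(clusters, y_true, y_pred, alpha):
    """One residual quantile per cluster, computed on held-out calibration data."""
    residuals = defaultdict(list)
    for cluster, truth, prediction in zip(clusters, y_true, y_pred):
        residuals[cluster].append(abs(truth - prediction))
    return {c: conformal_quantile(r, alpha) for c, r in residuals.items()}


def predict_interval(prediction, cluster, quantiles):
    """Symmetric interval around the model output, clipped at zero."""
    q = quantiles[cluster]
    return (max(0.0, prediction - q), prediction + q)


quantiles = calibrate_per_cluster(
    clusters=[0, 0, 0, 1, 1, 1],
    y_true=[1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    y_pred=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    alpha=0.5,
)
print(predict_interval(5.0, cluster=0, quantiles=quantiles))  # (3.0, 7.0)
print(predict_interval(5.0, cluster=1, quantiles=quantiles))  # (0.0, 25.0)
```

Calibrating per cluster is what gives intervals of different widths for easy and hard cases; a single global quantile would give every scan the same interval width.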
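[CV-130] Dreaming up scale invariance via inverse renormalization group
[Quick read]: This paper asks how minimal neural networks can invert the renormalization group (RG) coarse-graining procedure in the two-dimensional Ising model, "generating" microscopic configurations from coarse-grained states. The task is formally impossible at the level of configurations, but a probabilistic approach lets machine learning models reconstruct scale-invariant distributions without microscopic input. The key finding is that the networks learn to generate critical configurations and reproduce the scaling of observables such as magnetic susceptibility, heat capacity, and Binder ratios, capturing not only scale invariance but also nontrivial eigenvalues of the RG transformation. Increasing network complexity, for example by adding layers, brings no significant benefit, suggesting that simple local rules suffice to encode the universality of critical phenomena.
Link: https://arxiv.org/abs/2506.04016
Authors: Adam Rançon, Ulysse Rançon, Tomislav Ivek, Ivan Balog
Affiliations: Univ. Lille, CNRS, UMR 8523 – PhLAM – Laboratoire de Physique des Lasers Atomes et Molécules, F-59000 Lille, France; CerCo UMR 5549, CNRS – Université Toulouse III, Toulouse, France; Institute of Physics, Bijenička cesta 46, HR-10001 Zagreb, Croatia
Categories: Statistical Mechanics (cond-mat.stat-mech); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: v1: 12 pages, 11 figures, 55 references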
Abstract:We explore how minimal neural networks can invert the renormalization group (RG) coarse-graining procedure in the two-dimensional Ising model, effectively “dreaming up” microscopic configurations from coarse-grained states. This task, formally impossible at the level of configurations, can be approached probabilistically, allowing machine learning models to reconstruct scale-invariant distributions without relying on microscopic input. We demonstrate that even neural networks with as few as three trainable parameters can learn to generate critical configurations, reproducing the scaling behavior of observables such as magnetic susceptibility, heat capacity, and Binder ratios. A real-space renormalization group analysis of the generated configurations confirms that the models capture not only scale invariance but also reproduce nontrivial eigenvalues of the RG transformation. Surprisingly, we find that increasing network complexity by introducing multiple layers offers no significant benefit. These findings suggest that simple local rules, akin to those generating fractal structures, are sufficient to encode the universality of critical phenomena, opening the door to efficient generative models of statistical ensembles in physics.
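To make the forward and inverse directions concrete, here is a minimal sketch assuming a majority-rule block-spin transformation on 3x3 blocks, a standard real-space RG choice that may differ from the paper's. The `naive_inverse` function is a deliberately crude stand-in for the learned networks: it copies each coarse spin to its block and flips it with a fixed probability. It shows why the inverse can only be probabilistic, and it is not expected to reproduce critical statistics.

```python
import random


def coarse_grain(spins, block=3):
    """Majority-rule block-spin transformation (odd block size, so no ties)."""
    size = len(spins)  # assumed square and divisible by `block`
    coarse = []
    for i in range(0, size, block):
        row = []
        for j in range(0, size, block):
            total = sum(spins[i + di][j + dj]
                        for di in range(block) for dj in range(block))
            row.append(1 if total > 0 else -1)
        coarse.append(row)
    return coarse


def naive_inverse(coarse, block=3, copy_prob=0.9, rng=None):
    """Sample a fine lattice: each spin copies its coarse parent with probability copy_prob.

    Hypothetical one-parameter rule, not the model from the paper.
    """
    rng = rng or random.Random(0)
    size = len(coarse) * block
    fine = []
    for i in range(size):
        row = []
        for j in range(size):
            parent = coarse[i // block][j // block]
            row.append(parent if rng.random() < copy_prob else -parent)
        fine.append(row)
    return fine


coarse = [[1, -1], [-1, 1]]
fine = naive_inverse(coarse, copy_prob=1.0)
print(coarse_grain(fine) == coarse)  # True
```

Many fine configurations coarse-grain to the same block spins, so any inverse has to sample from a conditional distribution rather than return a single answer.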
[CV-131] Identifying Alzheimer's Disease Prediction Strategies of Convolutional Neural Network Classifiers using R2* Maps and Spectral Clustering
[Quick read]: This paper addresses the opacity of deep learning models that classify Alzheimer's disease (AD) from R2* maps, and the biases that may be present in their decisions. The key is generating relevance heatmaps with Layer-wise Relevance Propagation (LRP) and applying spectral clustering to analyze classifier decision strategies across preprocessing and training configurations, revealing the models' decision patterns and improving interpretability.
Link: https://arxiv.org/abs/2506.03890
Authors: Christian Tinauer, Maximilian Sackl, Stefan Ropele, Christian Langkammer
Affiliations: Medical University of Graz
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for the conference EUSIPCO2025 ( this https URL )
Abstract:Deep learning models have shown strong performance in classifying Alzheimer’s disease (AD) from R2* maps, but their decision-making remains opaque, raising concerns about interpretability. Previous studies suggest biases in model decisions, necessitating further analysis. This study uses Layer-wise Relevance Propagation (LRP) and spectral clustering to explore classifier decision strategies across preprocessing and training configurations using R2* maps. We trained a 3D convolutional neural network on R2* maps, generating relevance heatmaps via LRP and applied spectral clustering to identify dominant patterns. t-Stochastic Neighbor Embedding (t-SNE) visualization was used to assess clustering structure. Spectral clustering revealed distinct decision patterns, with the relevance-guided model showing the clearest separation between AD and normal control (NC) cases. The t-SNE visualization confirmed that this model aligned heatmap groupings with the underlying subject groups. Our findings highlight the significant impact of preprocessing and training choices on deep learning models trained on R2* maps, even with similar performance metrics. Spectral clustering offers a structured method to identify classification strategy differences, emphasizing the importance of explainability in medical AI.
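[CV-132] Personalized MR-Informed Diffusion Models for 3D PET Image Reconstruction
[Quick read]: This paper addresses the loss of image quality in low-count PET reconstruction caused by insufficient information, and how to combine PET and MR information effectively to improve reconstruction accuracy. The key is generating subject-specific "pseudo-PET" images from multi-subject PET-MR scans by using image registration to transform between different patients' anatomy, which retains the high resolution and anatomical features of the MR scan while reducing dependence on noise in the original PET images. The approach improves low-count PET reconstruction accuracy without relying on generative deep learning or large training datasets.
Link: https://arxiv.org/abs/2506.03804
Authors: George Webber, Alexander Hammers, Andrew P. King, Andrew J. Reader
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures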
Abstract:Recent work has shown improved lesion detectability and flexibility to reconstruction hyperparameters (e.g. scanner geometry or dose level) when PET images are reconstructed by leveraging pre-trained diffusion models. Such methods train a diffusion model (without sinogram data) on high-quality, but still noisy, PET images. In this work, we propose a simple method for generating subject-specific PET images from a dataset of multi-subject PET-MR scans, synthesizing “pseudo-PET” images by transforming between different patients’ anatomy using image registration. The images we synthesize retain information from the subject’s MR scan, leading to higher resolution and the retention of anatomical features compared to the original set of PET images. With simulated and real [^18F]FDG datasets, we show that pre-training a personalized diffusion model with subject-specific “pseudo-PET” images improves reconstruction accuracy with low-count data. In particular, the method shows promise in combining information from a guidance MR scan without overly imposing anatomical features, demonstrating an improved trade-off between reconstructing PET-unique image features versus features present in both PET and MR. We believe this approach for generating and utilizing synthetic data has further applications to medical imaging tasks, particularly because patient-specific PET images can be generated without resorting to generative deep learning or large training datasets.
[CV-133] Analytical Reconstruction of Periodically Deformed Objects in Time-resolved CT
[Quick read]: This paper addresses the inefficient use of radiation dose in gating-based time-resolved CT reconstruction, which uses only the limited projections within each motion-phase collection and ignores correlations between collections. The key is two analytical reconstruction pipelines that exploit the correlation between collections, substantially lowering the radiation dose needed without sacrificing image quality.
Link: https://arxiv.org/abs/2506.03792
Authors: Qianwei Qu, Christian M. Schlepütz, Marco Stampanoni
Affiliations: Swiss Light Source, Paul Scherrer Institute; Institute for Biomedical Engineering, University and ETH Zürich
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Time-resolved CT is an advanced measurement technique that has been widely used to observe dynamic objects, including periodically varying structures such as hearts, lungs, or hearing structures. To reconstruct these objects from CT projections, a common approach is to divide the projections into several collections based on their motion phases and perform reconstruction within each collection, assuming they originate from a static object. This describes the gating-based method, which is the standard approach for time-periodic reconstruction. However, the gating-based reconstruction algorithm only utilizes a limited subset of projections within each collection and ignores the correlation between different collections, leading to inefficient use of the radiation dose. To address this issue, we propose two analytical reconstruction pipelines in this paper, and validate them with experimental data captured using tomographic synchrotron microscopy. We demonstrate that our approaches significantly reduce random noise in the reconstructed images without blurring the sharp features of the observed objects. Equivalently, our methods can achieve the same reconstruction quality as gating-based methods but with a lower radiation dose. Our code is available at this http URL.
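The gating step that the abstract describes as the standard baseline can be sketched as follows, assuming a known and constant motion period and one acquisition timestamp per projection. The function name `gate_by_phase` is hypothetical. The sketch covers only the binning; the paper's two analytical pipelines, which go beyond this baseline, are not reproduced.

```python
def gate_by_phase(timestamps, period, n_bins):
    """Group projection indices by motion phase (baseline gating, not the paper's method)."""
    bins = {b: [] for b in range(n_bins)}
    for index, t in enumerate(timestamps):
        phase = (t % period) / period              # position in the cycle, in [0, 1)
        b = min(int(phase * n_bins), n_bins - 1)   # guard against rounding up to n_bins
        bins[b].append(index)
    return bins


collections = gate_by_phase(timestamps=[0, 1, 2, 3, 4, 5, 6, 7], period=4, n_bins=4)
print(collections)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Each collection here holds only a quarter of the projections and is reconstructed on its own, which is the dose inefficiency the authors set out to address.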
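[CV-134] Hybrid Ensemble of Segmentation-Assisted Classification and GBDT for Skin Cancer Detection with Engineered Metadata and Synthetic Lesions from ISIC 2024 Non-Dermoscopic 3D-TBP Images CVPR2025
[Quick read]: This paper addresses classification for early skin cancer detection, in particular efficient and accurate triage of skin lesions in resource-constrained settings. The key is a hybrid of machine learning and deep learning: vision transformers (EVA02) and a custom convolutional-ViT hybrid (EdgeNeXtSAC) extract robust features, and a segmentation-assisted classification pipeline improves lesion localization. Their predictions are fused with a gradient-boosted decision tree (GBDT) ensemble that uses engineered features and patient-specific relational metrics. To mitigate class imbalance and improve generalization, malignant cases are augmented with synthetic lesions generated by Stable Diffusion (a diffusion model), and a diagnosis-informed relabeling strategy is applied.
Link: https://arxiv.org/abs/2506.03420
Authors: Muhammad Zubair Hasan, Fahmida Yasmin Rifat
Affiliations: University of North Texas
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Written as per the requirements of CVPR 2025. It is an 8-page paper without references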
Abstract:Skin cancer is among the most prevalent and life-threatening diseases worldwide, with early detection being critical to patient outcomes. This work presents a hybrid machine and deep learning-based approach for classifying malignant and benign skin lesions using the SLICE-3D dataset from ISIC 2024, which comprises 401,059 cropped lesion images extracted from 3D Total Body Photography (TBP), emulating non-dermoscopic, smartphone-like conditions. Our method combines vision transformers (EVA02) and our designed convolutional ViT hybrid (EdgeNeXtSAC) to extract robust features, employing a segmentation-assisted classification pipeline to enhance lesion localization. Predictions from these models are fused with a gradient-boosted decision tree (GBDT) ensemble enriched by engineered features and patient-specific relational metrics. To address class imbalance and improve generalization, we augment malignant cases with Stable Diffusion-generated synthetic lesions and apply a diagnosis-informed relabeling strategy to harmonize external datasets into a 3-class format. Using partial AUC (pAUC) above 80 percent true positive rate (TPR) as the evaluation metric, our approach achieves a pAUC of 0.1755 – the highest among all configurations. These results underscore the potential of hybrid, interpretable AI systems for skin cancer triage in telemedicine and resource-constrained settings.
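The reported pAUC of 0.1755 is easier to interpret with the metric written out. The sketch below assumes the usual reading of "partial AUC above 80% TPR": the area between the ROC curve and the horizontal line TPR = 0.8, so the best possible value is 0.2. It is a plain-Python illustration rather than the challenge's official scoring code, and it requires both classes to be present in `labels`.

```python
def roc_points(labels, scores):
    """ROC curve as (FPR, TPR) points, grouping tied scores into one step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        last_of_tie = rank + 1 == len(order) or scores[order[rank + 1]] != scores[i]
        if last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points


def pauc_above_tpr(labels, scores, min_tpr=0.8):
    """Area between the ROC curve and the line TPR = min_tpr (at most 1 - min_tpr)."""
    area = 0.0
    points = roc_points(labels, scores)
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        if t1 <= min_tpr or f1 == f0:
            continue  # segment lies below the line, or is vertical and adds no area
        if t0 < min_tpr:
            # Segment crosses the line: keep only the part above it.
            f0 = f0 + (min_tpr - t0) / (t1 - t0) * (f1 - f0)
            t0 = min_tpr
        area += ((t0 - min_tpr) + (t1 - min_tpr)) / 2 * (f1 - f0)
    return area


perfect = pauc_above_tpr([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(perfect, 6))  # 0.2
```

On this scale a classifier that ranks every malignant case above every benign one scores 0.2, which puts the paper's 0.1755 in context.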
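[CV-135] SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer INTERSPEECH2025
Summary: This paper addresses the detection of harmful content (such as violence or explicit scenes) on video platforms watched by children, particularly when malicious users embed unsafe content in only a few frames to evade moderation. The key to its solution is SNIFR, a framework that combines audio cues with visual information and aligns the two modalities: a transformer encoder handles intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment, which improves fine-grained harmful content detection.
Link: https://arxiv.org/abs/2506.03378
Authors: Orchid Chetia Phukan,Mohd Mujtaba Akhtar,Girish,Swarup Ranjan Behera,Abu Osama Siddiqui,Sarthak Jain,Priyabrata Mallick,Jaya Sai Kiran Patibandla,Pailla Balakrishna Reddy,Arun Balaji Buduru,Rajesh Sharma
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to INTERSPEECH 2025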
Abstract:As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we embed audio cues with visual for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.
[CV-136] Structural Vibration Monitoring with Diffractive Optical Processors
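Summary: This paper addresses the limitations of conventional structural health monitoring (SHM) systems in cost, power consumption, scalability, and data-processing complexity. The key to its solution is a diffractive vibration monitoring system that pairs a jointly optimized diffractive layer with a shallow neural network backend to remotely extract 3D structural vibration spectra, giving a low-power, low-cost, scalable monitoring scheme. A spatially optimized passive diffractive layer encodes 3D structural displacements into modulated light, which is captured by a small number of detectors and decoded in real time by a shallow, low-power neural network to reconstruct the structure's 3D displacement spectra.
Link: https://arxiv.org/abs/2506.03317
Authors: Yuntian Wang,Zafer Yilmaz,Yuhang Li,Edward Liu,Eric Ahlberg,Farid Ghahari,Ertugrul Taciroglu,Aydogan Ozcan
Affiliations: Unknown
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 33 Pages, 8 Figures, 1 Table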
Abstract:Structural Health Monitoring (SHM) is vital for maintaining the safety and longevity of civil infrastructure, yet current solutions remain constrained by cost, power consumption, scalability, and the complexity of data processing. Here, we present a diffractive vibration monitoring system, integrating a jointly optimized diffractive layer with a shallow neural network-based backend to remotely extract 3D structural vibration spectra, offering a low-power, cost-effective and scalable solution. This architecture eliminates the need for dense sensor arrays or extensive data acquisition; instead, it uses a spatially-optimized passive diffractive layer that encodes 3D structural displacements into modulated light, captured by a minimal number of detectors and decoded in real-time by shallow and low-power neural networks to reconstruct the 3D displacement spectra of structures. The diffractive system’s efficacy was demonstrated both numerically and experimentally using millimeter-wave illumination on a laboratory-scale building model with a programmable shake table. Our system achieves more than an order-of-magnitude improvement in accuracy over conventional optics or separately trained modules, establishing a foundation for high-throughput 3D monitoring of structures. Beyond SHM, the 3D vibration monitoring capabilities of this cost-effective and data-efficient framework establish a new computational sensing modality with potential applications in disaster resilience, aerospace diagnostics, and autonomous navigation, where energy efficiency, low latency, and high-throughput are critical.
[CV-137] Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach
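Summary: This paper addresses automated interpretation of CT images in clinical radiology, in particular localizing and describing abnormal findings across multi-plane and whole-body scans. Its solution rests on four contributions: (i) a comprehensive hierarchical taxonomy of 404 representative abnormal findings; (ii) a dataset of over 14.5K CT images with grounding annotations for over 19K abnormalities; (iii) OminiAbnorm-CT, a model that grounds and describes abnormal findings on multi-plane and whole-body CT images from text queries and supports flexible interaction through visual prompts; and (iv) three evaluation tasks based on real clinical scenarios. Extensive experiments show that OminiAbnorm-CT significantly outperforms existing methods on all tasks and metrics.
Link: https://arxiv.org/abs/2506.03238
Authors: Ziheng Zhao,Lisong Dai,Ya Zhang,Yanfeng Wang,Weidi Xie
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Wuhan University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: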
Abstract:Automated interpretation of CT images, particularly localizing and describing abnormal findings across multi-plane and whole-body scans, remains a significant challenge in clinical radiology. This work aims to address this challenge through four key contributions: (i) On taxonomy, we collaborate with senior radiologists to propose a comprehensive hierarchical classification system, with 404 representative abnormal findings across all body regions; (ii) On data, we contribute a dataset containing over 14.5K CT images from multiple planes and all human body regions, and meticulously provide grounding annotations for over 19K abnormalities, each linked to the detailed description and cast into the taxonomy; (iii) On model development, we propose OminiAbnorm-CT, which can automatically ground and describe abnormal findings on multi-plane and whole-body CT images based on text queries, while also allowing flexible interaction through visual prompts; (iv) On benchmarks, we establish three representative evaluation tasks based on real clinical scenarios. Through extensive experiments, we show that OminiAbnorm-CT can significantly outperform existing methods on all the tasks and metrics.
[CV-138] petBrain: A New Pipeline for Amyloid Tau Tangles and Neurodegeneration Quantification Using PET and MRI
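Summary: This paper addresses the challenge of quantifying amyloid plaques (A), neurofibrillary tangles (T2), and neurodegeneration (N) for the diagnosis and prognosis of Alzheimer's disease (AD), where existing pipelines are limited by processing time, variability in tracer types, and multimodal integration. The key to its solution is petBrain, an end-to-end pipeline that combines deep learning-based segmentation, standardized biomarker quantification (Centiloid, CenTauR, HAVAs), and simultaneous estimation of the A/T2/N biomarkers, deployed as a web-based platform that requires no local computational infrastructure or specialized software knowledge.
Link: https://arxiv.org/abs/2506.03217
Authors: Pierrick Coupé,Boris Mansencal,Floréal Morandat,Sergio Morell-Ortega,Nicolas Villain,Jose V. Manjón,Vincent Planche
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: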
Abstract:INTRODUCTION: Quantification of amyloid plaques (A), neurofibrillary tangles (T2), and neurodegeneration (N) using PET and MRI is critical for Alzheimer’s disease (AD) diagnosis and prognosis. Existing pipelines face limitations regarding processing time, variability in tracer types, and challenges in multimodal integration. METHODS: We developed petBrain, a novel end-to-end processing pipeline for amyloid-PET, tau-PET, and structural MRI. It leverages deep learning-based segmentation, standardized biomarker quantification (Centiloid, CenTauR, HAVAs), and simultaneous estimation of A, T2, and N biomarkers. The pipeline is implemented as a web-based platform, requiring no local computational infrastructure or specialized software knowledge. RESULTS: petBrain provides reliable and rapid biomarker quantification, with results comparable to existing pipelines for A and T2. It shows strong concordance with data processed in ADNI databases. The staging and quantification of A/T2/N by petBrain demonstrated good agreement with CSF/plasma biomarkers, clinical status, and cognitive performance. DISCUSSION: petBrain represents a powerful and openly accessible platform for standardized AD biomarker analysis, facilitating applications in clinical research.
[CV-139] A Survey of Deep Learning Video Super-Resolution
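Summary: This paper addresses a gap in video super-resolution (VSR) research: the use of methods is often not adequately explained, and design decisions are driven mainly by quantitative improvements. By systematically analysing the key components and deep learning methods used in VSR, it aims to support informed model design for specific applications. The key to its solution is a comprehensive survey of deep learning-based VSR models that explains their underlying methodologies and categorises them systematically, identifying trends, requirements, and challenges in the field and establishing a multi-level taxonomy to guide current and future research.
Link: https://arxiv.org/abs/2506.03216
Authors: Arbind Agrahari Baniya,Tsz-Kwan Lee,Peter Eklund,Sunil Aryal
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This paper has been published in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 8, no. 4, pp. 2655-2676, Aug. 2024, doi: https://doi.org/10.1109/TETCI.2024.3398015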
Abstract:Video super-resolution (VSR) is a prominent research topic in low-level computer vision, where deep learning technologies have played a significant role. The rapid progress in deep learning and its applications in VSR has led to a proliferation of tools and techniques in the literature. However, the usage of these methods is often not adequately explained, and decisions are primarily driven by quantitative improvements. Given the significance of VSR’s potential influence across multiple domains, it is imperative to conduct a comprehensive analysis of the elements and deep learning methodologies employed in VSR research. This methodical analysis will facilitate the informed development of models tailored to specific application needs. In this paper, we present an overarching overview of deep learning-based video super-resolution models, investigating each component and discussing its implications. Furthermore, we provide a synopsis of key components and technologies employed by state-of-the-art and earlier VSR models. By elucidating the underlying methodologies and categorising them systematically, we identified trends, requirements, and challenges in the domain. As a first-of-its-kind survey of deep learning-based VSR models, this work also establishes a multi-level taxonomy to guide current and future VSR research, enhancing the maturation and interpretation of VSR practices for various practical applications.
[CV-140] A combined Machine Learning and Finite Element Modelling tool for the surgical planning of craniosynostosis correction
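Summary: This paper addresses the unpredictable outcome of surgery for sagittal craniosynostosis (SC): osteotomy location and spring selection currently rely on the surgeon's experience and the infant's age, and efficient preoperative planning tools are lacking. The key to its solution is a real-time prediction tool that does not require computed tomography (CT), which reduces radiation exposure during preoperative planning. The method builds personalised synthetic skulls from 3D photographs, incorporating population-average suture location, skull thickness, and soft tissue properties, and uses a machine learning (ML) surrogate model to predict the surgical outcome. The resulting multi-output support vector regressor achieves an R² of 0.95, with mean squared error (MSE) and mean absolute error (MAE) both below 0.13.
Link: https://arxiv.org/abs/2506.03202
Authors: Itxasne Antúnez Sáenz,Ane Alberdi Aramendi,David Dunaway,Juling Ong,Lara Deliège,Amparo Sáenz,Anita Ahmadi Birjandi,Noor UI Owase Jeelani,Silvia Schievano,Alessandro Borghi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 11 pages, 16 figures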
Abstract:Craniosynostosis is a medical condition that affects the growth of babies’ heads, caused by an early fusion of cranial sutures. In recent decades, surgical treatments for craniosynostosis have significantly improved, leading to reduced invasiveness, faster recovery, and less blood loss. At Great Ormond Street Hospital (GOSH), the main surgical treatment for patients diagnosed with sagittal craniosynostosis (SC) is spring assisted cranioplasty (SAC). This procedure involves a 15x15 mm² osteotomy, where two springs are inserted to induce distraction. Despite the numerous advantages of this surgical technique for patients, the outcome remains unpredictable due to the lack of efficient preoperative planning tools. The surgeon’s experience and the baby’s age are currently relied upon to determine the osteotomy location and spring selection. Previous tools for predicting the surgical outcome of SC relied on finite element modeling (FEM), which involved computed tomography (CT) imaging and required engineering expertise and lengthy calculations. The main goal of this research is to develop a real-time prediction tool for the surgical outcome of patients, eliminating the need for CT scans to minimise radiation exposure during preoperative planning. The proposed methodology involves creating personalised synthetic skulls based on three-dimensional (3D) photographs, incorporating population average values of suture location, skull thickness, and soft tissue properties. A machine learning (ML) surrogate model is employed to achieve the desired surgical outcome. The resulting multi-output support vector regressor model achieves an R² metric of 0.95 and MSE and MAE below 0.13. Furthermore, in the future, this model could not only simulate various surgical scenarios but also provide optimal parameters for achieving a maximum cranial index (CI).
[CV-141] Encoding of Demographic and Anatomical Information in Chest X-Ray-based Severe Left Ventricular Hypertrophy Classifiers
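Summary: This paper addresses how to assess severe left ventricular hypertrophy when the clinical standards, echocardiography and MRI, are limited by cost. The key to its solution is a direct classification framework that predicts severe left ventricular hypertrophy from chest X-rays without relying on anatomical measurements or demographic inputs, together with Mutual Information Neural Estimation to quantify feature expressivity, which reveals clinically meaningful attribute encoding and supports transparent model interpretation.
Link: https://arxiv.org/abs/2506.03192
Authors: Basudha Pal,Rama Chellappa,Muhammad Umair
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: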
Abstract:While echocardiography and MRI are clinical standards for evaluating cardiac structure, their use is limited by cost and this http URL introduce a direct classification framework that predicts severe left ventricular hypertrophy from chest X-rays, without relying on anatomical measurements or demographic inputs. Our approach achieves high AUROC and AUPRC, and employs Mutual Information Neural Estimation to quantify feature expressivity. This reveals clinically meaningful attribute encoding and supports transparent model interpretation.
[CV-142] Multi-Analyte Swab-based Automated Wound Monitor with AI
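Summary: This paper addresses early identification of non-healing diabetic foot ulcers (DFUs) in order to reduce treatment costs and the risk of amputation. The key to its solution is a low-cost, multi-analyte, 3D-printed assay integrated on a swab that can identify non-healing DFUs, paired with an iOS app called Wound Sensor for controlled acquisition and automated analysis of the sensor data. Automated computer vision techniques compare density changes between images of the assay taken before and after exposure to the wound to determine wound severity, which supports real-time monitoring and assessment of key wound-care parameters.
Link: https://arxiv.org/abs/2506.03188
Authors: Madhu Babu Sikha,Lalith Appari,Gurudatt Nanjanagudu Ganesh,Amay Bandodkar,Imon Banerjee
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages conference paper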
Abstract:Diabetic foot ulcers (DFUs), a class of chronic wounds, affect ~750,000 individuals every year in the US alone, and identifying non-healing DFUs that develop into chronic wounds early can drastically reduce treatment costs and minimize risks of amputation. There is therefore a pressing need for diagnostic tools that can detect non-healing DFUs early. We develop low-cost, multi-analyte 3D-printed assays seamlessly integrated on swabs that can identify non-healing DFUs, and a Wound Sensor iOS App, an innovative mobile application developed for the controlled acquisition and automated analysis of wound sensor data. Using both the original base image (before exposure to the wound) and the wound-exposed image, we developed automated computer vision techniques to compare density changes between the two assay images, which allow us to automatically determine the severity of the wound. The iOS app ensures accurate data collection and presents actionable insights, despite challenges such as variations in camera configurations and ambient conditions. The proposed integrated sensor and iOS app will allow healthcare professionals to monitor wound conditions in real time, track healing progress, and assess critical parameters related to wound care.
[CV-143] Lightweight Convolutional Neural Networks for Retinal Disease Classification
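Summary: This paper addresses early detection of retinal diseases such as diabetic retinopathy and macular hole, to reduce vision loss and improve diagnostic efficiency. The key to its solution is the use of two lightweight, efficient convolutional neural network (CNN) architectures, MobileNet and NASNetMobile, combined with transfer learning and data augmentation to overcome data scarcity and improve generalization and classification performance. In the experiments, MobileNetV2 achieved the highest classification accuracy, 90.8%.
Link: https://arxiv.org/abs/2506.03186
Authors: Duaa Kareem Qasim,Sabah Abdulazeez Jebur,Lafta Raheem Ali,Abdul Jalil M. Khalaf,Abir Jaafar Hussain
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: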
Abstract:Retinal diseases such as Diabetic Retinopathy (DR) and Macular Hole (MH) significantly impact vision and affect millions worldwide. Early detection is crucial, as DR, a complication of diabetes, damages retinal blood vessels, potentially leading to blindness, while MH disrupts central vision, affecting tasks like reading and facial recognition. This paper employed two lightweight and efficient Convolution Neural Network architectures, MobileNet and NASNetMobile, for the classification of Normal, DR, and MH retinal images. The models were trained on the RFMiD dataset, consisting of 3,200 fundus images, after undergoing preprocessing steps such as resizing, normalization, and augmentation. To address data scarcity, this study leveraged transfer learning and data augmentation techniques, enhancing model generalization and performance. The experimental results demonstrate that MobileNetV2 achieved the highest accuracy of 90.8%, outperforming NASNetMobile, which achieved 89.5% accuracy. These findings highlight the effectiveness of CNNs in retinal disease classification, providing a foundation for AI-assisted ophthalmic diagnosis and early intervention.
[CV-144] DLiPath: A Benchmark for the Comprehensive Assessment of Donor Liver Based on Histopathological Image Dataset ACM-MM2025
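Summary: This paper addresses the difficulty pathologists face in assessing donor liver biopsies quickly and accurately during surgery. Features such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning correlate with transplant outcomes, but quantifying them suffers from substantial inter- and intra-observer variability. The key to its solution is DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset, together with an extensive comparison of multiple-instance learning (MIL) models. The experiments show that several of these models achieve high accuracy on DLiPath, laying groundwork for future automated donor liver assessment research.
Link: https://arxiv.org/abs/2506.03185
Authors: Liangrui Pan,Xingchen Li,Zhongyi Chen,Ling Chu,Shaoliang Peng
Affiliations: Hunan University; The Third Xiangya Hospital of Central South University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: Submit to ACM MM2025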
Abstract:Pathologists' comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models based on the DLiPath dataset as baselines for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at this https URL.
[CV-145] Edge Computing for Physics-Driven AI in Computational MRI: A Feasibility Study
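Summary: This paper addresses the challenges of transmitting, storing, and processing in real time the massive data volumes produced by high-resolution MRI scans, which are especially pronounced in functional MRI because of its hundreds of volumetric acquisitions. The key to its solution is a physics-driven AI (PD-AI) computational MRI approach optimized for FPGA-based edge computing devices: it uses 8-bit complex data quantization and eliminates redundant FFT/IFFT operations, improving computational efficiency while maintaining reconstruction quality comparable to conventional PD-AI methods and outperforming standard clinical methods.
Link: https://arxiv.org/abs/2506.03183
Authors: Yaşar Utku Alçalar,Yu Cao,Mehmet Akçakaya
Affiliations: University of Minnesota
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: IEEE International Conference on Future Internet of Things and Cloud (FiCloud), 2025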
Abstract:Physics-driven artificial intelligence (PD-AI) reconstruction methods have emerged as the state-of-the-art for accelerating MRI scans, enabling higher spatial and temporal resolutions. However, the high resolution of these scans generates massive data volumes, leading to challenges in transmission, storage, and real-time processing. This is particularly pronounced in functional MRI, where hundreds of volumetric acquisitions further exacerbate these demands. Edge computing with FPGAs presents a promising solution for enabling PD-AI reconstruction near the MRI sensors, reducing data transfer and storage bottlenecks. However, this requires optimization of PD-AI models for hardware efficiency through quantization and bypassing traditional FFT-based approaches, which can be a limitation due to their computational demands. In this work, we propose a novel PD-AI computational MRI approach optimized for FPGA-based edge computing devices, leveraging 8-bit complex data quantization and eliminating redundant FFT/IFFT operations. Our results show that this strategy improves computational efficiency while maintaining reconstruction quality comparable to conventional PD-AI methods, and outperforms standard clinical methods. Our approach presents an opportunity for high-resolution MRI reconstruction on resource-constrained devices, highlighting its potential for real-world deployment.
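To make "8-bit complex data quantization" concrete, here is a minimal sketch. The abstract does not describe the paper's actual scheme, so this assumes the simplest common choice: the real and imaginary parts are each rounded to a signed 8-bit integer using one shared symmetric scale. The names `quantize_complex_int8` and `dequantize` are hypothetical.

```python
def quantize_complex_int8(values):
    """Map complex samples to (real, imag) integer pairs in [-127, 127].

    Assumption: a single symmetric scale per tensor, chosen so that the
    largest real or imaginary magnitude maps to 127.
    """
    peak = max(max(abs(v.real), abs(v.imag)) for v in values)
    scale = peak / 127 if peak > 0 else 1.0
    quantized = [(round(v.real / scale), round(v.imag / scale)) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate complex samples from the integer pairs."""
    return [complex(re * scale, im * scale) for re, im in quantized]
```

With this scheme each complex sample occupies two bytes, and the rounding error per component is at most half of `scale`.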
[CV-146] Dc-EEMF: Pushing depth-of-field limit of photoacoustic microscopy via decision-level constrained learning
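Summary: This paper addresses the limited depth of field (DoF) of conventional optical-resolution photoacoustic microscopy (OR-PAM), caused by the narrow depth range over which a Gaussian beam stays focused, which prevents it from resolving sufficient detail along the depth direction. The key to its solution is a decision-level constrained end-to-end multi-focus image fusion method (Dc-EEMF): a lightweight siamese network that uses an artifact-resistant channel-wise spatial frequency as its feature fusion rule, with a U-Net-based perceptual loss that combines the advantages of spatial-domain and transform-domain methods, enabling end-to-end training and high-quality fusion without post-processing.
Link: https://arxiv.org/abs/2506.03181
Authors: Wangting Zhou,Jiangshan He,Tong Cai,Lin Wang,Zhen Yuan,Xunbin Wei,Xueli Chen
Affiliations: Xidian University; Xi’an University of Technology; University of Macau; Peking University Cancer Hospital & Institute; Peking University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: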
Abstract:Photoacoustic microscopy holds the potential to measure biomarkers’ structural and functional status without labels, which significantly aids in comprehending pathophysiological conditions in biomedical research. However, conventional optical-resolution photoacoustic microscopy (OR-PAM) is hindered by a limited depth-of-field (DoF) due to the narrow depth range focused on a Gaussian beam. Consequently, it fails to resolve sufficient details in the depth direction. Herein, we propose a decision-level constrained end-to-end multi-focus image fusion (Dc-EEMF) to push the DoF limit of PAM. The Dc-EEMF method is a lightweight siamese network that incorporates an artifact-resistant channel-wise spatial frequency as its feature fusion rule. The meticulously crafted U-Net-based perceptual loss function for decision-level focus properties in end-to-end fusion seamlessly integrates the complementary advantages of spatial domain and transform domain methods within Dc-EEMF. This approach can be trained end-to-end without necessitating post-processing procedures. Experimental results and numerical analyses collectively demonstrate our method’s robust performance, achieving an impressive fusion result for PAM images without a substantial sacrifice in lateral resolution. The utilization of Dc-EEMF-powered PAM has the potential to serve as a practical tool in preclinical and clinical studies requiring extended DoF for various applications.
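Spatial frequency is a classical sharpness measure in multi-focus fusion, and a small sketch may help show why it works as a fusion rule. The code below is an image-domain illustration only, not the paper's channel-wise feature-level rule inside a siamese network; `spatial_frequency` and `pick_sharper` are hypothetical names, and images are assumed to be plain lists of rows of grey levels.

```python
import math


def spatial_frequency(img):
    """Root-mean-square of horizontal and vertical neighbour differences."""
    rows, cols = len(img), len(img[0])
    # Row frequency term: differences between horizontal neighbours.
    rf = sum((img[i][j] - img[i][j - 1]) ** 2
             for i in range(rows) for j in range(1, cols))
    # Column frequency term: differences between vertical neighbours.
    cf = sum((img[i][j] - img[i - 1][j]) ** 2
             for i in range(1, rows) for j in range(cols))
    return math.sqrt((rf + cf) / (rows * cols))


def pick_sharper(block_a, block_b):
    """Keep whichever of two co-located blocks has more fine detail."""
    if spatial_frequency(block_a) >= spatial_frequency(block_b):
        return block_a
    return block_b
```

An in-focus region has strong local intensity changes and therefore a higher spatial frequency than the same region imaged out of focus, so selecting the higher-scoring block at each location assembles an all-in-focus result.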
[CV-147] LLaMA-XR: A Novel Framework for Radiology Report Generation using LLaMA and QLoRA Fine Tuning
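Summary: This paper addresses the generation of precise, clinically meaningful radiology reports from chest radiographs, a task made difficult by the complexity of medical language and the need for contextual understanding. The key to its solution is to integrate LLaMA 3.1 with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning, improving report coherence and clinical accuracy while remaining computationally efficient.
Link: https://arxiv.org/abs/2506.03178
Authors: Md. Zihad Bin Jahangir,Muhammad Ashad Kabir,Sumaiya Akter,Israt Jahan,Minh Chau
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages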
Abstract:Automated radiology report generation holds significant potential to reduce radiologists’ workload and enhance diagnostic accuracy. However, generating precise and clinically meaningful reports from chest radiographs remains challenging due to the complexity of medical language and the need for contextual understanding. Existing models often struggle with maintaining both accuracy and contextual relevance. In this paper, we present LLaMA-XR, a novel framework that integrates LLaMA 3.1 with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning. LLaMA-XR achieves improved coherence and clinical accuracy while maintaining computational efficiency. This efficiency is driven by an optimization strategy that enhances parameter utilization and reduces memory overhead, enabling faster report generation with lower computational resource demands. Extensive experiments conducted on the IU X-ray benchmark dataset demonstrate that LLaMA-XR outperforms a range of state-of-the-art methods. Our model achieves a ROUGE-L score of 0.433 and a METEOR score of 0.336, establishing new performance benchmarks in the domain. These results underscore LLaMA-XR’s potential as an effective and efficient AI system for automated radiology reporting, offering enhanced clinical utility and reliability.
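The low-rank adaptation at the core of QLoRA can be sketched in a few lines. This is a generic illustration of the LoRA forward pass, not LLaMA-XR's implementation: the frozen weight `w` is left untouched and only the small matrices `a` and `b` would be trained. The 4-bit quantization of the frozen weights that QLoRA adds is omitted here, and all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for a rank-r adapter.

    w: frozen weight, d_out x d_in
    a: trainable down-projection, r x d_in
    b: trainable up-projection, d_out x r
    """
    r = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]
```

Because `a` and `b` together hold r * (d_in + d_out) values instead of d_in * d_out, far fewer parameters are updated during fine-tuning; initialising `b` to zeros makes the adapted layer start out identical to the frozen one.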
[CV-148] Deep Learning-Based Breast Cancer Detection in Mammography: A Multi-Center Validation Study in Thai Population
Summary: This paper addresses automated breast cancer detection in mammography, aiming to improve screening efficiency and accuracy. The key to its solution is a modified EfficientNetV2 architecture with enhanced attention mechanisms, evaluated on in-domain, biopsy-confirmed, and out-of-domain datasets to assess generalization.
Link: https://arxiv.org/abs/2506.03177
Authors: Isarun Chamveha,Supphanut Chaiyungyuen,Sasinun Worakriangkrai,Nattawadee Prasawang,Warasinee Chaisangmongkon,Pornpim Korpraphong,Voraparee Suvannarerg,Shanigarn Thiravit,Chalermdej Kannawat,Kewalin Rungsinaporn,Suwara Issaragrisil,Payia Chadbunchachai,Pattiya Gatechumpol,Chawiporn Muktabhant,Patarachai Sereerat
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This study presents a deep learning system for breast cancer detection in mammography, developed using a modified EfficientNetV2 architecture with enhanced attention mechanisms. The model was trained on mammograms from a major Thai medical center and validated on three distinct datasets: an in-domain test set (9,421 cases), a biopsy-confirmed set (883 cases), and an out-of-domain generalizability set (761 cases) collected from two different hospitals. For cancer detection, the model achieved AUROCs of 0.89, 0.96, and 0.94 on the respective datasets. The system’s lesion localization capability, evaluated using metrics including Lesion Localization Fraction (LLF) and Non-Lesion Localization Fraction (NLF), demonstrated robust performance in identifying suspicious regions. Clinical validation through concordance tests showed strong agreement with radiologists: 83.5% classification and 84.0% localization concordance for biopsy-confirmed cases, and 78.1% classification and 79.6% localization concordance for out-of-domain cases. Expert radiologists’ acceptance rate also averaged 96.7% for biopsy-confirmed cases, and 89.3% for out-of-domain cases. The system achieved a System Usability Scale score of 74.17 for source hospital, and 69.20 for validation hospitals, indicating good clinical acceptance. These results demonstrate the model’s effectiveness in assisting mammogram interpretation, with the potential to enhance breast cancer screening workflows in clinical practice.
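The two localization metrics named in this abstract can be sketched as follows. The abstract does not give the paper's exact definitions or hit criterion, so this assumes the usual free-response (FROC) conventions: LLF is the share of true lesions hit by at least one mark, and NLF is the number of marks that hit no lesion, averaged per image. Boxes are `(x0, y0, x1, y1)`, marks are `(x, y)` points, and every name is hypothetical.

```python
def inside(mark, box):
    """True if a point mark falls within a lesion bounding box."""
    x, y = mark
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def llf_nlf(cases):
    """cases: list of (lesion_boxes, predicted_marks), one entry per image."""
    total_lesions = hit_lesions = false_marks = 0
    for lesions, marks in cases:
        total_lesions += len(lesions)
        # A lesion counts as localized if any mark lands inside it.
        hit_lesions += sum(any(inside(m, box) for m in marks) for box in lesions)
        # A mark that lands in no lesion is a non-lesion localization.
        false_marks += sum(not any(inside(m, box) for box in lesions) for m in marks)
    llf = hit_lesions / total_lesions if total_lesions else 0.0
    nlf = false_marks / len(cases)
    return llf, nlf
```

A higher LLF together with a lower NLF indicates that the system finds more true lesions while placing fewer spurious marks per image.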
[CV-149] Super-temporal-resolution Photoacoustic Imaging with Dynamic Reconstruction through Implicit Neural Representation in Sparse-view
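Summary: This paper addresses two problems in dynamic photoacoustic computed tomography (PACT): sparse data caused by the limited number of sensors, and the image degradation that results because traditional reconstruction methods do not capture inter-frame relationships. The key to its solution is implicit neural representation (INR): dynamic photoacoustic images are modelled as implicit functions encoded in a neural network that takes only spatiotemporal coordinates as input and is trained without external training data or prior images. The implicit continuity regularization provided by INR, together with explicit low-rank and sparsity regularization, suppresses artifacts and improves image quality.
Link: https://arxiv.org/abs/2506.03175
Authors: Youshen Xiao,Yiling Shi,Ruixi Sun,Hongjiang Wei,Fei Gao,Yuyao Zhang
Affiliations: ShanghaiTech University; Shanghai Jiao Tong University; University of Science and Technology of China; Suzhou Institute for Advanced Research
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: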
Abstract:Dynamic Photoacoustic Computed Tomography (PACT) is an important imaging technique for monitoring physiological processes, capable of providing high-contrast images of optical absorption at much greater depths than traditional optical imaging methods. However, practical instrumentation and geometric constraints limit the number of acoustic sensors available around the imaging target, leading to sparsity in sensor data. Traditional photoacoustic (PA) image reconstruction methods, when directly applied to sparse PA data, produce severe artifacts. Additionally, these traditional methods do not consider the inter-frame relationships in dynamic imaging. Temporal resolution is crucial for dynamic photoacoustic imaging, which is fundamentally limited by the low repetition rate (e.g., 20 Hz) and high cost of high-power laser technology. Recently, Implicit Neural Representation (INR) has emerged as a powerful deep learning tool for solving inverse problems with sparse data, by characterizing signal properties as continuous functions of their coordinates in an unsupervised manner. In this work, we propose an INR-based method to improve dynamic photoacoustic image reconstruction from sparse-views and enhance temporal resolution, using only spatiotemporal coordinates as input. Specifically, the proposed INR represents dynamic photoacoustic images as implicit functions and encodes them into a neural network. The weights of the network are learned solely from the acquired sparse sensor data, without the need for external training datasets or prior images. Benefiting from the strong implicit continuity regularization provided by INR, as well as explicit regularization for low-rank and sparsity, our proposed method outperforms traditional reconstruction methods under two different sparsity conditions, effectively suppressing artifacts and ensuring image quality.
[CV-150] Adaptive and Robust Image Processing on CubeSats
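Summary: This paper addresses the limited flexibility and complexity of image processing pipelines on CubeSats, which are constrained by scarce on-board resources. The key to its solution is two new systems: DIPP, a modular and configurable image processing pipeline framework, and DISH, a domain-specific language (DSL) and runtime system designed for low-power, memory-constrained processors. By decomposing the processing pipeline, DIPP adapts to changing mission goals while preserving robustness; DISH lowers memory requirements while offering expressiveness comparable to a general-purpose scripting language.
Link: https://arxiv.org/abs/2506.03152
Authors: Robert Bayer,Julian Priest,Daniel Kjellberg,Jeppe Lindhard,Nikolaj Sørenesen,Nicolaj Valsted,Ívar Óli,Pınar Tözün
Affiliations: IT University of Copenhagen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: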
Abstract:CubeSats offer a low-cost platform for space research, particularly for Earth observation. However, their resource-constrained nature and being in space, challenge the flexibility and complexity of the deployed image processing pipelines and their orchestration. This paper introduces two novel systems, DIPP and DISH, to address these challenges. DIPP is a modular and configurable image processing pipeline framework that allows for adaptability to changing mission goals even after deployment, while preserving robustness. DISH is a domain-specific language (DSL) and runtime system designed to schedule complex imaging workloads on low-power and memory-constrained processors. Our experiments demonstrate that DIPP’s decomposition of the processing pipelines adds negligible overhead, while significantly reducing the network requirements of updating pipelines and being robust against erroneous module uploads. Furthermore, we compare DISH to Lua, a general purpose scripting language, and demonstrate its comparable expressiveness and lower memory requirement.
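Artificial Intelligence
[AI-0] OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
[Quick Read]: This paper addresses the challenges of open-world mobile manipulation (OWMM): generalizing to open-ended instructions and environments, and the systemic complexity of integrating high-level decision making with low-level robot control. The key to the solution is a multi-modal agent architecture that maintains multi-view scene frames and agent states for decision making and controls the robot through function calling. In addition, to counter hallucination caused by domain shift, the authors introduce an agentic data synthesis pipeline that adapts a VLM to the task domain through instruction fine-tuning, which improves agent performance.
Link: https://arxiv.org/abs/2506.04217
Authors: Junting Chen,Haotian Liang,Lingxiao Du,Weiyun Wang,Mengkang Hu,Yao Mu,Wenhai Wang,Jifeng Dai,Ping Luo,Wenqi Shao,Lin Shao
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages of main content, 19 pages in total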
Abstract:The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systematic complexity to integrate high-level decision making with low-level robot control based on both global scene understanding and current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is the hallucination from domain shift. To enhance the agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM model to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models including GPT-4o and strong zero-shot generalization in real world. The project page is at this https URL
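[AI-1] Thinking Beyond Visibility: A Near-Optimal Policy Framework for Locally Interdependent Multi-Agent MDPs
[Quick Read]: This paper addresses the poor performance of existing near-optimal closed-form policies for Locally Interdependent Multi-Agent MDPs when visibility is small and fixed, in particular agents getting stuck during simulation because of the "Penalty Jittering" phenomenon. The key to the solution is the Extended Cutoff Policy Class, which the authors describe as, to the best of their knowledge, the first non-trivial class of near-optimal closed-form partially observable policies for any Locally Interdependent Multi-Agent MDP. These policies can remember agents beyond their visibility, which significantly improves performance in many small, fixed-visibility settings, resolves Penalty Jittering, and under certain conditions guarantees fully observable joint optimal behavior.
Link: https://arxiv.org/abs/2506.04215
Authors: Alex DeWeese,Guannan Qu
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: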
Abstract:Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are known to be NEXP-Complete and intractable to solve. However, for problems such as cooperative navigation, obstacle avoidance, and formation control, basic assumptions can be made about local visibility and local dependencies. The work DeWeese and Qu 2024 formalized these assumptions in the construction of the Locally Interdependent Multi-Agent MDP. In this setting, it establishes three closed-form policies that are tractable to compute in various situations and are exponentially close to optimal with respect to visibility. However, it is also shown that these solutions can have poor performance when the visibility is small and fixed, often getting stuck during simulations due to the so called “Penalty Jittering” phenomenon. In this work, we establish the Extended Cutoff Policy Class which is, to the best of our knowledge, the first non-trivial class of near optimal closed-form partially observable policies that are exponentially close to optimal with respect to the visibility for any Locally Interdependent Multi-Agent MDP. These policies are able to remember agents beyond their visibilities which allows them to perform significantly better in many small and fixed visibility settings, resolve Penalty Jittering occurrences, and under certain circumstances guarantee fully observable joint optimal behavior despite the partial observability. We also propose a generalized form of the Locally Interdependent Multi-Agent MDP that allows for transition dependence and extended reward dependence, then replicate our theoretical results in this setting.
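[AI-2] TracLLM: A Generic Framework for Attributing Long Context LLMs ACL USENIX-SECURITY
[Quick Read]: This paper asks how to pinpoint the texts in a long context (sentences, passages, or paragraphs) that contribute most to the output generated by a long context large language model (LLM), a process the authors call context traceback. Existing feature attribution methods such as Shapley perform sub-optimally or incur a large computational cost when applied to long context LLMs. The proposed solution is TracLLM, described as the first generic context traceback framework tailored to long context LLMs. Its key ingredients are an informed search based algorithm that improves efficiency, and contribution score ensemble and denoising techniques that improve accuracy.
Link: https://arxiv.org/abs/2506.04202
Authors: Yanting Wang,Wei Zou,Runpeng Geng,Jinyuan Jia
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in USENIX Security Symposium 2025. The code and data are at: this https URL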
Abstract:Long context large language models (LLMs) are deployed in many real-world applications such as RAG, agent, and broad LLM-integrated applications. Given an instruction and a long context (e.g., documents, PDF files, webpages), a long context LLM can generate an output grounded in the provided context, aiming to provide more accurate, up-to-date, and verifiable outputs while reducing hallucinations and unsupported claims. This raises a research question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs) in the context that contribute most to or are responsible for the generated output by an LLM? This process, which we call context traceback, has various real-world applications, such as 1) debugging LLM-based systems, 2) conducting post-attack forensic analysis for attacks (e.g., prompt injection attack, knowledge corruption attacks) to an LLM, and 3) highlighting knowledge sources to enhance the trust of users towards outputs generated by LLMs. When applied to context traceback for long context LLMs, existing feature attribution methods such as Shapley have sub-optimal performance and/or incur a large computational cost. In this work, we develop TracLLM, the first generic context traceback framework tailored to long context LLMs. Our framework can improve the effectiveness and efficiency of existing feature attribution methods. To improve the efficiency, we develop an informed search based algorithm in TracLLM. We also develop contribution score ensemble/denoising techniques to improve the accuracy of TracLLM. Our evaluation results show TracLLM can effectively identify texts in a long context that lead to the output of an LLM. Our code and data are at: this https URL.
[AI-3] MACS: Multi-Agent Reinforcement Learning for Optimization of Crystal Structures
[Quick Read]: This paper addresses geometry optimization of periodic crystal structures, a common and crucial task in computational chemistry and materials design. The key to the solution is a new multi-agent reinforcement learning method, Multi-Agent Crystal Structure optimization (MACS), which models geometry optimization as a partially observable Markov game in which atoms act as agents that adjust their positions to collectively discover a stable configuration.
Link: https://arxiv.org/abs/2506.04195
Authors: Elena Zamaraeva,Christopher M. Collins,George R. Darling,Matthew S. Dyer,Bei Peng,Rahul Savani,Dmytro Antypov,Vladimir V. Gusev,Judith Clymo,Paul G. Spirakis,Matthew J. Rosseinsky
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Geometry optimization of atomic structures is a common and crucial task in computational chemistry and materials design. Following the learning to optimize paradigm, we propose a new multi-agent reinforcement learning method called Multi-Agent Crystal Structure optimization (MACS) to address periodic crystal structure optimization. MACS treats geometry optimization as a partially observable Markov game in which atoms are agents that adjust their positions to collectively discover a stable configuration. We train MACS across various compositions of reported crystalline materials to obtain a policy that successfully optimizes structures from the training compositions as well as structures of larger sizes and unseen compositions, confirming its excellent scalability and zero-shot transferability. We benchmark our approach against a broad range of state-of-the-art optimization methods and demonstrate that MACS optimizes periodic crystal structures significantly faster, with fewer energy calculations, and the lowest failure rate.
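[AI-4] Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints
[Quick Read]: This paper addresses how to enforce physical constraints, such as conservation laws and physical consistency, in flow-based generative models. Existing methods typically rely on soft penalties or architectural biases, which cannot guarantee that hard constraints are satisfied. The key to the solution is Physics-Constrained Flow Matching (PCFM), which continuously guides the sampling process of a pretrained flow model through physics-based corrections, so that samples stay aligned with the learned flow while satisfying the constraints. Experiments on a range of partial differential equation (PDE) tasks show that PCFM outperforms the baselines and ensures exact constraint satisfaction at the final solution.
Link: https://arxiv.org/abs/2506.04171
Authors: Utkarsh Utkarsh,Pengfei Cai,Alan Edelman,Rafael Gomez-Bombarelli,Christopher Vincent Rackauckas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
Comments: 27 pages, 9 figures, 4 tables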
Abstract:Deep generative models have recently been applied to physical systems governed by partial differential equations (PDEs), offering scalable simulation and uncertainty-aware inference. However, enforcing physical constraints, such as conservation laws (linear and nonlinear) and physical consistencies, remains challenging. Existing methods often rely on soft penalties or architectural biases that fail to guarantee hard constraints. In this work, we propose Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear constraints in pretrained flow-based generative models. PCFM continuously guides the sampling process through physics-based corrections applied to intermediate solution states, while remaining aligned with the learned flow and satisfying physical constraints. Empirically, PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution. Our method provides a general framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.
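The difference between a soft penalty and a hard constraint mentioned in this entry can be made concrete with a toy case. The sketch below is not PCFM; it only assumes a single linear conservation constraint (the entries of a state vector must sum to a fixed total) and shows the closest-point projection that satisfies it exactly. The function name `project_to_total` and the sample values are hypothetical.

```python
from typing import List


def project_to_total(state: List[float], total: float) -> List[float]:
    """Orthogonally project `state` onto the hyperplane sum(x) == total.

    For the single linear constraint sum(x) = total, the closest point
    (in Euclidean distance) is obtained by subtracting the mean residual
    from every entry.
    """
    residual = sum(state) - total
    shift = residual / len(state)
    return [x - shift for x in state]


# A sample that violates the conservation constraint sum(x) == 3.0
sample = [1.0, 2.0, 3.0]
corrected = project_to_total(sample, 3.0)
print(corrected, sum(corrected))
```

A soft penalty would only push the sum towards the target during training; a projection of this kind enforces it at sampling time, which is the general flavour of constraint handling the abstract contrasts with penalties.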
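[AI-5] Horizon Reduction Makes RL Scalable
[Quick Read]: This paper studies the scalability of offline reinforcement learning (RL) algorithms on complex tasks. Although adding data is the usual way to improve performance, the authors observe that many existing offline RL algorithms saturate well below maximum performance even with very large datasets. They hypothesize that a long horizon is the main factor limiting scalability, verify this empirically, and show that several horizon reduction techniques substantially improve scaling. Building on this, they propose SHARSA, a minimal yet scalable method that reduces the horizon and achieves the best asymptotic performance and scaling behavior among the evaluated methods.
Link: https://arxiv.org/abs/2506.04168
Authors: Seohong Park,Kevin Frans,Deepinder Mann,Benjamin Eysenbach,Aviral Kumar,Sergey Levine
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: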
Abstract:In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: this https URL
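[AI-6] SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL
[Quick Read]: This paper addresses the challenges of applying reinforcement learning (RL) to high-degree-of-freedom (DoF) systems in the real world: safe exploration, low sample efficiency, and the brittleness of sim-to-real transfer. The key to the solution is SLAC, which pretrains a task-agnostic latent action space in a low-fidelity simulator, using a customized unsupervised skill discovery method that promotes temporal abstraction, disentanglement, and safety, and thereby enables efficient downstream learning.
Link: https://arxiv.org/abs/2506.04147
Authors: Jiaheng Hu,Peter Stone,Roberto Martín-Martín
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: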
Abstract:Building capable household and industrial robots requires mastering the control of versatile, high-degree-of-freedom (DoF) systems such as mobile manipulators. While reinforcement learning (RL) holds promise for autonomously acquiring robot control policies, scaling it to high-DoF embodiments remains challenging. Direct RL in the real world demands both safe exploration and high sample efficiency, which are difficult to achieve in practice. Sim-to-real RL, on the other hand, is often brittle due to the reality gap. This paper introduces SLAC, a method that renders real-world RL feasible for complex embodiments by leveraging a low-fidelity simulator to pretrain a task-agnostic latent action space. SLAC trains this latent action space via a customized unsupervised skill discovery method designed to promote temporal abstraction, disentanglement, and safety, thereby facilitating efficient downstream learning. Once a latent action space is learned, SLAC uses it as the action interface for a novel off-policy RL algorithm to autonomously learn downstream tasks through real-world interactions. We evaluate SLAC against existing methods on a suite of bimanual mobile manipulation tasks, where it achieves state-of-the-art performance. Notably, SLAC learns contact-rich whole-body tasks in under an hour of real-world interactions, without relying on any demonstrations or hand-crafted behavior priors. More information, code, and videos at this http URL
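[AI-7] macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
[Quick Read]: This paper addresses a gap in existing interactive benchmarks, which are mostly English-only and cover web use or Windows, Linux, and Android environments but not macOS. The key to the solution is macOSWorld, presented as the first comprehensive benchmark for GUI agents on macOS. It contains 202 multilingual interactive tasks across 30 applications (28 of them macOS-exclusive), offers task instructions and OS interfaces in five languages, and includes a dedicated safety benchmarking subset for evaluating how vulnerable agents are to deception attacks.
Link: https://arxiv.org/abs/2506.04135
Authors: Pei Yang,Hai Ci,Mike Zheng Shou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: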
Abstract:Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only, covering web-use or Windows, Linux, and Android environments, but not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30% success rate, while open-source lightweight research models lag at below 2%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 27.5% average degradation compared to English. Results from safety benchmarking also highlight that deception attacks are more general and demand immediate attention. macOSWorld is available at this https URL.
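[AI-8] TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems
[Quick Read]: This paper addresses the challenges of Trust, Risk, and Security Management (TRiSM) in agentic AI systems built on large language models (LLMs). Its key contribution is a TRiSM framework organized around four pillars: governance, explainability, ModelOps, and privacy/security. The review identifies unique threat vectors, illustrates them with case studies, proposes a comprehensive risk taxonomy, and surveys trust-building mechanisms, transparency and oversight techniques, and state-of-the-art explainability strategies for managing and evaluating distributed LLM agent systems.
Link: https://arxiv.org/abs/2506.04133
Authors: Shaina Raza,Ranjan Sapkota,Manoj Karkee,Christos Emmanouilidis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: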
Abstract:Agentic AI systems, built on large language models (LLMs) and deployed in multi-agent configurations, are redefining intelligent autonomy, collaboration and decision-making across enterprise and societal domains. This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM-based agentic multi-agent systems (AMAS). We begin by examining the conceptual foundations of agentic AI, its architectural differences from traditional AI agents, and the emerging system designs that enable scalable, tool-using autonomy. The TRiSM in the agentic AI framework is then detailed through four pillars (governance, explainability, ModelOps, and privacy/security), each contextualized for agentic LLMs. We identify unique threat vectors and introduce a comprehensive risk taxonomy for the agentic AI applications, supported by case studies illustrating real-world vulnerabilities. Furthermore, the paper also surveys trust-building mechanisms, transparency and oversight techniques, and state-of-the-art explainability strategies in distributed LLM agent systems. Additionally, metrics for evaluating trust, interpretability, and human-centered performance are reviewed alongside open benchmarking challenges. Security and privacy are addressed through encryption, adversarial defense, and compliance with evolving AI regulations. The paper concludes with a roadmap for responsible agentic AI, proposing research directions to align emerging multi-agent systems with robust TRiSM principles for safe, accountable, and transparent deployment.
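[AI-9] Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems
[Quick Read]: This paper addresses the challenges that growing system complexity and strict regulatory demands pose for developing safety-critical automotive software. The key to the solution is integrating Generative AI into the Software Development Lifecycle (SDLC): large language models (LLMs) automate code generation in languages such as C++, combined with safety-focused practices such as static verification, test-driven development, and iterative refinement. A feedback-driven pipeline integrates testing, simulation, and verification to support compliance with safety standards.
Link: https://arxiv.org/abs/2506.04038
Authors: Sven Kirchner,Alois C. Knoll
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages; Accepted for publication at the 36th IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania, June 22-25, 2025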
Abstract:Developing safety-critical automotive software presents significant challenges due to increasing system complexity and strict regulatory demands. This paper proposes a novel framework integrating Generative Artificial Intelligence (GenAI) into the Software Development Lifecycle (SDLC). The framework uses Large Language Models (LLMs) to automate code generation in languages such as C++, incorporating safety-focused practices such as static verification, test-driven development and iterative refinement. A feedback-driven pipeline ensures the integration of test, simulation and verification for compliance with safety standards. The framework is validated through the development of an Adaptive Cruise Control (ACC) system. Comparative benchmarking of LLMs ensures optimal model selection for accuracy and reliability. Results demonstrate that the framework enables automatic code generation while ensuring compliance with safety-critical requirements, systematically integrating GenAI into automotive software engineering. This work advances the use of AI in safety-critical domains, bridging the gap between state-of-the-art generative models and real-world safety requirements.
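[AI-10] Privacy and Security Threat for OpenAI GPTs
[Quick Read]: This paper addresses security and privacy threats in real-world large language model (LLM) applications, specifically instruction leaking attacks against custom GPTs and unwanted access to user data. The key to the solution is a multi-phase instruction leaking attack used to systematically measure the scope of the threat, together with a framework that assesses the effectiveness of defensive strategies and identifies unwanted behaviors in custom GPTs. In experiments on 10,000 real-world custom GPTs, over 98.8% were vulnerable to instruction leaking, and 77.5% of the GPTs that used defensive strategies were still vulnerable to basic attacks, which underlines the importance of integrating specific defenses into GPT instructions.
Link: https://arxiv.org/abs/2506.04036
Authors: Wei Wenying,Zhao Kaifa,Xue Lei,Fan Ming
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: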
Abstract:Large language models (LLMs) demonstrate powerful information handling capabilities and are widely integrated into chatbot applications. OpenAI provides a platform for developers to construct custom GPTs, extending ChatGPT’s functions and integrating external services. Since its release in November 2023, over 3 million custom GPTs have been created. However, such a vast ecosystem also conceals security and privacy threats. For developers, instruction leaking attacks threaten the intellectual property of instructions in custom GPTs through carefully crafted adversarial prompts. For users, unwanted data access behavior by custom GPTs or integrated third-party services raises significant privacy concerns. To systematically evaluate the scope of threats in real-world LLM applications, we develop three-phase instruction leaking attacks that target GPTs with different defense levels. Our widespread experiments on 10,000 real-world custom GPTs reveal that over 98.8% of GPTs are vulnerable to instruction leaking attacks via one or more adversarial prompts, and half of the remaining GPTs can also be attacked through multiround conversations. We also developed a framework to assess the effectiveness of defensive strategies and identify unwanted behaviors in custom GPTs. Our findings show that 77.5% of custom GPTs with defense strategies are vulnerable to basic instruction leaking attacks. Additionally, we reveal that 738 custom GPTs collect user conversational information, and identified 8 GPTs exhibiting data access behaviors that are unnecessary for their intended functionalities. Our findings raise awareness among GPT developers about the importance of integrating specific defensive strategies in their instructions and highlight users’ concerns about data privacy when using LLM-based applications.
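[AI-11] Interpretability by Design for Efficient Multi-Objective Reinforcement Learning
[Quick Read]: This paper addresses how multi-objective reinforcement learning (MORL) can effectively optimize several conflicting objectives to make RL more flexible and reliable in practical tasks. The key to the solution is a training scheme based on a locally linear map between the parameter space and the performance space. With it, an approximate Pareto front provides an interpretation of the current parameter vectors in terms of the objectives, which enables an efficient search within contiguous solution domains.
Link: https://arxiv.org/abs/2506.04022
Authors: Qiyue Xia,J. Michael Herrmann
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: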
Abstract:Multi-objective reinforcement learning (MORL) aims at optimising several, often conflicting goals in order to improve flexibility and reliability of RL in practical tasks. This can be achieved by finding diverse policies that are optimal for some objective preferences and non-dominated by optimal policies for other preferences so that they form a Pareto front in the multi-objective performance space. The relation between the multi-objective performance space and the parameter space that represents the policies is generally non-unique. Using a training scheme that is based on a locally linear map between the parameter space and the performance space, we show that an approximate Pareto front can provide an interpretation of the current parameter vectors in terms of the objectives which enables an effective search within contiguous solution domains. Experiments are conducted with and without retraining across different domains, and the comparison with previous methods demonstrates the efficiency of our approach.
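The notion of a Pareto front of non-dominated policies used in this entry can be illustrated with a short filter. This is a generic sketch of Pareto dominance only, not the training scheme of the paper; it assumes every objective is to be maximised, and the function name `pareto_front` and the sample returns are hypothetical.

```python
from typing import List, Sequence


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of non-dominated points (all objectives maximised)."""
    front = []
    for i, p in enumerate(points):
        dominated = False
        for j, q in enumerate(points):
            if j == i:
                continue
            # q dominates p if it is at least as good on every objective
            # and strictly better on at least one.
            at_least_as_good = all(b >= a for a, b in zip(p, q))
            strictly_better = any(b > a for a, b in zip(p, q))
            if at_least_as_good and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append(i)
    return front


# Hypothetical (objective 1, objective 2) returns of five policies
returns = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(returns)
print(front)  # → [0, 1, 3]
```

The pairwise check is quadratic in the number of policies, which is fine for small candidate sets like this one.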
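[AI-12] Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion INTERSPEECH2025
[Quick Read]: This paper addresses source timbre leakage in expressive voice conversion and aims to improve linguistic-acoustic disentanglement for better style transfer. The solution builds on a self-supervised, non-autoregressive framework with a conditional variational autoencoder. To reduce style leakage, it uses multilingual discrete speech units for content representation and reinforces embeddings with an augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, it incorporates local F0 information via cross-attention and extracts style embeddings enriched with global pitch and energy features.
Link: https://arxiv.org/abs/2506.04013
Authors: Seymanur Akti,Tuan Nam Nguyen,Alexander Waibel
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025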
Abstract:Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
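[AI-13] TransClean: Finding False Positives in Multi-Source Entity Matching under Real-World Conditions via Transitive Consistency
[Quick Read]: This paper addresses the false positives that entity matching algorithms produce under real-world conditions, where multi-source data is large-scale, noisy, and unlabeled. The key to the solution is TransClean, which uses the Transitive Consistency of a matching to iteratively refine it, gradually removing false positive matches while removing as few true positives as possible. The consistency estimate relies only on model evaluations and needs no manual labels; it yields an estimate of matching quality and indicates which record groups are likely to contain false positives.
Link: https://arxiv.org/abs/2506.04006
Authors: Fernando de Meer Pardo,Branka Hadji Misheva,Martin Braschler,Kurt Stockinger
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: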
Abstract:We present TransClean, a method for detecting false positive predictions of entity matching algorithms under real-world conditions characterized by large-scale, noisy, and unlabeled multi-source datasets that undergo distributional shifts. TransClean is explicitly designed to operate with multiple data sources in an efficient, robust and fast manner while accounting for edge cases and requiring limited manual labeling. TransClean leverages the Transitive Consistency of a matching, a measure of the consistency of a pairwise matching model f_theta on the matching it produces G_f_theta, based both on its predictions on directly evaluated record pairs and its predictions on implied record pairs. TransClean iteratively modifies a matching through gradually removing false positive matches while removing as few true positive matches as possible. In each of these steps, the estimation of the Transitive Consistency is exclusively done through model evaluations and produces quantities that can be used as proxies of the amounts of true and false positives in the matching while not requiring any manual labeling, producing an estimate of the quality of the matching and indicating which record groups are likely to contain false positives. In our experiments, we compare combining TransClean with a naively trained pairwise matching model (DistilBERT) and with a state-of-the-art end-to-end matching method (CLER) and illustrate the flexibility of TransClean in being able to detect most of the false positives of either setup across a variety of datasets. Our experiments show that TransClean induces an average +24.42 F1 score improvement for entity matching in a multi-source setting when compared to traditional pair-wise matching algorithms.
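The idea of checking a matching against the record pairs it implies can be sketched in a few lines. The code below is not the TransClean algorithm; it is a minimal illustration that assumes matches are given as record-ID pairs, groups them into connected components, and flags any group in which an implied (not directly predicted) pair receives a low score. The names `components`, `suspicious_groups`, `toy_score`, and the threshold 0.5 are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def components(pairs: Iterable[Pair]) -> List[Set[str]]:
    """Group record IDs into connected components with a small union-find."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups: Dict[str, Set[str]] = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())


def suspicious_groups(
    pairs: List[Pair],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Set[str]]:
    """Flag groups where a pair implied by transitivity scores below threshold."""
    direct = {frozenset(p) for p in pairs}
    flagged = []
    for group in components(pairs):
        implied = [
            p for p in combinations(sorted(group), 2) if frozenset(p) not in direct
        ]
        if any(score(a, b) < threshold for a, b in implied):
            flagged.append(group)
    return flagged


def toy_score(a: str, b: str) -> float:
    """Stand-in for a pairwise matching model; rejects the pair (a, c)."""
    return 0.1 if {a, b} == {"a", "c"} else 0.9


predicted = [("a", "b"), ("b", "c"), ("d", "e")]
flagged = suspicious_groups(predicted, toy_score)
print(flagged)
```

Here the predicted matches a-b and b-c imply a-c, which the toy model rejects, so the group {a, b, c} is flagged; the group {d, e} has no implied pairs and passes.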
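[AI-14] CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor
[Quick Read]: This paper addresses a weakness of performance predictors in neural architecture search (NAS): under the distribution shift between limited training samples and diverse test samples, they tend to learn spurious correlations and generalize poorly. The key to the solution is Causality-guided Architecture Representation Learning (CARL), which separates critical (causal) from redundant (non-causal) architecture features to make performance prediction generalizable. Specifically, CARL uses a substructure extractor to split the input architecture into critical and redundant substructures, then generates interventional samples by pairing critical representations with diverse redundant representations so that the critical features are prioritized.
Link: https://arxiv.org/abs/2506.04001
Authors: Han Ji,Yuqi Feng,Jiahao Fan,Yanan Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: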
Abstract:Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67% top-1 accuracy on CIFAR-10 using DARTS.
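[AI-15] A framework for Conditional Reasoning in Answer Set Programming
[Quick Read]: This paper addresses how to define conditional extensions of Answer Set Programming (ASP) that support conditional reasoning over the answer sets of a program. The key to the solution is a conditional logic with typicality, combined with a conditional knowledge base and an ASP program. Conditionals are interpreted through a multi-preferential semantics, with the KLM preferential semantics as a special case.
Link: https://arxiv.org/abs/2506.03997
Authors: Mario Alviano,Laura Giordano,Daniele Theseider Dupré
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 19 pages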
Abstract:In this paper we introduce a Conditional Answer Set Programming framework (Conditional ASP) for the definition of conditional extensions of Answer Set Programming (ASP). The approach builds on a conditional logic with typicality, and on the combination of a conditional knowledge base with an ASP program, and allows for conditional reasoning over the answer sets of the program. The formalism relies on a multi-preferential semantics (and on the KLM preferential semantics, as a special case) to provide an interpretation of conditionals.
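[AI-16] Causality-Aware Contrastive Learning for Robust Multivariate Time-Series Anomaly Detection ICML2025
[Quick Read]: This paper addresses the limited robustness and reliability of multivariate time-series anomaly detection (MTSAD) that results from under-using the complex causal relationships between variables. The key to the solution is CAROTS, a new MTSAD framework that brings causality into contrastive learning. Two data augmentors generate causality-preserving and causality-disturbing samples, which stand in for normal variations and synthetic anomalies respectively; contrastive learning on these samples trains an encoder that separates normal from abnormal samples based on causality. CAROTS also introduces a similarity-filtered one-class contrastive loss that lets the learning process gradually incorporate semantically diverse samples sharing common causal relationships.
Link: https://arxiv.org/abs/2506.03964
Authors: HyunGi Kim,Jisoo Mok,Dongjun Lee,Jaihyun Lew,Sungjae Kim,Sungroh Yoon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2025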
Abstract:Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities. The code is available at this https URL.
[AI-17] HtFLlib: A Comprehensive Heterogeneous Federated Learning Library and Benchmark KDD2025
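【Quick Read】: This paper addresses the lack of a comprehensive benchmark for the standardized evaluation and analysis of Heterogeneous Federated Learning (HtFL) methods, as well as the limited study of their effectiveness and robustness in scenarios such as the medical domain and sensor-signal modalities. The key to the solution is the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model-heterogeneity scenarios, includes implementations of 10 representative HtFL methods, and provides systematic evaluations of accuracy, convergence, computation cost, and communication cost.
Link: https://arxiv.org/abs/2506.03954
Authors: Jianqing Zhang,Xinghao Wu,Yanbing Zhou,Xiaoting Sun,Qiqi Cai,Yang Liu,Yang Hua,Zhenzhe Zheng,Jian Cao,Qiang Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by KDD2025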
Abstract:As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at this https URL.
[AI-18] Causal Explanations Over Time: Articulated Reasoning for Interactive Environments
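【Quick Read】: This paper addresses the limitations of conventional Structural Causal Explanations (SCEs) in complex scenarios, for example their inability to describe causal changes over multiple time steps or behaviors that involve feedback loops. The key to the solution is to generalize SCEs to a recursive formulation of explanation trees that captures the temporal interactions between causes, improving the ability to explain dynamic and complex causal relationships.
Link: https://arxiv.org/abs/2506.03915
Authors: Sebastian Rödling,Matej Zečević,Devendra Singh Dhami,Kristian Kersting
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Main paper: 9 pages, References: 2 pages, Supplementary: 9 pages. Number of figures: 10, number of tables: 3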
Abstract:Structural Causal Explanations (SCEs) can be used to automatically generate explanations in natural language to questions about given data that are grounded in a (possibly learned) causal model. Unfortunately, they work only for small data. In turn, they are not attractive for offering reasons for events, e.g., tracking causal changes over multiple time steps, or for a behavioral component that involves feedback loops through actions of an agent. To this end, we generalize SCEs to a (recursive) formulation of explanation trees to capture the temporal interactions between reasons. We show the benefits of this more general SCE algorithm on synthetic time-series data and a 2D grid game, and further compare it to the base SCE and other existing methods for causal explanations.
[AI-19] AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
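【Quick Read】: This paper aims to automate the complex operational workflows of industrial asset lifecycle management, including condition monitoring, maintenance planning, and intervention scheduling, in order to reduce human workload and system downtime. Traditional AI/ML approaches usually handle these problems in isolation and solve only narrow tasks within the operational pipeline. The key to the proposed solution is to use AI agents and Large Language Models (LLMs) for end-to-end automation across the entire asset lifecycle, autonomously managing tasks that previously required distinct expertise and manual coordination. To this end, the authors present AssetOpsBench, a unified framework and environment that guides the development, orchestration, and evaluation of domain-specific agents for Industry 4.0 applications.
Link: https://arxiv.org/abs/2506.03828
Authors: Dhaval Patel,Shuxin Lin,James Rayfield,Nianjun Zhou,Roman Vaculin,Natalia Martinez,Fearghal O’donncha,Jayant Kalagnanam
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 39 pages, 18 figures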
Abstract:AI for Industrial Asset Lifecycle Management aims to automate complex operational workflows – such as condition monitoring, maintenance planning, and intervention scheduling – to reduce human workload and minimize system downtime. Traditional AI/ML approaches have primarily tackled these problems in isolation, solving narrow tasks within the broader operational pipeline. In contrast, the emergence of AI agents and large language models (LLMs) introduces a next-generation opportunity: enabling end-to-end automation across the entire asset lifecycle. This paper envisions a future where AI agents autonomously manage tasks that previously required distinct expertise and manual coordination. To this end, we introduce AssetOpsBench – a unified framework and environment designed to guide the development, orchestration, and evaluation of domain-specific agents tailored for Industry 4.0 applications. We outline the key requirements for such holistic systems and provide actionable insights into building agents that integrate perception, reasoning, and control for real-world industrial operations. The software is available at this https URL.
[AI-20] When Does Closeness in Distribution Imply Representational Similarity? An Identifiability Perspective
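【Quick Read】: This paper asks why and when the representations learned by different deep neural networks are similar (representational similarity). The key to the solution is to take the perspective of identifiability theory: the authors argue that a measure of representational similarity should be invariant to transformations that leave the model distribution unchanged, and prove that a small Kullback-Leibler divergence between model distributions is not sufficient to guarantee similar representations. The work further defines a distributional distance for which closeness implies representational similarity, and uses synthetic experiments to examine how network width relates to closeness in distribution and to representational similarity.
Link: https://arxiv.org/abs/2506.03784
Authors: Beatrix M. G. Nielsen,Emanuele Marconato,Andrea Dittadi,Luigi Gresele
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: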
Abstract:When and why representations learned by different deep neural networks are similar is an active research topic. We choose to address these questions from the perspective of identifiability theory, which suggests that a measure of representational similarity should be invariant to transformations that leave the model distribution unchanged. Focusing on a model family which includes several popular pre-training approaches, e.g., autoregressive language models, we explore when models which generate distributions that are close have similar representations. We prove that a small Kullback-Leibler divergence between the model distributions does not guarantee that the corresponding representations are similar. This has the important corollary that models arbitrarily close to maximizing the likelihood can still learn dissimilar representations, a phenomenon mirrored in our empirical observations on models trained on CIFAR-10. We then define a distributional distance for which closeness implies representational similarity, and in synthetic experiments, we find that wider networks learn distributions which are closer with respect to our distance and have more similar representations. Our results establish a link between closeness in distribution and representational similarity.
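The invariance requirement mentioned in the abstract above can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes a softmax model of the form p(y|x) = softmax(W f(x)) with a 2-dimensional representation, and the values of `W`, `A`, and `f_x` are hypothetical. Applying an invertible matrix A to the representation and its inverse to the output weights changes the representation but leaves the model distribution, and therefore the KL divergence, unchanged.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]

def inverse_2x2(a):
    """Invert a 2x2 matrix with non-zero determinant."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical toy model: 3 classes, 2-dimensional representation.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]   # output weights
f_x = [1.0, -2.0]                            # representation of one input
A = [[2.0, 1.0], [0.0, 1.0]]                 # invertible reparameterization

f_transformed = matvec(A, f_x)               # new representation A f(x)
W_transformed = matmul(W, inverse_2x2(A))    # compensating weights W A^{-1}

p_original = softmax(matvec(W, f_x))
p_transformed = softmax(matvec(W_transformed, f_transformed))

print("representations:", f_x, "vs", f_transformed)
print("KL divergence:", kl_divergence(p_original, p_transformed))
```

The two models disagree on the representation yet agree on every predicted probability, which is why a similarity measure that compares raw representations directly would report a difference that the distributions cannot reveal.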
[AI-21] Scaling CrossQ with Weight Normalization
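【Quick Read】: This paper addresses low sample efficiency in reinforcement learning, in particular training instability at high update-to-data (UTD) ratios. The key to the solution is to integrate weight normalization into the CrossQ framework, which stabilizes training, prevents potential loss of plasticity, and keeps the effective learning rate constant. Without drastic interventions, the approach improves sample efficiency and scalability in complex environments.
Link: https://arxiv.org/abs/2506.03758
Authors: Daniel Palenicek,Florian Vogt,Jan Peters
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.07523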
Abstract:Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ’s scaling behavior with higher UTD ratios. We identify challenges in the training dynamics which are emphasized by higher UTDs, particularly Q-bias explosion and the growing magnitude of critic network weights. To address this, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, prevents potential loss of plasticity and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive or superior performance across a range of challenging tasks on the DeepMind control benchmark, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.
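Why normalizing weights keeps the effective learning rate constant can be seen on a toy problem. The sketch below is an illustration under stated assumptions, not the CrossQ implementation: a scale-invariant loss (negative cosine similarity to a hypothetical `target` vector) stands in for a critic with normalization layers, and `project_to_unit_norm` is a hypothetical helper. For such a loss the gradient is orthogonal to the weights, so plain gradient steps can only grow the weight norm, and the change in direction per step shrinks as the norm grows. Projecting back to unit norm after every step removes that drift.

```python
import math

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def project_to_unit_norm(weights):
    """Rescale the weight vector to unit L2 norm (hypothetical helper)."""
    n = norm(weights)
    return [x / n for x in weights]

def cosine_loss_grad(weights, target):
    """Gradient of L(w) = -cos(w, target), a scale-invariant toy loss."""
    nw, nt = norm(weights), norm(target)
    dot = sum(a * b for a, b in zip(weights, target))
    return [-(t / (nw * nt)) + dot * w / (nw ** 3 * nt)
            for w, t in zip(weights, target)]

def train(steps, lr, use_weight_norm):
    """Run plain gradient descent, optionally projecting after each step."""
    weights = [1.0, 0.0, 0.0]
    target = [0.0, 1.0, 0.0]
    for _ in range(steps):
        grad = cosine_loss_grad(weights, target)
        weights = [w - lr * g for w, g in zip(weights, grad)]
        if use_weight_norm:
            weights = project_to_unit_norm(weights)
    return weights

plain = train(steps=20, lr=0.5, use_weight_norm=False)
normalized = train(steps=20, lr=0.5, use_weight_norm=True)
print("weight norm without projection:", norm(plain))
print("weight norm with projection:   ", norm(normalized))
```

Without the projection the norm grows past 1 and each later step rotates the weights less; with it the norm stays at 1, so the same learning rate produces the same relative update throughout training.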
[AI-22] Misalignment or misuse? The AGI alignment tradeoff
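【Quick Read】: This paper asks how AI systems can be aligned with human goals without enabling catastrophic misuse. It argues that misaligned artificial general intelligence (AGI) may pose catastrophic risks, while aligned AGI may be misused by humans and create similar risks. The key to the solution is to explore alignment approaches that do not increase misuse risk and to analyze empirically how different technical approaches trade off misalignment risk against misuse risk. The authors argue that many current alignment techniques and their foreseeable improvements plausibly increase the risk of catastrophic misuse, so robustness, AI control methods, and good governance are needed to reduce the risk of a misuse catastrophe caused by aligned AGI.
Link: https://arxiv.org/abs/2506.03755
Authors: Max Hellrigel-Holderbaum,Leonard Dung
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Forthcoming in Philosophical Studies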
Abstract:Creating systems that are aligned with our goals is seen as a leading approach to create safe and beneficial AI in both leading AI companies and the academic field of AI safety. We defend the view that misaligned AGI - future, generally intelligent (robotic) AI agents - poses catastrophic risks. At the same time, we support the view that aligned AGI creates a substantial risk of catastrophic misuse by humans. While both risks are severe and stand in tension with one another, we show that - in principle - there is room for alignment approaches which do not increase misuse risk. We then investigate how the tradeoff between misalignment and misuse looks empirically for different technical approaches to AI alignment. Here, we argue that many current alignment techniques and foreseeable improvements thereof plausibly increase risks of catastrophic misuse. Since the impacts of AI depend on the social context, we close by discussing important social factors and suggest that to reduce the risk of a misuse catastrophe due to aligned AGI, techniques such as robustness, AI control methods and especially good governance seem essential.
[AI-23] Reason from Future: Reverse Thought Chain Enhances LLM Reasoning ACL2025
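【Quick Read】: This paper addresses two problems in the reasoning of small language models: the high computational cost caused by unbounded branching factors in the search space, and the tendency to fall into local optimum reasoning, where the model lacks a global perspective while solving a problem. The key to the solution is a new reasoning paradigm, Reason from Future (RFF), whose core is a bidirectional reasoning mechanism that combines top-down planning with bottom-up reasoning accumulation, prioritizes core logical relationships, and imposes goal-oriented constraints, thereby shrinking the search space and reducing the error accumulation of sequential forward reasoning.
Link: https://arxiv.org/abs/2506.03673
Authors: Yinlong Xu,Yanzhao Zheng,Shuoshuo Sun,Shuaihan Huang,Baohua Dong,Hangcheng Zhu,Ruohui Huang,Gang Yu,Hongxia Xu,Jian Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2025 findings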
Abstract:It has been demonstrated that carefully designed reasoning paradigms, like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), can enhance the reasoning capabilities of small language models through detailed thinking and extensive thought searching. However, unbounded branching factors in the search space create prohibitive reasoning consumption, and these methods fall into the trap of local optimum reasoning, which means the model lacks a global perspective while solving problems. We propose a novel reasoning paradigm called Reason from Future (RFF), which generates reasoning paths by bidirectional reasoning that combines top-down planning with bottom-up reasoning accumulation. The essence of RFF lies in its reverse reasoning mechanism, which prioritizes core logical relationships and imposes goal-oriented constraints on intermediate steps, thereby reducing the search space and mitigating error accumulation inherent in sequential forward reasoning. Empirical evaluations across diverse experiments demonstrate that RFF outperforms conventional paradigms with higher accuracy and a smaller search space when solving complex tasks.
[AI-24] GCFL: A Gradient Correction-based Federated Learning Framework for Privacy-preserving CPSS
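【Quick Read】: This paper addresses the degraded convergence and lower classification accuracy caused by adding differential privacy to federated learning. Existing methods mitigate the noise by dynamically adjusting it or by discarding some gradients, but they neither remove the noise that hinders convergence nor correct the gradients it has affected. The key to the proposed solution is a server-side gradient correction mechanism that detects deviations in the noisy local gradients and corrects them with a projection mechanism, which reduces the negative impact of the noise, promotes alignment among gradients from different clients, and guides the model toward a global optimum.
Link: https://arxiv.org/abs/2506.03618
Authors: Jiayi Wan,Xiang Zhu,Fanzhen Liu,Wei Fan,Xiaolong Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: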
Abstract:Federated learning, as a distributed architecture, shows great promise for applications in Cyber-Physical-Social Systems (CPSS). In order to mitigate the privacy risks inherent in CPSS, the integration of differential privacy with federated learning has attracted considerable attention. Existing research mainly focuses on dynamically adjusting the noise added or discarding certain gradients to mitigate the noise introduced by differential privacy. However, these approaches fail to remove the noise that hinders convergence and correct the gradients affected by the noise, which significantly reduces the accuracy of model classification. To overcome these challenges, this paper proposes a novel framework for differentially private federated learning that balances rigorous privacy guarantees with accuracy by introducing a server-side gradient correction mechanism. Specifically, after clients perform gradient clipping and noise perturbation, our framework detects deviations in the noisy local gradients and employs a projection mechanism to correct them, mitigating the negative impact of noise. Simultaneously, gradient projection promotes the alignment of gradients from different clients and guides the model towards convergence to a global optimum. We evaluate our framework on several benchmark datasets, and the experimental results demonstrate that it achieves state-of-the-art performance under the same privacy budget.
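The clip, perturb, and correct pipeline described in the abstract above can be sketched roughly as follows. The abstract does not say how deviations are detected or how the projection is defined, so this sketch swaps in assumptions of its own: the reference direction is the mean of the noisy client gradients, a gradient counts as deviating when its inner product with that reference is negative, and the correction removes the conflicting component (a PCGrad-style projection). The function names and the `noise_std` value are hypothetical and not calibrated to any privacy budget.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def clip_and_perturb(grad, clip_norm, noise_std, rng):
    """Client side: clip the gradient to an L2 bound, then add Gaussian noise."""
    n = math.sqrt(dot(grad, grad))
    scale = min(1.0, clip_norm / n) if n > 0 else 1.0
    return [g * scale + rng.gauss(0.0, noise_std) for g in grad]

def mean_gradient(grads):
    return [sum(column) / len(grads) for column in zip(*grads)]

def correct_by_projection(grads):
    """Server side (assumed rule): drop the component that opposes the mean."""
    ref = mean_gradient(grads)
    ref_sq = dot(ref, ref)
    corrected = []
    for g in grads:
        d = dot(g, ref)
        if d < 0 and ref_sq > 0:
            # Project g onto the hyperplane orthogonal to the reference.
            g = [gi - (d / ref_sq) * ri for gi, ri in zip(g, ref)]
        corrected.append(list(g))
    return corrected

rng = random.Random(0)
true_grads = [[1.0, 0.2], [0.9, -0.1], [1.1, 0.3]]   # hypothetical client gradients
noisy = [clip_and_perturb(g, clip_norm=1.0, noise_std=0.5, rng=rng) for g in true_grads]
aggregated = mean_gradient(correct_by_projection(noisy))
print("aggregated update:", aggregated)
```

The projection only touches gradients that point against the consensus direction, so clients whose noisy gradients already agree with the mean pass through unchanged.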
[AI-25] Training Cross-Morphology Embodied AI Agents: From Practical Challenges to Theoretical Foundations
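【Quick Read】: This paper addresses the training of cross-morphology embodied AI policies, that is, how to make an AI policy generalize across multiple robot morphologies (the Heterogeneous Embodied Agent Training, or HEAT, problem). The key to the solution is to formalize the problem as a structured Partially Observable Markov Decision Process (POMDP) and show that it is PSPACE-complete, which explains why existing reinforcement learning pipelines break down under morphological diversity: sequential training constraints, memory-policy coupling, and data incompatibility. The paper also explores Collective Adaptation, a distributed learning approach inspired by biological systems, which is NEXP-complete in theory but offers good scalability and deployment benefits in practice.
Link: https://arxiv.org/abs/2506.03613
Authors: Shaoshan Liu,Fan Wang,Hongjun Zhou,Yuanfeng Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments: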
Abstract:While theory and practice are often seen as separate domains, this article shows that theoretical insight is essential for overcoming real-world engineering barriers. We begin with a practical challenge: training a cross-morphology embodied AI policy that generalizes across diverse robot morphologies. We formalize this as the Heterogeneous Embodied Agent Training (HEAT) problem and prove it reduces to a structured Partially Observable Markov Decision Process (POMDP) that is PSPACE-complete. This result explains why current reinforcement learning pipelines break down under morphological diversity, due to sequential training constraints, memory-policy coupling, and data incompatibility. We further explore Collective Adaptation, a distributed learning alternative inspired by biological systems. Though NEXP-complete in theory, it offers meaningful scalability and deployment benefits in practice. This work illustrates how computational theory can illuminate system design trade-offs and guide the development of more robust, scalable embodied AI. For practitioners and researchers to explore this problem, the implementation code of this work has been made publicly available at this https URL
[AI-26] Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
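【Quick Read】: This paper addresses the shortcomings of existing game benchmarks in evaluating Large Language Model (LLM) capabilities, studying agentic modules, and providing fine-tuning datasets. The key solution is a benchmark called Orak, which covers 12 popular video games spanning all major genres, supports comprehensive studies of LLM capabilities and agentic modules, connects LLMs to games through a plug-and-play interface based on the Model Context Protocol (MCP), and provides a fine-tuning dataset of LLM gameplay trajectories across diverse game genres.
Link: https://arxiv.org/abs/2506.03610
Authors: Dongmin Park,Minkyu Kim,Beongjun Choi,Junhyuck Kim,Keon Lee,Jonghyun Lee,Inkyu Park,Byeong-Uk Lee,Jaeyoung Hwang,Jaewoo Ahn,Ameya S. Mahabaleshwarkar,Bilal Kartal,Pritam Biswas,Yoshi Suhara,Kangwook Lee,Jaewoong Cho
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: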
Abstract:Large Language Model (LLM) agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a foundational benchmark designed to train and evaluate LLM agents across diverse real-world video games. Unlike existing benchmarks, Orak includes 12 popular video games spanning all major genres, enabling comprehensive studies of LLM capabilities and agentic modules essential for intricate game scenarios. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on Model Context Protocol (MCP) that enables LLMs to seamlessly connect with games and manipulate agentic modules. Additionally, we propose a fine-tuning dataset, consisting of LLM gameplay trajectories across diverse game genres. Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects, establishing a foundation towards building generic gaming agents. Code is available at this https URL.
[AI-27] Adapting Rule Representation With Four-Parameter Beta Distribution for Learning Classifier Systems
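【Quick Read】: This paper addresses the difficulty of choosing a rule representation in Learning Classifier Systems (LCSs), especially the lack of an adaptive mechanism when different subspaces call for different representations. The key to the solution is a flexible rule representation based on a four-parameter beta distribution, integrated into a fuzzy-style LCS. The four-parameter beta distribution can form many function shapes, so the system can automatically select a suitable representation for each subspace and represent crisp or fuzzy decision boundaries of various shapes, which is more flexible than standard representations such as trapezoidal ones. The LCS also incorporates a generalization bias that favors crisp rules, improving interpretability without sacrificing accuracy.
Link: https://arxiv.org/abs/2506.03602
Authors: Hiroki Shiraishi,Yohei Hayamizu,Tomonori Hashiyama,Keiki Takadama,Hisao Ishibuchi,Masaya Nakata
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: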
Abstract:Rule representations significantly influence the search capabilities and decision boundaries within the search space of Learning Classifier Systems (LCSs), a family of rule-based machine learning systems that evolve interpretable models through evolutionary processes. However, it is very difficult to choose an appropriate rule representation for each problem. Additionally, some problems benefit from using different representations for different subspaces within the input space. Thus, an adaptive mechanism is needed to choose an appropriate rule representation for each rule in LCSs. This article introduces a flexible rule representation using a four-parameter beta distribution and integrates it into a fuzzy-style LCS. The four-parameter beta distribution can form various function shapes, and this flexibility enables our LCS to automatically select appropriate representations for different subspaces. Our rule representation can represent crisp/fuzzy decision boundaries in various boundary shapes, such as rectangles and bells, by controlling four parameters, compared to the standard representations such as trapezoidal ones. Leveraging this flexibility, our LCS is designed to adapt the appropriate rule representation for each subspace. Moreover, our LCS incorporates a generalization bias favoring crisp rules where feasible, enhancing model interpretability without compromising accuracy. Experimental results on real-world classification tasks show that our LCS achieves significantly superior test accuracy and produces more compact rule sets. Our implementation is available at this https URL. An extended abstract related to this work is available at this https URL.
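How one family of functions can cover both rectangular (crisp) and bell-shaped (fuzzy) rule conditions can be sketched with the shape of a four-parameter beta density. This is an illustration under stated assumptions, not the paper's definition: the function name `beta4_membership` is hypothetical, the shape parameters are restricted to values of at least 1, and the curve is rescaled so that its peak equals 1, which is one plausible way to turn a density into a membership degree.

```python
def beta4_membership(x, alpha, beta, lower, upper):
    """Membership degree from a four-parameter beta shape, rescaled to peak 1.

    alpha, beta: shape parameters (this sketch assumes alpha, beta >= 1)
    lower, upper: support of the distribution
    """
    if alpha < 1 or beta < 1 or lower >= upper:
        raise ValueError("sketch assumes alpha, beta >= 1 and lower < upper")
    if x < lower or x > upper:
        return 0.0
    if alpha == 1 and beta == 1:
        return 1.0  # uniform shape: a crisp interval [lower, upper]

    def shape(z):
        # Unnormalized beta density on the unit interval.
        return z ** (alpha - 1) * (1 - z) ** (beta - 1)

    z = (x - lower) / (upper - lower)
    mode = (alpha - 1) / (alpha + beta - 2)
    return shape(z) / shape(mode)

# Crisp (rectangular) condition: alpha = beta = 1.
print([beta4_membership(x, 1, 1, 0.0, 1.0) for x in (0.1, 0.5, 0.9, 1.2)])
# Fuzzy (bell-like) condition: alpha = beta = 3.
print([round(beta4_membership(x, 3, 3, 0.0, 1.0), 4) for x in (0.1, 0.5, 0.9)])
```

With alpha = beta = 1 every point inside the interval has full membership and every point outside has none, while larger shape parameters give a smooth peak, so a single rule can move between the two regimes by changing its parameters.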
[AI-28] Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
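【Quick Read】: This paper addresses the problems caused by the heuristics that Kronecker-factorization-based optimization algorithms such as Shampoo rely on (for example learning rate grafting and stale preconditioning): higher algorithmic complexity, extra hyperparameter tuning, and a lack of theoretical justification. The key to the solution is to analyze these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and to decouple the updates of the preconditioner's eigenvalues and eigenbasis. The authors show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and that correcting the eigenvalues directly removes the need for learning rate grafting. They also propose an adaptive criterion for the eigenbasis computation frequency, which decouples the update frequencies of different preconditioner matrices and makes it possible to study how approximation error affects convergence. These techniques offer a principled path toward removing Shampoo's heuristics and developing better Kronecker-factorization-based training algorithms.
Link: https://arxiv.org/abs/2506.03595
Authors: Runa Eschenhagen,Aaron Defazio,Tsung-Hsien Lee,Richard E. Turner,Hao-Jun Michael Shi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: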
Abstract:The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner’s eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner’s eigenvalues and how correcting the eigenvalues directly can eliminate the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo’s heuristics and developing improved Kronecker-factorization-based training algorithms.
[AI-29] A Class Inference Scheme With Dempster-Shafer Theory for Learning Fuzzy-Classifier Systems
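【Quick Read】: This paper addresses the limitations of the decision-making mechanism (the class inference scheme) in Learning Fuzzy-Classifier Systems (LFCSs), particularly in handling uncertainty and generalizing to unseen data. Most existing LFCSs use voting-based or single-winner-based inference schemes, which rely on classification performance on training data, may perform poorly on new data, and risk overfitting. The key to the solution is a new class inference scheme based on the Dempster-Shafer Theory of Evidence (DS theory), which computes belief masses for each specific class and for the "I don't know" state and infers a class from them, handling uncertainty and improving the transparency and reliability of LFCSs.
Link: https://arxiv.org/abs/2506.03588
Authors: Hiroki Shiraishi,Hisao Ishibuchi,Masaya Nakata
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: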
Abstract:The decision-making process significantly influences the predictions of machine learning models. This is especially important in rule-based systems such as Learning Fuzzy-Classifier Systems (LFCSs) where the selection and application of rules directly determine prediction accuracy and reliability. LFCSs combine evolutionary algorithms with supervised learning to optimize fuzzy classification rules, offering enhanced interpretability and robustness. Despite these advantages, research on improving decision-making mechanisms (i.e., class inference schemes) in LFCSs remains limited. Most LFCSs use voting-based or single-winner-based inference schemes. These schemes rely on classification performance on training data and may not perform well on unseen data, risking overfitting. To address these limitations, this article introduces a novel class inference scheme for LFCSs based on the Dempster-Shafer Theory of Evidence (DS theory). The proposed scheme handles uncertainty well. By using the DS theory, the scheme calculates belief masses (i.e., measures of belief) for each specific class and the "I don't know" state from each fuzzy rule and infers a class from these belief masses. Unlike the conventional schemes, the proposed scheme also considers the "I don't know" state that reflects uncertainty, thereby improving the transparency and reliability of LFCSs. Applied to a variant of LFCS (i.e., Fuzzy-UCS), the proposed scheme demonstrates statistically significant improvements in terms of test macro F1 scores across 30 real-world datasets compared to conventional voting-based and single-winner-based fuzzy inference schemes. It forms smoother decision boundaries, provides reliable confidence measures, and enhances the robustness and generalizability of LFCSs in real-world applications. Our implementation is available at this https URL.
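Combining per-rule belief masses that include an "I don't know" state can be sketched with Dempster's rule of combination. The sketch below is illustrative only: the abstract does not say how masses are derived from fuzzy rules, so the mass values are hypothetical, as are the names `combine` and `infer_class`. It assumes each rule puts mass only on single classes and on the whole frame (the "I don't know" state), in which case Dempster's rule reduces to the closed form used here.

```python
UNKNOWN = "I don't know"  # mass assigned to the whole frame of discernment

def combine(m1, m2):
    """Dempster's rule for masses on singleton classes plus the UNKNOWN state."""
    classes = (set(m1) | set(m2)) - {UNKNOWN}
    u1, u2 = m1.get(UNKNOWN, 0.0), m2.get(UNKNOWN, 0.0)
    combined = {}
    for c in classes:
        a, b = m1.get(c, 0.0), m2.get(c, 0.0)
        # Both rules agree on c, or one supports c while the other is unsure.
        combined[c] = a * b + a * u2 + u1 * b
    combined[UNKNOWN] = u1 * u2
    total = sum(combined.values())  # equals 1 minus the conflicting mass
    if total == 0:
        raise ValueError("rules are in total conflict")
    return {k: v / total for k, v in combined.items()}

def infer_class(rule_masses):
    """Fuse all rule masses, then pick the class with the largest belief mass."""
    fused = rule_masses[0]
    for masses in rule_masses[1:]:
        fused = combine(fused, masses)
    best = max((k for k in fused if k != UNKNOWN), key=fused.get)
    return best, fused

# Hypothetical masses from two matching fuzzy rules.
rule_a = {"A": 0.6, "B": 0.1, UNKNOWN: 0.3}
rule_b = {"A": 0.2, "B": 0.5, UNKNOWN: 0.3}
predicted, fused = infer_class([rule_a, rule_b])
print(predicted, fused)
```

The mass left on the "I don't know" state after fusion gives a direct reading of how much uncertainty remains, which a plain vote over rules would discard.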
[AI-30] Joint Beamforming and Resource Allocation for Delay Optimization in RIS-Assisted OFDM Systems: A DRL Approach
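【Quick Read】: This paper addresses joint phase design and resource allocation in downlink reconfigurable intelligent surface (RIS)-assisted orthogonal frequency division multiplexing (OFDM) systems, with the goal of optimizing average delay. The key solution is a hybrid deep reinforcement learning (DRL) approach in which proximal policy optimization (PPO)-Θ optimizes the RIS phase shift design and PPO-N makes subcarrier allocation decisions. A multi-agent strategy mitigates the curse of dimensionality in subcarrier allocation, and factors closely related to average delay are added to the state space so that resource allocation is more adaptive and network dynamics are captured more accurately. A transfer learning framework is also introduced to improve training efficiency and speed up convergence.
Link: https://arxiv.org/abs/2506.03586
Authors: Yu Ma,Chongtao Guo,Le Liang,Xiao Li,Shi Jin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: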
Abstract:This paper investigates a joint phase design and resource allocation problem in downlink reconfigurable intelligent surface (RIS)-assisted orthogonal frequency division multiplexing (OFDM) systems to optimize average delay, where data packets for each user arrive at the base station stochastically. The sequential optimization problem is inherently a Markov decision process (MDP), making it fall within the scope of reinforcement learning. To effectively handle the mixed action space and reduce the state space dimensionality, a hybrid deep reinforcement learning (DRL) approach is proposed. Specifically, proximal policy optimization (PPO)-Θ is employed to optimize RIS phase shift design, while PPO-N is responsible for subcarrier allocation decisions. To further mitigate the curse of dimensionality associated with subcarrier allocation, a multi-agent strategy is introduced to optimize the subcarrier allocation indicators more efficiently. Moreover, to achieve more adaptive resource allocation and accurately capture network dynamics, key factors closely related to average delay, including the number of backlogged packets in buffers and the current packet arrivals, are incorporated into the state space. Furthermore, a transfer learning framework is introduced to enhance training efficiency and accelerate convergence. Simulation results demonstrate that the proposed algorithm significantly reduces average delay, enhances resource allocation efficiency, and achieves superior system robustness and fairness compared to baseline methods.
[AI-31] Confidence-Guided Human-AI Collaboration: Reinforcement Learning with Distributional Proxy Value Propagation for Autonomous Driving
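【Quick Read】: This paper addresses the safe-exploration and distribution-shift problems that reinforcement learning and imitation learning face in autonomous driving, while overcoming the high cost and low efficiency of conventional human-AI collaboration methods that depend on extensive human intervention. The key to the solution is a confidence-guided human-AI collaboration (C-HAC) strategy. It applies a distributional proxy value propagation method within the distributional soft actor-critic (DSAC) framework and uses return distributions to represent human intentions, so that human-guided policies are learned quickly and stably with minimal human interaction. C-HAC also introduces a shared control mechanism that combines the learned human-guided policy with a self-learning policy that maximizes cumulative reward, allowing the agent to keep improving beyond human guidance, and it switches between the two policies dynamically through a confidence-based intervention function to maintain safety and performance guarantees.
Link: https://arxiv.org/abs/2506.03568
Authors: Li Zeqiao,Wang Yijing,Wang Haoyu,Li Zheng,Li Peng,Zuo zhiqiang,Hu Chuan
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: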
Abstract:Autonomous driving promises significant advancements in mobility, road safety and traffic efficiency, yet reinforcement learning and imitation learning face safe-exploration and distribution-shift challenges. Although human-AI collaboration alleviates these issues, it often relies heavily on extensive human intervention, which increases costs and reduces efficiency. This paper develops a confidence-guided human-AI collaboration (C-HAC) strategy to overcome these limitations. First, C-HAC employs a distributional proxy value propagation method within the distributional soft actor-critic (DSAC) framework. By leveraging return distributions to represent human intentions, C-HAC achieves rapid and stable learning of human-guided policies with minimal human interaction. Subsequently, a shared control mechanism is activated to integrate the learned human-guided policy with a self-learning policy that maximizes cumulative rewards. This enables the agent to explore independently and continuously enhance its performance beyond human guidance. Finally, a policy confidence evaluation algorithm capitalizes on DSAC’s return distribution networks to facilitate dynamic switching between human-guided and self-learning policies via a confidence-based intervention function. This ensures the agent can pursue optimal policies while maintaining safety and performance guarantees. Extensive experiments across diverse driving scenarios reveal that C-HAC significantly outperforms conventional methods in terms of safety, efficiency, and overall performance, achieving state-of-the-art results. The effectiveness of the proposed method is further validated through real-world road tests in complex traffic conditions. The videos and code are available at: this https URL.
[AI-32] SUMO-MCP: Leveraging the Model Context Protocol for Autonomous Traffic Simulation and Optimization
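【Quick Read】: This paper addresses the difficulty of using traffic simulation tools such as SUMO, which stems from complex manual workflows covering network download, demand generation, simulation setup, and result analysis. The key to the solution is the SUMO-MCP platform, which wraps SUMO's core utilities into a unified tool suite and adds auxiliary utilities for common preprocessing and postprocessing tasks. With SUMO-MCP, users can issue natural-language prompts to generate traffic scenarios from OpenStreetMap data, create demand, run batch simulations with multiple strategies, perform comparative analyses, and produce reports automatically. The platform also supports flexible custom workflows through dynamic combination of SUMO tools, making traffic simulation more accessible and reliable.
Link: https://arxiv.org/abs/2506.03548
Authors: Chenglong Ye,Gang Xiong,Junyou Shang,Xingyuan Dai,Xiaoyan Gong,Yisheng Lv
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: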
Abstract:Traffic simulation tools, such as SUMO, are essential for urban mobility research. However, such tools remain challenging for users due to complex manual workflows involving network download, demand generation, simulation setup, and result analysis. In this paper, we introduce SUMO-MCP, a novel platform that not only wraps SUMO’s core utilities into a unified tool suite but also provides additional auxiliary utilities for common preprocessing and postprocessing tasks. Using SUMO-MCP, users can issue simple natural-language prompts to generate traffic scenarios from OpenStreetMap data, create demand from origin-destination matrices or random patterns, run batch simulations with multiple signal-control strategies, perform comparative analyses with automated reporting, and detect congestion for signal-timing optimization. Furthermore, the platform allows flexible custom workflows by dynamically combining exposed SUMO tools without additional coding. Experiments demonstrate that SUMO-MCP makes traffic simulation significantly more accessible and reliable for researchers. We will release code for SUMO-MCP at this https URL in the future.
[AI-33] From Virtual Agents to Robot Teams: A Multi-Robot Framework Evaluation in High-Stakes Healthcare Context
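【Quick Read】: This paper addresses the problem that Multi-Agent Systems (MAS) perform well in virtual environments but generalize poorly to physical multi-agent robotic teams. Current frameworks usually treat agents as conceptual task executors rather than physically embodied entities and overlook critical real-world constraints such as spatial context and robotic capabilities (for example sensing and navigation). The key to the solution is to reconfigure and stress-test a hierarchical multi-agent robotic team built on the CrewAI framework in a simulated emergency department onboarding scenario, identify five persistent failure modes, and propose three design guidelines that emphasize process transparency, proactive failure recovery, and contextual grounding, with the aim of making Multi-Agent Robotic Systems (MARS) more robust and adaptable.
Link: https://arxiv.org/abs/2506.03546
Authors: Yuanchen Bai,Zijian Ding,Angelique Taylor
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: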
Abstract:Advancements in generative models have enabled multi-agent systems (MAS) to perform complex virtual tasks such as writing and code generation, but these capabilities do not generalize well to physical multi-agent robotic teams. Current frameworks often treat agents as conceptual task executors rather than physically embodied entities, and overlook critical real-world constraints such as spatial context and robotic capabilities (e.g., sensing and navigation). To probe this gap, we reconfigure and stress-test a hierarchical multi-agent robotic team built on the CrewAI framework in a simulated emergency department onboarding scenario. We identify five persistent failure modes: role misalignment; tool access violations; lack of in-time handling of failure reports; noncompliance with prescribed workflows; bypassing or false reporting of task completion. Based on this analysis, we propose three design guidelines emphasizing process transparency, proactive failure recovery, and contextual grounding. Our work informs the development of more resilient and robust multi-agent robotic systems (MARS), including opportunities to extend virtual multi-agent frameworks to the real world.
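[AI-34] CogniPair: From LLM Chatbots to Conscious AI Agents – GNWT-Based Multi-Agent Digital Twins for Social Pairing – Dating & Hiring Applications
【Quick read】: This paper addresses the lack of authentic human psychological processes in current large language model (LLM) agents, which limits their use for genuine digital twins and social AI applications. The key to the solution is to introduce Global Neuronal Workspace Theory (GNWT), integrating principles of human cognitive architecture into LLM agents: specialized sub-agents for emotion, memory, social norms, planning, and goal-tracking are coordinated through a global workspace mechanism, improving the agents' psychological authenticity.
Link: https://arxiv.org/abs/2506.03543
Authors: Wanghao Ye,Sihan Chen,Yiting Wang,Shwai He,Bowei Tian,Guoheng Sun,Ziyi Wang,Ziyao Wang,Yexiao He,Zheyu Shen,Meng Liu,Yuning Zhang,Meng Feng,Yang Wang,Siyuan Peng,Yilong Dai,Zhenle Duan,Hanzhang Qin,Ang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: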
Abstract:Current large language model (LLM) agents lack authentic human psychological processes necessary for genuine digital twins and social AI applications. To address this limitation, we present a computational implementation of Global Workspace Theory (GNWT) that integrates human cognitive architecture principles into LLM agents, creating specialized sub-agents for emotion, memory, social norms, planning, and goal-tracking coordinated through a global workspace mechanism. However, authentic digital twins require accurate personality initialization. We therefore develop a novel adventure-based personality test that evaluates true personality through behavioral choices within interactive scenarios, bypassing self-presentation bias found in traditional assessments. Building on these innovations, our CogniPair platform enables digital twins to engage in realistic simulated dating interactions and job interviews before real encounters, providing bidirectional cultural fit assessment for both romantic compatibility and workplace matching. Validation using 551 GNWT-Agents and Columbia University Speed Dating dataset demonstrates 72% correlation with human attraction patterns, 77.8% match prediction accuracy, and 74% agreement in human validation studies. This work advances psychological authenticity in LLM agents and establishes a foundation for intelligent dating platforms and HR technology solutions.
[AI-35] SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models CVPR2025
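【Quick read】: This paper addresses object goal navigation, in which an agent must locate a target object in an unexplored environment according to an instruction. Traditional learning-based methods rely on large-scale annotated data or on extensive environment interaction in a reinforcement learning setting, which makes them hard to generalize to new environments and limits scalability. The key to the solution is a zero-shot framework that combines the perceptual capabilities of Vision Foundation Models (VFMs) with a model-based planner, enabling long-horizon decision making and frontier exploration so that navigation can be completed without task-specific training.
Link: https://arxiv.org/abs/2506.03516
Authors: Arnab Debnath,Gregory J. Stein,Jana Kosecka
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPR 2025 workshop - Foundation Models Meet Embodied Agents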
Abstract:Object goal navigation is a fundamental task in embodied AI, where an agent is instructed to locate a target object in an unexplored environment. Traditional learning-based methods rely heavily on large-scale annotated data or require extensive interaction with the environment in a reinforcement learning setting, often failing to generalize to novel environments and limiting scalability. To overcome these challenges, we explore a zero-shot setting where the agent operates without task-specific training, enabling a more scalable and adaptable solution. Recent advances in Vision Foundation Models (VFMs) offer powerful capabilities for visual understanding and reasoning, making them ideal for agents to comprehend scenes, identify relevant regions, and infer the likely locations of objects. In this work, we present a zero-shot object goal navigation framework that integrates the perceptual strength of VFMs with a model-based planner that is capable of long-horizon decision making through frontier exploration. We evaluate our approach on the HM3D dataset using the Habitat simulator and demonstrate that our method achieves state-of-the-art performance in terms of success weighted by path length for zero-shot object goal navigation.
[AI-36] Computational Architects of Society: Quantum Machine Learning for Social Rule Genesis
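【Quick read】: This paper addresses the long-standing challenge of quantitative analysis in social science research, in particular the underexplored application of quantum computing to social theory. The key to the solution is a theoretical and computational framework that combines quantum mechanics with generative AI to simulate the formation and evolution of social norms. Drawing on quantum concepts such as superposition, entanglement, and probabilistic measurement, the framework treats society as a dynamic and uncertain system and uses the interaction and adaptive behavior of generative agents to reveal uncertainty, emergence, and interdependence in complex social systems.
Link: https://arxiv.org/abs/2506.03503
Authors: Shan Shan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: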
Abstract:The quantification of social science remains a longstanding challenge, largely due to the philosophical nature of its foundational theories. Although quantum computing has advanced rapidly in recent years, its relevance to social theory remains underexplored. Most existing research focuses on micro-cognitive models or philosophical analogies, leaving a gap in system-level applications of quantum principles to the analysis of social systems. This study addresses that gap by proposing a theoretical and computational framework that combines quantum mechanics with Generative AI to simulate the emergence and evolution of social norms. Drawing on core quantum concepts–such as superposition, entanglement, and probabilistic measurement–this research models society as a dynamic, uncertain system and sets up five ideal-type experiments. These scenarios are simulated using 25 generative agents, each assigned evolving roles as compliers, resistors, or enforcers. Within a simulated environment monitored by a central observer (the Watcher), agents interact, respond to surveillance, and adapt to periodic normative disruptions. These interactions allow the system to self-organize under external stress and reveal emergent patterns. Key findings show that quantum principles, when integrated with generative AI, enable the modeling of uncertainty, emergence, and interdependence in complex social systems. Simulations reveal patterns including convergence toward normative order, the spread of resistance, and the spontaneous emergence of new equilibria in social rules. In conclusion, this study introduces a novel computational lens that lays the groundwork for a quantum-informed social theory. It offers interdisciplinary insights into how society can be understood not just as a structure to observe but as a dynamic system to simulate and redesign through quantum technologies.
[AI-37] CORE: Constraint-Aware One-Step Reinforcement Learning for Simulation-Guided Neural Network Accelerator Design NEURIPS2025
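【Quick read】: This paper targets efficient optimization of high-dimensional structured designs under complex constraints and expensive evaluation costs in simulation-based design space exploration (DSE). Existing methods, including heuristics and multi-step reinforcement learning (RL), struggle to balance sampling efficiency and constraint satisfaction given sparse, delayed feedback and large hybrid action spaces. The proposed solution, CORE, is a constraint-aware one-step RL method. Its key ideas are to define a structured distribution over design configurations, introduce dependencies through a scaling-graph-based decoder, and penalize invalid designs through reward shaping, so that the policy can be learned efficiently without a value function by using a surrogate objective that compares the rewards of designs within a sampled batch.
Link: https://arxiv.org/abs/2506.03474
Authors: Yifeng Xiao,Yurong Xu,Ning Yan,Masood Mortazavi,Pierluigi Nuzzo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Preprint. 10 pages + appendix. Submitted to NeurIPS 2025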
Abstract:Simulation-based design space exploration (DSE) aims to efficiently optimize high-dimensional structured designs under complex constraints and expensive evaluation costs. Existing approaches, including heuristic and multi-step reinforcement learning (RL) methods, struggle to balance sampling efficiency and constraint satisfaction due to sparse, delayed feedback and large hybrid action spaces. In this paper, we introduce CORE, a constraint-aware, one-step RL method for simulation-guided DSE. In CORE, the policy agent learns to sample design configurations by defining a structured distribution over them, incorporating dependencies via a scaling-graph-based decoder, and by reward shaping to penalize invalid designs based on the feedback obtained from simulation. CORE updates the policy using a surrogate objective that compares the rewards of designs within a sampled batch, without learning a value function. This critic-free formulation enables efficient learning by encouraging the selection of higher-reward designs. We instantiate CORE for hardware-mapping co-design of neural network accelerators, demonstrating that it significantly improves sample efficiency and achieves better accelerator configurations compared to state-of-the-art baselines. Our approach is general and applicable to a broad class of discrete-continuous constrained design problems.
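The critic-free, one-step idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a plain categorical policy over four designs (no scaling-graph decoder), a made-up `simulate` function standing in for the simulator, a fixed `penalty` for invalid designs, and the batch mean as the only baseline. All names here are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def simulate(design):
    """Stand-in for an expensive simulator: returns (reward, is_valid)."""
    valid = design != 3                      # assumption: design 3 breaks a constraint
    reward = [0.2, 0.5, 0.9, 1.5][design]    # made-up raw rewards
    return reward, valid

def train(n_designs=4, batch_size=16, steps=500, lr=0.1, penalty=2.0, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_designs
    for _ in range(steps):
        probs = softmax(logits)
        # One-step episodes: each sample is a complete design.
        batch = rng.choices(range(n_designs), weights=probs, k=batch_size)
        rewards = []
        for d in batch:
            r, valid = simulate(d)
            rewards.append(r if valid else r - penalty)   # reward shaping
        baseline = sum(rewards) / batch_size              # batch mean, no critic
        grad = [0.0] * n_designs
        for d, r in zip(batch, rewards):
            advantage = r - baseline
            for k in range(n_designs):
                # Gradient of log softmax probability of d w.r.t. logit k.
                grad[k] += advantage * ((1.0 if k == d else 0.0) - probs[k]) / batch_size
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)

final_probs = train()
```

Comparing each design only against the other designs in its batch removes the need to fit a value function, which is the property the abstract highlights. The penalty turns the highest raw reward (design 3) into the lowest shaped reward, so the policy is pushed toward the best valid design.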
[AI-38] Verification-Guided Falsification for Safe RL via Explainable Abstraction and Risk-Aware Exploration ECAI
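【Quick read】: This paper addresses the difficulty of ensuring the safety of reinforcement learning (RL) policies in high-stakes environments; the core challenge is how to combine formal verification, interpretability, and targeted falsification to obtain comprehensive and rigorous safety assurance. The key to the solution is a hybrid framework integrating explainability, model checking, and risk-guided falsification. It builds a human-interpretable abstract policy graph using Comprehensible Abstract Policy Summarization (CAPS), passes it to the Storm probabilistic model checker to verify temporal safety specifications, and, when no violation is detected, uses risk estimates from model checking to guide a falsification strategy that prioritizes high-risk states and regions underrepresented in the trajectory data, raising the likelihood of uncovering latent violations. A lightweight safety shield additionally switches to a fallback policy at runtime when risk exceeds a threshold, providing failure mitigation.
Link: https://arxiv.org/abs/2506.03469
Authors: Tuan Le,Risal Shefin,Debashis Gupta,Thai Le,Sarra Alqahtani
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures, European Conference on Artificial Intelligence (ECAI)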
Abstract:Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires not only formal verification but also interpretability and targeted falsification. While model checking provides formal guarantees, its effectiveness is limited by abstraction quality and the completeness of the underlying trajectory dataset. We propose a hybrid framework that integrates (1) explainability, (2) model checking, and (3) risk-guided falsification to achieve both rigor and coverage. Our approach begins by constructing a human-interpretable abstraction of the RL policy using Comprehensible Abstract Policy Summarization (CAPS). This abstract graph, derived from offline trajectories, is verifier-friendly and semantically meaningful, and can be used as input to the Storm probabilistic model checker to verify satisfaction of temporal safety specifications. If the model checker identifies a violation, it returns an interpretable counterexample trace showing how the policy fails the safety requirement. However, if no violation is detected, we cannot conclude satisfaction due to potential limitations in the abstraction and in the coverage of the offline dataset. In such cases, we estimate the associated risk during model checking to guide a falsification strategy that prioritizes searching in high-risk states and regions underrepresented in the trajectory dataset. We further provide PAC-style guarantees on the likelihood of uncovering undetected violations. Finally, we incorporate a lightweight safety shield that switches to a fallback policy at runtime when such a risk exceeds a threshold, facilitating failure mitigation without retraining.
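The runtime safety shield mentioned at the end of the abstract can be sketched as a thin wrapper around the learned policy. This is only an illustration of threshold-based switching; the paper's risk estimates come from model checking, whereas `risk` below is a made-up function, and the policies and the 1-D state are hypothetical.

```python
def make_shielded_policy(policy, fallback_policy, risk_estimate, threshold):
    """Wrap `policy` so that states judged too risky go to `fallback_policy`."""
    def shielded(state):
        if risk_estimate(state) > threshold:
            return fallback_policy(state)
        return policy(state)
    return shielded

# Hypothetical 1-D example: the state is the distance to an obstacle.
learned_policy = lambda distance: "accelerate"
safe_fallback = lambda distance: "brake"
risk = lambda distance: 1.0 / (1.0 + distance)   # assumption: closer means riskier

act = make_shielded_policy(learned_policy, safe_fallback, risk, threshold=0.5)
```

Because the shield only intercepts actions at runtime, the learned policy itself is left untouched, which matches the abstract's point that mitigation needs no retraining.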
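[AI-39] The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks ICML2025
【Quick read】: This paper examines how data collection strategy affects agent performance in reinforcement learning (RL), in particular the bias-variance trade-off induced by the number of parallel environments and the rollout length, and the balance between sample efficiency and overfitting governed by the number of training passes over the collected data. The key to the solution is an empirical analysis showing how the configuration of parallel actors affects network plasticity and optimization stability, and verifying that larger dataset sizes and more parallel environments improve final performance more than longer rollouts do.
Link: https://arxiv.org/abs/2506.03404
Authors: Walter Mayor,Johan Obando-Ceron,Aaron Courville,Pablo Samuel Castro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)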
Abstract:The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
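The two data-collection knobs in the abstract trade off against each other at a fixed batch size, which a little arithmetic makes concrete. The sketch below uses the usual PPO bookkeeping (batch size equals environments times rollout length); the function name and the specific numbers are illustrative and not taken from the paper.

```python
def data_budget(num_envs, rollout_length, num_epochs, minibatch_size):
    """Transitions collected per iteration and gradient updates made on them."""
    batch_size = num_envs * rollout_length
    minibatches_per_epoch = batch_size // minibatch_size
    return {
        "batch_size": batch_size,
        "gradient_updates": num_epochs * minibatches_per_epoch,
    }

# Same amount of data per iteration, collected in two different shapes.
wide = data_budget(num_envs=128, rollout_length=16, num_epochs=4, minibatch_size=256)
long = data_budget(num_envs=16, rollout_length=128, num_epochs=4, minibatch_size=256)
```

Both configurations yield the same batch size and the same number of updates, so any performance difference between them comes from how the data was gathered: many short rollouts versus few long ones.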
[AI-40] Sampling Preferences Yields Simple Trustworthiness Scores
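【Quick read】: This paper addresses the added complexity of model selection under multi-dimensional evaluation frameworks: when large language models (LLMs) are evaluated along several performance dimensions, there is no obvious way to pick an optimal model. The key to the solution is preference sampling, which extracts a scalar trustworthiness score from multi-dimensional evaluation results by accounting for the many characteristics of model performance that users value, simplifying the decision. Compared with other aggregation methods, it reduces the set of candidate models more effectively and responds to users' prior preferences.
Link: https://arxiv.org/abs/2506.03399
Authors: Sean Steinle
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: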
Abstract:With the onset of large language models (LLMs), the performance of artificial intelligence (AI) models is becoming increasingly multi-dimensional. Accordingly, there have been several large, multi-dimensional evaluation frameworks put forward to evaluate LLMs. Though these frameworks are much more realistic than previous attempts that used only a single score like accuracy, multi-dimensional evaluations can complicate decision-making since there is no obvious way to select an optimal model. This work introduces preference sampling, a method to extract a scalar trustworthiness score from multi-dimensional evaluation results by considering the many characteristics of model performance which users value. We show that preference sampling improves upon alternative aggregation methods by using multi-dimensional trustworthiness evaluations of LLMs from TrustLLM and DecodingTrust. We find that preference sampling is consistently reductive, fully reducing the set of candidate models 100% of the time, whereas Pareto optimality never reduces the set by more than 50%. Likewise, preference sampling is consistently sensitive to user priors, allowing users to specify the relative weighting and confidence of their preferences, whereas averaging scores is insensitive to the users' prior knowledge.
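One way to picture turning multi-dimensional scores into a single number is to sample many plausible preference weightings and record how often each model comes out on top. The sketch below is an assumption about how such a scheme could look, not the paper's definition: it draws weights from a Dirichlet distribution centred on the user's `priors`, uses `confidence` as the concentration, and reports win frequency as the score. The model names and scores are invented.

```python
import random

def sample_weights(priors, confidence, rng):
    """Draw one weight vector from a Dirichlet centred on `priors`.

    Larger `confidence` keeps samples closer to the stated priors.
    """
    draws = [rng.gammavariate(p * confidence, 1.0) for p in priors]
    total = sum(draws)
    return [d / total for d in draws]

def preference_sampling(scores, priors, confidence=10.0, n_samples=2000, seed=0):
    """Return, per model, the fraction of sampled preferences under which it wins."""
    rng = random.Random(seed)
    wins = {name: 0 for name in scores}
    for _ in range(n_samples):
        w = sample_weights(priors, confidence, rng)
        best = max(scores, key=lambda m: sum(wi * si for wi, si in zip(w, scores[m])))
        wins[best] += 1
    return {name: count / n_samples for name, count in wins.items()}

# Hypothetical scores on three dimensions (e.g. safety, fairness, robustness).
scores = {
    "model_a": [0.9, 0.4, 0.6],
    "model_b": [0.5, 0.8, 0.7],
    "model_c": [0.4, 0.3, 0.5],   # worse than model_b on every dimension
}
priors = [0.5, 0.3, 0.2]          # relative importance the user assigns
result = preference_sampling(scores, priors)
```

In this toy setup a Pareto filter would still leave both `model_a` and `model_b` as candidates, while the sampled win rates order them according to the stated priors. A model that is worse on every dimension, like `model_c`, can never win a sample.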
[AI-41] Universal Reusability in Recommender Systems: The Case for Dataset- and Task-Independent Frameworks
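【Quick read】: This paper addresses the poor scalability and high adoption barrier of recommender systems in practice, which stem from the extensive dataset- and task-specific configuration they require. Traditional recommender systems typically need substantial manual intervention, domain expertise, and engineering effort to adapt to new data or tasks, which limits reusability. The key to the solution is the Dataset- and Task-Independent Recommender System (DTIRS) framework, which introduces a new Dataset Description Language (DsDL) to provide standardized dataset descriptions and explicit task definitions, supporting automated feature engineering, model selection, and optimization and reducing the need to rebuild the system.
Link: https://arxiv.org/abs/2506.03391
Authors: Tri Kurniawan Wijaya,Xinyang Shao,Gonzalo Fiz Pontiveros,Edoardo D’Amico
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: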
Abstract:Recommender systems are pivotal in delivering personalized experiences across industries, yet their adoption and scalability remain hindered by the need for extensive dataset- and task-specific configurations. Existing systems often require significant manual intervention, domain expertise, and engineering effort to adapt to new datasets or tasks, creating barriers to entry and limiting reusability. In contrast, recent advancements in large language models (LLMs) have demonstrated the transformative potential of reusable systems, where a single model can handle diverse tasks without significant reconfiguration. Inspired by this paradigm, we propose the Dataset- and Task-Independent Recommender System (DTIRS), a framework aimed at maximizing the reusability of recommender systems while minimizing barriers to entry. Unlike LLMs, which achieve task generalization directly, DTIRS focuses on eliminating the need to rebuild or reconfigure recommendation pipelines for every new dataset or task, even though models may still need retraining on new data. By leveraging the novel Dataset Description Language (DsDL), DTIRS enables standardized dataset descriptions and explicit task definitions, allowing autonomous feature engineering, model selection, and optimization. This paper introduces the concept of DTIRS and establishes a roadmap for transitioning from Level-1 automation (dataset-agnostic but task-specific systems) to Level-2 automation (fully dataset- and task-independent systems). Achieving this paradigm would maximize code reusability and lower barriers to adoption. We discuss key challenges, including the trade-offs between generalization and specialization, computational overhead, and scalability, while presenting DsDL as a foundational tool for this vision.
[AI-42] Automated Traffic Incident Response Plans using Generative Artificial Intelligence: Part 1 – Building the Incident Response Benchmark
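【Quick read】: This paper addresses efficiency and consistency in traffic incident response: traditional human decision-making is prone to inconsistencies and delays at critical moments, affecting safety outcomes and network performance. The key to the solution is to use generative AI to automatically generate response plans tailored to the characteristics of each incident, such as variable message sign deployment, lane closures, and emergency resource allocation, with the aim of significantly reducing incident resolution times.
Link: https://arxiv.org/abs/2506.03381
Authors: Artur Grigorev,Khaled Saleh,Jiwon Kim,Adriana-Simona Mihaita
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: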
Abstract:Traffic incidents remain a critical public safety concern worldwide, with Australia recording 1,300 road fatalities in 2024, the highest toll in 12 years. Similarly, the United States reports approximately 6 million crashes annually, raising significant challenges for fast response times and operational management. Traditional response protocols rely on human decision-making, which introduces potential inconsistencies and delays during critical moments when every minute impacts both safety outcomes and network performance. To address this issue, we propose a novel Incident Response Benchmark that uses generative artificial intelligence to automatically generate response plans for incoming traffic incidents. Our approach aims to significantly reduce incident resolution times by suggesting context-appropriate actions such as variable message sign deployment, lane closures, and emergency resource allocation adapted to specific incident characteristics. First, the proposed methodology uses real-world incident reports from the Performance Measurement System (PeMS) as training and evaluation data. We extract historically implemented actions from these reports and compare them against AI-generated response plans that suggest specific actions, such as lane closures, variable message sign announcements, and/or dispatching appropriate emergency resources. Second, model evaluations reveal that advanced generative AI models like GPT-4o and Grok 2 achieve superior alignment with expert solutions, demonstrated by minimized Hamming distances (averaging 2.96-2.98) and low weighted differences (approximately 0.27-0.28). Conversely, while Gemini 1.5 Pro records the lowest count of missed actions, its extremely high number of unnecessary actions (1547 compared to 225 for GPT-4o) indicates an over-triggering strategy that reduces the overall plan efficiency.
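The comparison metrics named in the abstract (Hamming distance, missed actions, unnecessary actions) are easy to compute once each plan is encoded as a binary vector over a fixed action list. The sketch below shows that encoding under assumptions: the action vocabulary `ACTIONS` and both plans are invented for illustration, and the paper's weighted difference is not reproduced.

```python
ACTIONS = ["lane_closure", "vms_message", "dispatch_tow", "dispatch_ems", "dispatch_fire"]

def encode(plan):
    """Turn a set of action names into a 0/1 vector over ACTIONS."""
    return [1 if action in plan else 0 for action in ACTIONS]

def compare_plans(expert_plan, generated_plan):
    expert, generated = encode(expert_plan), encode(generated_plan)
    pairs = list(zip(expert, generated))
    return {
        "hamming": sum(e != g for e, g in pairs),                  # any disagreement
        "missed": sum(e == 1 and g == 0 for e, g in pairs),        # expert did it, model did not
        "unnecessary": sum(e == 0 and g == 1 for e, g in pairs),   # model added it
    }

expert_plan = {"lane_closure", "dispatch_ems"}
generated_plan = {"lane_closure", "vms_message", "dispatch_tow"}
report = compare_plans(expert_plan, generated_plan)
```

The Hamming distance is the sum of missed and unnecessary actions, which is why a model can have few missed actions yet still score poorly overall if it over-triggers.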
[AI-43] Adversarial Attacks on Robotic Vision Language Action Models
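【Quick read】: This paper investigates potential security vulnerabilities of vision-language-action models (VLAs) under adversarial attack, in particular whether they inherit the vulnerabilities of large language models (LLMs). The key to the solution is to adapt LLM jailbreaking attacks and apply them to VLAs: a textual attack applied once at the start of a task gives complete control over the VLA, and the attack is shown to often persist over longer horizons.
Link: https://arxiv.org/abs/2506.03350
Authors: Eliot Krzysztof Jones,Alexander Robey,Andy Zou,Zachary Ravichandran,George J. Pappas,Hamed Hassani,Matt Fredrikson,J. Zico Kolter
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: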
Abstract:The emergence of vision-language-action models (VLAs) for end-to-end control is reshaping the field of robotics by enabling the fusion of multimodal sensory inputs at the billion-parameter scale. The capabilities of VLAs stem primarily from their architectures, which are often based on frontier large language models (LLMs). However, LLMs are known to be susceptible to adversarial misuse, and given the significant physical risks inherent to robotics, questions remain regarding the extent to which VLAs inherit these vulnerabilities. Motivated by these concerns, in this work we initiate the study of adversarial attacks on VLA-controlled robots. Our main algorithmic contribution is the adaptation and application of LLM jailbreaking attacks to obtain complete control authority over VLAs. We find that textual attacks, which are applied once at the beginning of a rollout, facilitate full reachability of the action space of commonly used VLAs and often persist over longer horizons. This differs significantly from LLM jailbreaking literature, as attacks in the real world do not have to be semantically linked to notions of harm. We make all code available at this https URL .
[AI-44] Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity
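【Quick read】: This paper addresses the high communication overhead and non-independent and identically distributed (Non-IID) data challenges of fine-tuning large language models (LLMs) in federated learning. The key solution is Meerkat, a sparse zeroth-order optimization (ZO) method that fine-tunes only a transferable, static, extremely sparse subset of parameters, enabling efficient communication and high-frequency synchronization and thereby mitigating the performance degradation caused by Non-IID data. Meerkat also introduces a virtual path mechanism that reveals the GradIP phenomenon, which it uses to identify extreme Non-IID clients and further improve the quality of the aggregated model.
Link: https://arxiv.org/abs/2506.03337
Authors: Yide Ran,Wentao Guo,Jingwei Sun,Yanzhou Pan,Xiaodong Yu,Hao Wang,Jianwen Xie,Yiran Chen,Denghui Zhang,Zhaozhuo Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 56 pages, 11 figures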
Abstract:Federated Learning enables collaborative fine-tuning of Large Language Models (LLMs) across decentralized Non-Independent and Identically Distributed (Non-IID) clients, but such models’ massive parameter sizes lead to significant memory and communication challenges. This work introduces Meerkat, a sparse zeroth-order optimization (ZO) method designed for federated LLM fine-tuning. By limiting fine-tuning to a transferable, static, extremely sparse subset of parameters, Meerkat achieves remarkable communication efficiency, enabling cost-effective high-frequency synchronization. With theoretical analysis and experiments, we show that this high-frequency communication effectively mitigates Non-IID data challenges and leads to superior performance compared to full-parameter ZO. Furthermore, experimental results show that Meerkat outperforms existing sparsity baselines with better performance at the same communication frequency. To further handle Non-IID drift, Meerkat leverages traceable local updates and forms a virtual path for each client. This virtual path mechanism reveals the GradIP phenomenon: the inner products between LLM pre-training gradients maintained by the server and client gradients estimated via ZO converge for extreme Non-IID clients but oscillate for IID ones. This distinct behavior provides a signal for identifying clients with extreme data heterogeneity. Using this signal, Meerkat-vp is proposed to analyze GradIP trajectories to identify extreme Non-IID clients and apply early stopping to enhance aggregated model quality. Experiments confirm that Meerkat and Meerkat-vp significantly improve the efficiency and effectiveness of ZO federated LLM fine-tuning.
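To make "sparse zeroth-order optimization" concrete, the sketch below applies a generic two-point ZO gradient estimate to a fixed subset of parameters. It is not Meerkat itself: the quadratic `loss` stands in for an LLM fine-tuning objective, the index set `mask_idx` is chosen by hand rather than by the paper's transferable-sparsity procedure, and there is no federated aggregation.

```python
import random

def loss(params):
    """Toy quadratic objective standing in for a fine-tuning loss."""
    return sum((p - 1.0) ** 2 for p in params)

def sparse_zo_step(params, mask_idx, lr, eps, rng):
    """One two-point zeroth-order update restricted to the indices in `mask_idx`."""
    z = {i: rng.gauss(0.0, 1.0) for i in mask_idx}    # random direction on the mask
    plus, minus = list(params), list(params)
    for i, zi in z.items():
        plus[i] += eps * zi
        minus[i] -= eps * zi
    # Only forward evaluations are needed; no backpropagation.
    projected_grad = (loss(plus) - loss(minus)) / (2 * eps)
    updated = list(params)
    for i, zi in z.items():
        updated[i] -= lr * projected_grad * zi
    return updated, projected_grad

rng = random.Random(0)
params = [0.0] * 10
mask_idx = [1, 4, 7]          # static sparse subset (hand-picked for this sketch)
for _ in range(200):
    params, projected_grad = sparse_zo_step(params, mask_idx, lr=0.05, eps=1e-3, rng=rng)
```

Each step produces a single scalar (`projected_grad`) plus a random direction over a handful of coordinates. That small footprint is what makes frequent synchronization cheap in this family of methods, although the exact communication protocol is specific to the paper.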
[AI-45] A Differential Perspective on Distributional Reinforcement Learning
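【Quick read】: This paper addresses the fact that existing distributional reinforcement learning (distributional RL) methods apply only to the discounted setting and do not handle modeling and optimizing the long-run per-step reward distribution in the average-reward setting. The key to the solution is a quantile-based approach that yields the first algorithms able to learn and/or optimize the long-run per-step reward distribution as well as the differential return distribution of an average-reward Markov decision process (MDP). The approach comes with convergence guarantees for the tabular algorithms, scales well, performs competitively with non-distributional methods in experiments, and captures richer information about long-run reward and return distributions.
Link: https://arxiv.org/abs/2506.03333
Authors: Juan Sebastian Rojas,Chi-Guhn Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: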
Abstract:To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a potentially-discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time-step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms consistently yield competitive performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run reward and return distributions.
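The building block behind quantile-based distributional methods is a stochastic update that nudges each quantile estimate up or down depending on whether the observed sample falls above or below it. The sketch below shows only that generic quantile-regression update on a stream of per-step rewards. It is not the paper's differential algorithm, and the uniform reward stream, step size, and function name are assumptions made for illustration.

```python
import random

def learn_quantiles(reward_stream, taus, step_size=0.005):
    """Track the tau-quantiles of a reward stream with a stochastic update."""
    estimates = [0.0 for _ in taus]
    for r in reward_stream:
        for i, tau in enumerate(taus):
            below = 1.0 if r < estimates[i] else 0.0
            # Moves up by step_size * tau when the sample is above the estimate,
            # and down by step_size * (1 - tau) when it is below.
            estimates[i] += step_size * (tau - below)
    return estimates

rng = random.Random(0)
rewards = (rng.random() for _ in range(20000))   # per-step rewards, uniform on [0, 1)
taus = [0.1, 0.5, 0.9]
quantiles = learn_quantiles(rewards, taus)
```

At equilibrium the upward and downward moves balance exactly when a fraction `tau` of samples falls below the estimate, so each estimate settles near the corresponding quantile of the per-step reward distribution.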
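[AI-46] Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows
【Quick read】: This paper addresses stability problems in feedback-based agentic workflows caused by judges that give deceptive or misleading feedback. The key solution is a two-dimensional framework for analyzing judge behavior along intent (from constructive to malicious) and knowledge source (from parametric-only models to retrieval-augmented systems), together with the WAFER-QA benchmark, which uses critiques grounded in retrieved web evidence to evaluate how robust agentic systems are to factually supported adversarial feedback.
Link: https://arxiv.org/abs/2506.03332
Authors: Yifei Ming,Zixuan Ke,Xuan-Phi Nguyen,Jiayu Wang,Shafiq Joty
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: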
Abstract:Agentic workflows – where multiple large language model (LLM) instances interact to solve tasks – are increasingly built on feedback mechanisms, where one model evaluates and critiques another. Despite the promise of feedback-driven improvement, the stability of agentic workflows rests on the reliability of the judge. However, judges may hallucinate information, exhibit bias, or act adversarially – introducing critical vulnerabilities into the workflow. In this work, we present a systematic analysis of agentic workflows under deceptive or misleading feedback. We introduce a two-dimensional framework for analyzing judge behavior, along axes of intent (from constructive to malicious) and knowledge (from parametric-only to retrieval-augmented systems). Using this taxonomy, we construct a suite of judge behaviors and develop WAFER-QA, a new benchmark with critiques grounded in retrieved web evidence to evaluate the robustness of agentic workflows against factually supported adversarial feedback. We reveal that even the strongest agents are vulnerable to persuasive yet flawed critiques – often switching correct answers after a single round of misleading feedback. Taking a step further, we study how model predictions evolve over multiple rounds of interaction, revealing distinct behavioral patterns between reasoning and non-reasoning models. Our findings highlight fundamental vulnerabilities in feedback-based workflows and offer guidance for building more robust agentic systems.
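[AI-47] The Future of Continual Learning in the Era of Foundation Models: Three Key Directions
【Quick read】: This paper asks whether continual learning is still necessary given the rise of Large Language Models (LLMs) and foundation models. It argues that although centralized, monolithic models can handle diverse tasks with internet-scale knowledge, continual learning remains essential, and that the key lies in continual compositionality: dynamically composing, recombining, and adapting foundation models and agents to build scalable, modular intelligent systems. On this view, continual compositionality marks the rebirth of continual learning, and future AI will depend on an ecosystem of continually evolving and interacting models.
Link: https://arxiv.org/abs/2506.03320
Authors: Jack Bell,Luigi Quarantiello,Eric Nuertey Coleman,Lanpei Li,Malio Li,Mauro Madeddu,Elia Piccoli,Vincenzo Lomonaco
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 1 figure, accepted at TCAI workshop 2025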
Abstract:Continual learning–the ability to acquire, retain, and refine knowledge over time–has always been fundamental to intelligence, both human and artificial. Historically, different AI paradigms have acknowledged this need, albeit with varying priorities: early expert and production systems focused on incremental knowledge consolidation, while reinforcement learning emphasised dynamic adaptation. With the rise of deep learning, deep continual learning has primarily focused on learning robust and reusable representations over time to solve sequences of increasingly complex tasks. However, the emergence of Large Language Models (LLMs) and foundation models has raised the question: Do we still need continual learning when centralised, monolithic models can tackle diverse tasks with access to internet-scale knowledge? We argue that continual learning remains essential for three key reasons: (i) continual pre-training is still necessary to ensure foundation models remain up to date, mitigating knowledge staleness and distribution shifts while integrating new information; (ii) continual fine-tuning enables models to specialise and personalise, adapting to domain-specific tasks, user preferences, and real-world constraints without full retraining, avoiding the need for computationally expensive long context-windows; (iii) continual compositionality offers a scalable and modular approach to intelligence, enabling the orchestration of foundation models and agents to be dynamically composed, recombined, and adapted. While continual pre-training and fine-tuning are explored as niche research directions, we argue it is continual compositionality that will mark the rebirth of continual learning. The future of AI will not be defined by a single static model but by an ecosystem of continually evolving and interacting models, making continual learning more relevant than ever.
[AI-48] Axiomatics of Restricted Choices by Linear Orders of Sets with Minimum as Fallback
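【Quick read】: This paper studies how linear orders can realize choice functions when the set of potential choices is restricted. In such restricted settings, constructing a choice function through a relation on the alternatives is not always possible. The key result is a proof that a choice function can always be constructed via a linear order on sets of alternatives, even when a fallback value is encoded as the minimal element of the linear order. The paper gives the corresponding axiomatics for the general case and for union-closed input restrictions.
Link: https://arxiv.org/abs/2506.03315
Authors: Kai Sauerwald,Kenneth Skiba,Eduardo Fermé,Thomas Meyer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: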
Abstract:We study how linear orders can be employed to realise choice functions for which the set of potential choices is restricted, i.e., the possible choices do not range over the full powerset of all alternatives. In such restricted settings, constructing a choice function via a relation on the alternatives is not always possible. However, we show that one can always construct a choice function via a linear order on sets of alternatives, even when a fallback value is encoded as the minimal element in the linear order. The axiomatics of such choice functions are presented for the general case and the case of union-closed input restrictions. Restricted choice structures have applications in knowledge representation and reasoning, and here we discuss their applications for theory change and abstract argumentation.
[AI-49] Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
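【Quick read】: This paper addresses the lack of safety guarantees and interpretability of vision-language models (VLMs) in language-guided robot planning, as well as the significant expert knowledge that classical symbolic planners need for setup. The key to the solution is ViLaIn-TAMP, a hybrid planning framework that achieves verifiable, interpretable, and autonomous robot behavior through three core components: first, ViLaIn converts multimodal inputs into structured problem specifications; second, a modular Task and Motion Planning (TAMP) system generates executable trajectory sequences through symbolic and geometric constraint reasoning; third, a corrective planning module uses feedback from failed solution attempts to adapt logic and geometric feasibility constraints and refine the specification.
Link: https://arxiv.org/abs/2506.03270
Authors: Jeremy Siburian,Keisuke Shirai,Cristian C. Beltran-Hernandez,Masashi Hamaya,Michael Görner,Atsushi Hashimoto
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project website: this https URL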
Abstract:While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) ViLaIn (Vision-Language Interpreter), a prior framework that converts multimodal inputs into structured problem specifications using off-the-shelf VLMs without additional domain-specific training, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning and can utilize learning-based skills for key manipulation phases, and (3) a corrective planning module which receives concrete feedback on failed solution attempts from the motion and task planning components and can feed adapted logic and geometric feasibility constraints back to ViLaIn to improve and further refine the specification. We evaluate our framework on several challenging manipulation tasks in a cooking domain. We demonstrate that the proposed closed-loop corrective architecture gives ViLaIn-TAMP a mean success rate more than 30% higher than the same system without corrective planning.
[AI-50] BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF
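[Quick Read]: This paper addresses the adversarial attacks that text-to-image (T2I) models may face when they are aligned with Reinforcement Learning from Human Feedback (RLHF). Its key contribution is BadReward, a stealthy clean-label poisoning attack that induces feature collisions between visually contradictory preference data instances in the reward model of multimodal RLHF, thereby corrupting the reward model and indirectly compromising the integrity of the T2I model. Unlike earlier alignment poisoning techniques that target the text modality alone, BadReward does not depend on the preference annotation process, which increases its stealth and practical threat.
Link: https://arxiv.org/abs/2506.03234
Authors: Kaiwen Duan,Hongwei Yao,Yufei Chen,Ziyun Li,Tong Qiao,Zhan Qin,Cong Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: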
Abstract:Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF’s feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of preference data with natural-appearing examples. Specifically, we propose BadReward, a stealthy clean-label poisoning attack targeting the reward model in multi-modal RLHF. BadReward operates by inducing feature collisions between visually contradicted preference data instances, thereby corrupting the reward model and indirectly compromising the T2I model’s integrity. Unlike existing alignment poisoning techniques focused on single (text) modality, BadReward is independent of the preference annotation process, enhancing its stealth and practical threat. Extensive experiments on popular T2I models show that BadReward can consistently guide the generation towards improper outputs, such as biased or violent imagery, for targeted concepts. Our findings underscore the amplified threat landscape for RLHF in multi-modal systems, highlighting the urgent need for robust defenses. Disclaimer. This paper contains uncensored toxic content that might be offensive or disturbing to the readers.
[AI-51] A Trustworthiness-based Metaphysics of Artificial Intelligence Systems
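[Quick Read]: This paper addresses the metaphysical identity and persistence of modern artificial intelligence (AI) systems, that is, how to determine when two AI systems are the same and how an AI system retains its identity through change. The orthodox view holds that AI systems, as artifacts, lack well-posed identity and persistence conditions, so their metaphysical kinds are not real kinds. The key to the solution is a theory of the metaphysical identity of AI systems that characterizes their kinds and introduces identity criteria, formal rules that answer these questions. Building on Carrara and Vermaas' account of fine-grained artifact kinds, the paper argues that AI trustworthiness offers a lens for understanding AI system kinds, and it formalizes the identity of these artifacts by relating their functional requirements to their physical make-up. The identity criteria of an AI system are determined by its trustworthiness profile: the set of capabilities the system must uphold throughout its artifact history, and its effectiveness in maintaining them.
Link: https://arxiv.org/abs/2506.03233
Authors: Andrea Ferrario
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: To appear in the proceedings of 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25)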
Abstract:Modern AI systems are man-made objects that leverage machine learning to support our lives across a myriad of contexts and applications. Despite extensive epistemological and ethical debates, their metaphysical foundations remain relatively under explored. The orthodox view simply suggests that AI systems, as artifacts, lack well-posed identity and persistence conditions – their metaphysical kinds are no real kinds. In this work, we challenge this perspective by introducing a theory of metaphysical identity of AI systems. We do so by characterizing their kinds and introducing identity criteria – formal rules that answer the questions “When are two AI systems the same?” and “When does an AI system persist, despite change?” Building on Carrara and Vermaas’ account of fine-grained artifact kinds, we argue that AI trustworthiness provides a lens to understand AI system kinds and formalize the identity of these artifacts by relating their functional requirements to their physical make-ups. The identity criteria of AI systems are determined by their trustworthiness profiles – the collection of capabilities that the systems must uphold over time throughout their artifact histories, and their effectiveness in maintaining these capabilities. Our approach suggests that the identity and persistence of AI systems is sensitive to the socio-technical context of their design and utilization via their trustworthiness, providing a solid metaphysical foundation to the epistemological, ethical, and legal discussions about these artifacts.
[AI-52] NetPress: Dynamically Generated LLM Benchmarks for Network Applications
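[Quick Read]: This paper addresses the inadequate evaluation of large language models (LLMs) and their agents in high-stakes tasks such as network operations: existing evaluations are mostly limited to static, small-scale datasets and cannot capture the reliability that real deployments demand. The key to its solution is NetPress, an automated benchmark generation framework that introduces a unified state-and-action abstraction to dynamically generate diverse query sets with corresponding ground truths, and that integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation of LLM agents across correctness, safety, and latency.
Link: https://arxiv.org/abs/2506.03231
Authors: Yajie Zhou,Jiajun Ruan,Eric S. Wang,Sadjad Fouladi,Francis Y. Yan,Kevin Hsieh,Zaoxing Liu
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: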
Abstract:Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at this https URL.
[AI-53] Bridging Neural ODE and ResNet: A Formal Error Bound for Safety Verification
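[Quick Read]: This paper addresses how to quantify the approximation relationship between neural ordinary differential equations (neural ODEs) and residual networks (ResNets), formalizing their connection through an error bound. The key to its solution is deriving an upper bound on the approximation error between the two related models, so that one model can serve as a verification proxy for the other without running the verification tools twice: if the reachable output set of one model, expanded by the error bound, satisfies a safety property, that property is guaranteed to hold for the other model as well. The approach is reversible, so the initial safety verification can be run on either model.
Link: https://arxiv.org/abs/2506.03227
Authors: Abdelrahman Sayed Sayed,Pierre-Jean Meyer,Mohamed Ghazel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures, Accepted for publication in the proceedings of the 8th International Symposium on AI Verification SAIV 2025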
Abstract:A neural ordinary differential equation (neural ODE) is a machine learning model that is commonly described as a continuous depth generalization of a residual network (ResNet) with a single residual block, or conversely, the ResNet can be seen as the Euler discretization of the neural ODE. These two models are therefore strongly related in a way that the behaviors of either model are considered to be an approximation of the behaviors of the other. In this work, we establish a more formal relationship between these two models by bounding the approximation error between two such related models. The obtained error bound then allows us to use one of the models as a verification proxy for the other, without running the verification tools twice: if the reachable output set expanded by the error bound satisfies a safety property on one of the models, this safety property is then guaranteed to be also satisfied on the other model. This feature is fully reversible, and the initial safety verification can be run indifferently on either of the two models. This novel approach is illustrated on a numerical example of a fixed-point attractor system modeled as a neural ODE.
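The Euler-discretization view mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the scalar vector field `f`, the interval bounds, and the helper `safe_with_margin` are hypothetical, and the `gap` computed below is a pointwise difference on one input, not the formal error bound derived in the paper.

```python
import math

def f(x):
    # Hypothetical vector field standing in for the learned residual function.
    return -0.5 * x

def resnet_block(x):
    # One residual block: x + f(x), i.e. a single Euler step with step size 1.
    return x + f(x)

def euler_flow(x, t_end=1.0, steps=1):
    # Explicit Euler integration of dx/dt = f(x) from t = 0 to t = t_end.
    h = t_end / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

def safe_with_margin(lo, hi, eps, safe_lo, safe_hi):
    # Expand the reachable interval [lo, hi] by eps on both sides and
    # check that it still lies inside the safe interval.
    return safe_lo <= lo - eps and hi + eps <= safe_hi

x0 = 2.0
resnet_out = resnet_block(x0)
ode_out = x0 * math.exp(-0.5)        # exact flow of dx/dt = -0.5 x at t = 1
gap = abs(resnet_out - ode_out)      # pointwise gap, for illustration only
```

With a sound bound `eps` in place of `gap`, a check such as `safe_with_margin(lo, hi, eps, safe_lo, safe_hi)` on one model's reachable interval would carry over to the other model, which is the proxy idea the abstract describes.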
[AI-54] Multiple-Frequencies Population-Based Training
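[Quick Read]: This paper addresses the instability and inefficiency that hyperparameter sensitivity causes in reinforcement learning, in particular the tendency of Population-Based Training (PBT) to get stuck in local optima over long horizons because of its greediness. The key to its solution is a new hyperparameter optimization (HPO) algorithm, Multiple-Frequencies Population-Based Training (MF-PBT), which uses sub-populations that evolve at different frequencies together with a migration mechanism that transfers information between them, balancing short-term and long-term optimization and improving sample efficiency and long-term performance.
Link: https://arxiv.org/abs/2506.03225
Authors: Waël Doulazmi,Auguste Lehuger,Marin Toromanoff,Valentin Charraut,Thibault Buhet,Fabien Moutarde
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at RLC25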
Abstract:Reinforcement Learning’s high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue, among them Population-Based Training (PBT) stands out for its ability to generate hyperparameters schedules instead of fixed configurations. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice of evolution frequency, the rate at which the selection is done. We propose Multiple-Frequencies Population-Based Training (MF-PBT), a novel HPO algorithm that addresses greediness by employing sub-populations, each evolving at distinct frequencies. MF-PBT introduces a migration process to transfer information between sub-populations, with an asymmetric design to balance short and long-term optimization. Extensive experiments on the Brax suite demonstrate that MF-PBT improves sample efficiency and long-term performance, even without actually tuning hyperparameters.
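The selection step of vanilla PBT described in the abstract (rank the agents, then replace the worst with mutations of the best) might be sketched as follows. This is not MF-PBT: it has no sub-populations and no migration. The dictionary layout, the `frac` and `perturb` parameters, and the single `lr` hyperparameter are assumptions made for illustration.

```python
import random

def pbt_step(population, rng, frac=0.25, perturb=(0.8, 1.2)):
    """One selection step: the worst agents copy and mutate the best ones."""
    ranked = sorted(population, key=lambda a: a["score"], reverse=True)
    k = max(1, int(len(ranked) * frac))
    best, worst = ranked[:k], ranked[-k:]
    for agent in worst:
        parent = rng.choice(best)
        # Exploit: copy the parent's weights and score.
        agent["weights"] = parent["weights"]
        agent["score"] = parent["score"]
        # Explore: perturb the parent's hyperparameter by a random factor.
        agent["lr"] = parent["lr"] * rng.choice(perturb)
    return population
```

How often `pbt_step` is called between training phases corresponds to the evolution frequency the abstract discusses: calling it often favors short-term gains, calling it rarely leaves more room for long-term improvement.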
[AI-55] Beware! The AI Act Can Also Apply to Your AI Research Practices
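[Quick Read]: This paper addresses whether the EU Artificial Intelligence Act (AI Act) applies to AI research activities, and in particular the AI research community's limited awareness of its compliance obligations. The key to its solution is an in-depth analysis of the AI Act's scope, which shows that its risk-based regulatory framework may apply broadly to research settings and illustrates this with concrete research examples. The paper also argues that the existing legal exceptions do not adequately reflect current AI research practice, proposes amendments to improve legal certainty, and gives researchers recommendations for reducing compliance risk.
Link: https://arxiv.org/abs/2506.03218
Authors: Alina Wernick,Kristof Meding
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: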
Abstract:The EU has become one of the vanguards in regulating the digital age. A particularly important regulation in the Artificial Intelligence (AI) domain is the EU AI Act, which entered into force in 2024. The AI Act specifies – due to a risk-based approach – various obligations for providers of AI systems. These obligations, for example, include a cascade of documentation and compliance measures, which represent a potential obstacle to science. But do these obligations also apply to AI researchers? This position paper argues that, indeed, the AI Act’s obligations could apply in many more cases than the AI community is aware of. In our analysis of the AI Act and its applicability, we contribute the following: 1.) We give a high-level introduction to the AI Act aimed at non-legal AI research scientists. 2.) We explain with everyday research examples why the AI Act applies to research. 3.) We analyse the exceptions of the AI Act’s applicability and state that especially scientific research exceptions fail to account for current AI research practices. 4.) We propose changes to the AI Act to provide more legal certainty for AI researchers and give two recommendations for AI researchers to reduce the risk of not complying with the AI Act. We see our paper as a starting point for a discussion between policymakers, legal scholars, and AI researchers to avoid unintended side effects of the AI Act on research.
[AI-56] FuXi-Ocean: A Global Ocean Forecasting System with Sub-Daily Resolution
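[Quick Read]: This paper addresses two problems: traditional numerical models are computationally expensive and struggle to stay accurate at fine spatial and temporal resolution, while data-driven methods degrade on sub-daily forecasts because errors accumulate over time. The key to its solution is the FuXi-Ocean model, which combines a context-aware feature extraction module with a predictive network built from stacked attention blocks and introduces its core innovation, the Mixture-of-Time (MoT) module, to adaptively fuse predictions from multiple temporal contexts and thereby mitigate cumulative errors in sequential forecasting.
Link: https://arxiv.org/abs/2506.03210
Authors: Qiusheng Huang,Yuan Niu,Xiaohui Zhong,Anboyu Guo,Lei Chen,Dianjun Zhang,Xuefeng Zhang,Hao Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: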
Abstract:Accurate, high-resolution ocean forecasting is crucial for maritime operations and environmental monitoring. While traditional numerical models are capable of producing sub-daily, eddy-resolving forecasts, they are computationally intensive and face challenges in maintaining accuracy at fine spatial and temporal scales. In contrast, recent data-driven approaches offer improved computational efficiency and emerging potential, yet typically operate at daily resolution and struggle with sub-daily predictions due to error accumulation over time. We introduce FuXi-Ocean, the first data-driven global ocean forecasting model achieving six-hourly predictions at eddy-resolving 1/12° spatial resolution, reaching depths of up to 1500 meters. The model architecture integrates a context-aware feature extraction module with a predictive network employing stacked attention blocks. The core innovation is the Mixture-of-Time (MoT) module, which adaptively integrates predictions from multiple temporal contexts by learning variable-specific reliability, mitigating cumulative errors in sequential forecasting. Through comprehensive experimental evaluation, FuXi-Ocean demonstrates superior skill in predicting key variables, including temperature, salinity, and currents, across multiple depths.
[AI-57] Fingerprinting Deep Learning Models via Network Traffic Patterns in Federated Learning
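[Quick Read]: This paper addresses indirect privacy leakage in Federated Learning (FL) systems through network traffic analysis, that is, identifying the deep learning (DL) model architecture deployed in an FL environment from network-layer traffic alone, without direct access to user data. The key to its solution is using machine learning classifiers (Support Vector Machines, Random Forest, and Gradient Boosting) to fingerprint distinctive patterns in the traffic data. The experiments report accuracy of up to 100%, revealing a notable network-level security vulnerability in FL systems.
Link: https://arxiv.org/abs/2506.03207
Authors: Md Nahid Hasan Shuvo,Moinul Hossain
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 7 pages, 4 Figures, Accepted to publish in Proceedings of the 2025 ACM Workshop on Wireless Security and Machine Learning (WiseML 2025), July 3, 2025, Arlington, VA, USA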
Abstract:Federated Learning (FL) is increasingly adopted as a decentralized machine learning paradigm due to its capability to preserve data privacy by training models without centralizing user data. However, FL is susceptible to indirect privacy breaches via network traffic analysis, an area not explored in existing research. The primary objective of this research is to study the feasibility of fingerprinting deep learning models deployed within FL environments by analyzing their network-layer traffic information. In this paper, we conduct an experimental evaluation using various deep learning architectures (i.e., CNN, RNN) within a federated learning testbed. We utilize machine learning algorithms, including Support Vector Machines (SVM), Random Forest, and Gradient-Boosting, to fingerprint unique patterns within the traffic data. Our experiments show high fingerprinting accuracy, achieving 100% accuracy using Random Forest and around 95.7% accuracy using SVM and Gradient Boosting classifiers. This analysis suggests that we can identify specific architectures running within the subsection of the network traffic. Hence, if an adversary knows about the underlying DL architecture, they can exploit that information and conduct targeted attacks. These findings suggest a notable security vulnerability in FL systems and the necessity of strengthening it at the network level.
[AI-58] Q-ARDNS-Multi: A Multi-Agent Quantum Reinforcement Learning Framework with Meta-Cognitive Adaptation for Complex 3D Environments
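[Quick Read]: This paper addresses cooperative decision-making and efficient navigation for Multi-Agent Reinforcement Learning (MARL) in complex 3D environments. The key to its solution is combining quantum computing, meta-cognitive adaptation, and multi-agent coordination strategies: a 2-qubit quantum circuit with RY gates for action selection, a dual-memory system inspired by human cognition, a shared memory module for agent cooperation, and adaptive exploration strategies modulated by reward variance and intrinsic motivation. Together, these techniques improve the algorithm's success rate, stability, navigation efficiency, and collision avoidance in dynamic environments.
Link: https://arxiv.org/abs/2506.03205
Authors: Umberto Gonçalves de Sousa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures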
Abstract:This paper presents Q-ARDNS-Multi, an advanced multi-agent quantum reinforcement learning (QRL) framework that extends the ARDNS-FN-Quantum model, where Q-ARDNS-Multi stands for “Quantum Adaptive Reward-Driven Neural Simulator - Multi-Agent”. It integrates quantum circuits with RY gates, meta-cognitive adaptation, and multi-agent coordination mechanisms for complex 3D environments. Q-ARDNS-Multi leverages a 2-qubit quantum circuit for action selection, a dual-memory system inspired by human cognition, a shared memory module for agent cooperation, and adaptive exploration strategies modulated by reward variance and intrinsic motivation. Evaluated in a $10 \times 10 \times 3$ GridWorld environment with two agents over 5000 episodes, Q-ARDNS-Multi achieves success rates of 99.6% and 99.5% for Agents 0 and 1, respectively, outperforming Multi-Agent Deep Deterministic Policy Gradient (MADDPG) and Soft Actor-Critic (SAC) in terms of success rate, stability, navigation efficiency, and collision avoidance. The framework records mean rewards of $-304.2891 \pm 756.4636$ and $-295.7622 \pm 752.7103$, averaging 210 steps to goal, demonstrating its robustness in dynamic settings. Comprehensive analyses, including learning curves, reward distributions, statistical tests, and computational efficiency evaluations, highlight the contributions of quantum circuits and meta-cognitive adaptation. By bridging quantum computing, cognitive science, and multi-agent RL, Q-ARDNS-Multi offers a scalable, human-like approach for applications in robotics, autonomous navigation, and decision-making under uncertainty.
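To make the idea of a 2-qubit RY circuit for action selection concrete, here is a minimal sketch that computes measurement probabilities analytically, without a quantum library. It assumes two unentangled qubits, each prepared as RY(θ)|0⟩, and a hypothetical mapping of the four measurement outcomes to four actions; the paper's actual circuit and action mapping may differ.

```python
import math

def ry_probs(theta):
    # Measuring RY(theta)|0> yields 0 with probability cos^2(theta/2)
    # and 1 with probability sin^2(theta/2).
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [c * c, s * s]

def action_distribution(theta0, theta1):
    # Joint outcome probabilities for two independent qubits, ordered
    # 00, 01, 10, 11; each outcome is treated as one action (assumption).
    p0, p1 = ry_probs(theta0), ry_probs(theta1)
    return [p0[a] * p1[b] for a in (0, 1) for b in (0, 1)]
```

In such a setup the rotation angles would play the role of learnable policy parameters: changing `theta0` and `theta1` shifts probability mass between the four actions.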
[AI-59] Fusing Cross-Domain Knowledge from Multimodal Data to Solve Problems in the Physical World
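[Quick Read]: This paper addresses cross-domain multimodal data fusion, that is, how to integrate heterogeneous multi-source data already available in other domains to solve real-world problems in a domain where data is inadequate. The key to its solution is a four-layer framework consisting of a Domains Layer, a Links Layer, a Models Layer, and a Data Layer, which systematically answers three core questions ("what to fuse", "why it can be fused", and "how to fuse") and thereby enables effective alignment and integration of cross-domain knowledge.
Link: https://arxiv.org/abs/2506.03155
Authors: Yu Zheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: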
Abstract:The proliferation of artificial intelligence has enabled a diversity of applications that bridge the gap between digital and physical worlds. As physical environments are too complex to model through a single information acquisition approach, it is crucial to fuse multimodal data generated by different sources, such as sensors, devices, systems, and people, to solve a problem in the real world. Unfortunately, it is neither applicable nor sustainable to deploy new resources to collect original data from scratch for every problem. Thus, when data is inadequate in the domain of problem, it is vital to fuse knowledge from multimodal data that is already available in other domains. We call this cross-domain knowledge fusion. Existing research focus on fusing multimodal data in a single domain, supposing the knowledge from different datasets is intrinsically aligned; however, this assumption may not hold in the scenarios of cross-domain knowledge fusion. In this paper, we formally define the cross-domain multimodal data fusion problem, discussing its unique challenges, differences and advantages beyond data fusion in a single domain. We propose a four-layer framework, consisting of Domains, Links, Models and Data layers, answering three key questions: “what to fuse”, “why can be fused”, and “how to fuse”. The Domains Layer selects relevant data from different domains for a given problem. The Links Layer reveals the philosophy of knowledge alignment beyond specific model structures. The Models Layer provides two knowledge fusion paradigms based on the fundamental mechanisms for processing data. The Data Layer turns data of different structures, resolutions, scales and distributions into a consistent representation that can be fed into an AI model. With this framework, we can design end-to-end solutions that fuse cross-domain multimodal data effectively for solving real-world problems.
[AI-60] LLM Code Customization with Visual Results: A Benchmark on TikZ
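[Quick Read]: This paper addresses the customization of existing code from natural language instructions so that the resulting visual output (such as a figure or image) matches the user's intent, a task that involves feature location, generating valid code variants, and ensuring the modifications align with that intent. The key to its solution is vTikZ, the first benchmark for evaluating the ability of Large Language Models (LLMs) to customize code while preserving coherent visual outcomes. It comprises carefully curated editing scenarios, parameterized ground truths, and a reviewing tool that uses visual feedback to assess correctness.
Link: https://arxiv.org/abs/2505.04670
Authors: Charly Reux(DiverSe),Mathieu Acher(DiverSe),Djamel Eddine Khelladi(DiverSe),Olivier Barais(DiverSe),Clément Quinton(SPIRALS)
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: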
Abstract:With the rise of AI-based code generation, customizing existing code out of natural language instructions to modify visual results, such as figures or images, has become possible, promising to reduce the need for deep programming expertise. However, even experienced developers can struggle with this task, as it requires identifying relevant code regions (feature location), generating valid code variants, and ensuring the modifications reliably align with user intent. In this paper, we introduce vTikZ, the first benchmark designed to evaluate the ability of Large Language Models (LLMs) to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness. Empirical evaluation with state-of-the-art LLMs shows that existing solutions struggle to reliably modify code in alignment with visual intent, highlighting a gap in current AI-assisted code editing approaches. We argue that vTikZ opens new research directions for integrating LLMs with visual feedback mechanisms to improve code customization tasks in various domains beyond TikZ, including image processing, art creation, Web design, and 3D modeling.
[AI-61] StatWhy: Formal Verification Tool for Statistical Hypothesis Testing Programs
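[Quick Read]: This paper addresses the misuse and misinterpretation of statistical methods across scientific fields, which raises serious concerns about the integrity of scientific research. The key to its solution is a tool-assisted method for formally specifying and automatically verifying the correctness of statistical programs. Programmers annotate the source code of a statistical program with the requirements of the methods it uses, which reminds them to check those requirements, including ones that cannot be formally verified, such as the distribution of the unknown true population. Based on these annotations, the software tool StatWhy automatically checks whether the requirements have been properly specified and identifies any missing ones that still need to be addressed. The tool is built on the Why3 platform and verifies the correctness of OCaml programs that perform statistical hypothesis testing.
Link: https://arxiv.org/abs/2405.17492
Authors: Yusuke Kawamoto,Kentaro Kobayashi,Kohei Suenaga
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Accepted to CAV 2025 (the 37th International Conference on Computer Aided Verification)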
Abstract:Statistical methods have been widely misused and misinterpreted in various scientific fields, raising significant concerns about the integrity of scientific research. To mitigate this problem, we propose a tool-assisted method for formally specifying and automatically verifying the correctness of statistical programs. In this method, programmers are required to annotate the source code of the statistical programs with the requirements for these methods. Through this annotation, they are reminded to check the requirements for statistical methods, including those that cannot be formally verified, such as the distribution of the unknown true population. Our software tool StatWhy automatically checks whether programmers have properly specified the requirements for the statistical methods, thereby identifying any missing requirements that need to be addressed. This tool is implemented using the Why3 platform to verify the correctness of OCaml programs that conduct statistical hypothesis testing. We demonstrate how StatWhy can be used to avoid common errors in various statistical hypothesis testing programs.
[AI-62] Plant Bioelectric Early Warning Systems: A Five-Year Investigation into Human-Plant Electromagnetic Communication
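[Quick Read]: This paper investigates the bioelectric response of plants to human presence and emotional states, asking whether plants can sense human emotions and behavior. The key to its approach is combining custom-built plant sensors with machine learning classification: a deep learning model based on the ResNet50 architecture analyzes plant voltage spectrograms to classify human emotional states with high reported accuracy, which the authors present as evidence of a correlation between plant bioelectric signals and human emotional states.
Link: https://arxiv.org/abs/2506.04132
Authors: Peter A. Gloor
Affiliations: Unknown
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Comments: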
Abstract:We present a comprehensive investigation into plant bioelectric responses to human presence and emotional states, building on five years of systematic research. Using custom-built plant sensors and machine learning classification, we demonstrate that plants generate distinct bioelectric signals correlating with human proximity, emotional states, and physiological conditions. A deep learning model based on ResNet50 architecture achieved 97% accuracy in classifying human emotional states through plant voltage spectrograms, while control models with shuffled labels achieved only 30% accuracy. This study synthesizes findings from multiple experiments spanning 2020-2025, including individual recognition (66% accuracy), eurythmic gesture detection, stress prediction, and responses to human voice and movement. We propose that these phenomena represent evolved anti-herbivory early warning systems, where plants detect approaching animals through bioelectric field changes before physical contact. Our results challenge conventional understanding of plant sensory capabilities and suggest practical applications in agriculture, healthcare, and human-plant interaction research.
[AI-63] HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction
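[Quick Read]: This paper addresses the lack of a widely accepted benchmark dataset for predicting high-temperature superconducting materials, which has severely hindered fair comparison between AI algorithms and further progress of these methods. The key to its solution is the HTSC-2025 benchmark dataset, which collects high-temperature superconducting material systems theoretically predicted between 2023 and 2025 on the basis of BCS superconductivity theory. It covers several structure types, including X₂YH₆, perovskite MXH₃, M₃XH₈, cage-like BCN-doped metal atomic systems derived from the structural evolution of LaH₁₀, and two-dimensional honeycomb-structured systems evolving from MgB₂. The dataset has been open-sourced and will be continuously updated.
Link: https://arxiv.org/abs/2506.03837
Authors: Xiao-Qi Han,Ze-Feng Gao,Xin-De Wang,Zhenfeng Ouyang,Peng-Jie Guo,Zhong-Yi Lu
Affiliations: Unknown
Categories: Superconductivity (cond-mat.supr-con); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures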
Abstract:The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence (AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present the HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at this https URL and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.
[AI-64] POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning NEURIPS2025
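[Quick Read]: This paper addresses an inefficiency in the direct imaging of Earth-like exoplanets: the reliance on manually labeled reference stars to extract circumstellar objects such as planets or disks. The key to its solution is the POLARIS dataset, built from the public SPHERE/IRDIS polarized-light archive since 2014, which supports classification of reference star and circumstellar disk images with less than 10% manual labeling. The paper also proposes an unsupervised generative representation learning framework that integrates statistical, generative, and large vision-language models, improving both representational power and performance.
Link: https://arxiv.org/abs/2506.03511
Authors: Fangyi Cao,Bin Ren,Zihao Wang,Shiwei Fu,Youbin Mo,Xiaoyang Liu,Yuzhou Chen,Weixin Yao
Affiliations: Unknown
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI)
Comments: 9 pages main text with 5 figures, 9 pages appendix with 9 figures. Submitted to NeurIPS 2025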
Abstract:With over 1,000,000 images from more than 10,000 exposures using state-of-the-art high-contrast imagers (e.g., Gemini Planet Imager, VLT/SPHERE) in the search for exoplanets, can artificial intelligence (AI) serve as a transformative tool in imaging Earth-like exoplanets in the coming decade? In this paper, we introduce a benchmark and explore this question from a polarimetric image representation learning perspective. Despite extensive investments over the past decade, only a few new exoplanets have been directly imaged. Existing imaging approaches rely heavily on labor-intensive labeling of reference stars, which serve as background to extract circumstellar objects (disks or exoplanets) around target stars. With our POLARIS (POlarized Light dAta for total intensity Representation learning of direct Imaging of exoplanetary Systems) dataset, we classify reference star and circumstellar disk images using the full public SPHERE/IRDIS polarized-light archive since 2014, requiring less than 10 percent manual labeling. We evaluate a range of models including statistical, generative, and large vision-language models and provide baseline performance. We also propose an unsupervised generative representation learning framework that integrates these models, achieving superior performance and enhanced representational power. To our knowledge, this is the first uniformly reduced, high-quality exoplanet imaging dataset, rare in astrophysics and machine learning. By releasing this dataset and baselines, we aim to equip astrophysicists with new tools and engage data scientists in advancing direct exoplanet imaging, catalyzing major interdisciplinary breakthroughs.
[AI-65] A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations INTERSPEECH2025
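[Quick Read]: This paper addresses the difficulty that existing explainability techniques (such as SHAP and LRP) have in providing accurate explanations for audio deepfake detection, where clear ground-truth annotations are lacking. The key to its solution is a data-driven approach that takes the difference between the time-frequency representations of paired real and vocoded audio as the ground-truth explanation, and uses this difference signal to supervise a diffusion model that exposes the deepfake artifacts in a given vocoded audio sample.
Link: https://arxiv.org/abs/2506.03425
Authors: Petr Grinberg,Ankur Kumar,Surya Koppisetti,Gaurav Bharaj
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, accepted at Interspeech 2025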
Abstract:Evaluating explainability techniques, such as SHAP and LRP, in the context of audio deepfake detection is challenging due to lack of clear ground truth annotations. In the cases when we are able to obtain the ground truth, we find that these methods struggle to provide accurate explanations. In this work, we propose a novel data-driven approach to identify artifact regions in deepfake audio. We consider paired real and vocoded audio, and use the difference in time-frequency representation as the ground-truth explanation. The difference signal then serves as a supervision to train a diffusion model to expose the deepfake artifacts in a given vocoded audio. Experimental results on the VocV4 and LibriSeVoc datasets demonstrate that our method outperforms traditional explainability techniques, both qualitatively and quantitatively.
[AI-66] UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
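[Quick Read]: This paper addresses key challenges in protein ligand binding site detection: existing datasets and methods center on individual protein-ligand complexes, which introduces statistical bias; the detection pipeline is discontinuous; and traditional evaluation metrics do not accurately reflect prediction performance. The key to its solution is threefold: UniSite-DS, the first UniProt-centric binding site dataset, which contains more multi-site data and more data overall than previous datasets; UniSite, the first end-to-end detection framework, supervised by a set prediction loss with bijective matching; and Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric.
Link: https://arxiv.org/abs/2506.03237
Authors: Jigang Fan,Quanlin Wu,Shengjie Luo,Liwei Wang
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: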
Abstract:The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at this https URL.
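As a small illustration of the Intersection over Union (IoU) quantity underlying the proposed metric, the sketch below represents a binding site as a set of residue indices. That representation, the helper `is_hit`, and the 0.5 threshold are assumptions made for illustration; the paper's matching procedure and the Average Precision computation are not reproduced here.

```python
def iou(pred, truth):
    # Intersection over Union of two sites given as collections of
    # residue indices (hypothetical representation).
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0

def is_hit(pred, truth, threshold=0.5):
    # A predicted site counts as a match when its IoU with the
    # ground-truth site reaches the threshold.
    return iou(pred, truth) >= threshold
```

For example, a predicted site `{1, 2, 3}` and a true site `{2, 3, 4}` share two residues out of four distinct ones, giving an IoU of 0.5.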
[AI-67] A Pre-trained Framework for Multilingual Brain Decoding Using Non-invasive Recordings
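[Quick Read]: This paper addresses the limited applicability and generalization of current speech decoding methods for brain-computer interfaces (BCIs) in multilingual, multi-subject, and multi-modality settings. The key to its solution is a joint multilingual, multi-subject, and multimodal decoding framework that maps diverse brain recordings into a unified semantic space defined by a pre-trained multilingual model (PMM), enabling decoding across languages, subjects, and neuroimaging modalities. The approach improves generalization and also promotes fairness for underrepresented languages.
Link: https://arxiv.org/abs/2506.03214
Authors: Yi Guo, Yihang Dong, Michael Kwok-Po Ng, Shuqiang Wang
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: None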
Abstract:Brain-computer interfaces (BCIs) with speech decoding from brain recordings have broad application potential in fields such as clinical rehabilitation and cognitive neuroscience. However, current decoding methods remain limited to single-language, single-subject, and single neuroimaging modality settings, restricting their clinical applicability and generalizability. Here we propose a joint multilingual, multi-subject and multimodal decoding framework. It maps diverse brain recordings into a unified semantic space defined by a pre-trained multilingual model (PMM), enabling decoding across multiple languages, multiple subjects and multiple neuroimaging modalities. The proposed framework is validated using non-invasive brain recordings from 159 participants across four languages. Experimental results show that it exhibits strong generalization across multilingual, multi-subject, and multimodal settings. More importantly, the proposed framework can promote linguistic fairness, which is vital for underrepresented languages in BCI applications. The unified semantic space enables cross-lingual mapping enhancement, allowing the framework to boost the decoding performance of underrepresented languages, thereby promoting linguistic fairness. Overall, the proposed framework establishes a new potential paradigm for brain decoding, opening new paths for broader applications of BCI.
[AI-68] Predicting Postoperative Stroke in Elderly SICU Patients: An Interpretable Machine Learning Model Using MIMIC Data
[Quick Read]: This paper addresses postoperative stroke in elderly surgical intensive care unit (SICU) patients, where the core challenge is early risk stratification that supports timely intervention and better clinical outcomes. The key to its solution is an interpretable machine learning (ML) framework built on 19,085 elderly SICU admissions from the MIMIC-III and MIMIC-IV databases, which combines feature engineering with model optimization to predict in-hospital stroke from clinical data recorded in the first 24 hours of ICU stay. The preprocessing pipeline removes high-missingness features, applies iterative singular value decomposition (SVD) imputation, z-score normalization, and one-hot encoding, and corrects class imbalance with the ADASYN algorithm. A two-stage feature selection step combining recursive feature elimination with cross-validation (RFECV) and SHAP yields 20 clinically meaningful predictors, and CatBoost performs best among the evaluated models (AUROC = 0.8868).
Link: https://arxiv.org/abs/2506.03209
Authors: Tinghuan Li, Shuheng Chen, Junyi Fan, Elham Pishgar, Kamiar Alaei, Greg Placencia, Maryam Pishgar
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: None
Abstract: Postoperative stroke remains a critical complication in elderly surgical intensive care unit (SICU) patients, contributing to prolonged hospitalization, elevated healthcare costs, and increased mortality. Accurate early risk stratification is essential to enable timely intervention and improve clinical outcomes. We constructed a combined cohort of 19,085 elderly SICU admissions from the MIMIC-III and MIMIC-IV databases and developed an interpretable machine learning (ML) framework to predict in-hospital stroke using clinical data from the first 24 hours of Intensive Care Unit (ICU) stay. The preprocessing pipeline included removal of high-missingness features, iterative Singular Value Decomposition (SVD) imputation, z-score normalization, one-hot encoding, and class imbalance correction via the Adaptive Synthetic Sampling (ADASYN) algorithm. A two-stage feature selection process, combining Recursive Feature Elimination with Cross-Validation (RFECV) and SHapley Additive exPlanations (SHAP), reduced the initial 80 variables to 20 clinically informative predictors. Among eight ML models evaluated, CatBoost achieved the best performance with an AUROC of 0.8868 (95% CI: 0.8802–0.8937). SHAP analysis and ablation studies identified prior cerebrovascular disease, serum creatinine, and systolic blood pressure as the most influential risk factors. Our results highlight the potential of interpretable ML approaches to support early detection of postoperative stroke and inform decision-making in perioperative critical care.
Machine Learning
[LG-0] A Few Moments Please: Scalable Graphon Learning via Moment Matching
Link: https://arxiv.org/abs/2506.04206
Authors: Reza Ramezanpour, Victor M. Tenorio, Antonio G. Marques, Ashutosh Sabharwal, Santiago Segarra
Categories: Machine Learning (cs.LG)
Comments: None
Abstract: Graphons, as limit objects of dense graph sequences, play a central role in the statistical analysis of network data. However, existing graphon estimation methods often struggle with scalability to large networks and resolution-independent approximation, due to their reliance on estimating latent variables or costly metrics such as the Gromov-Wasserstein distance. In this work, we propose a novel, scalable graphon estimator that directly recovers the graphon via moment matching, leveraging implicit neural representations (INRs). Our approach avoids latent variable modeling by training an INR (mapping coordinates to graphon values) to match empirical subgraph counts (i.e., moments) from observed graphs. This direct estimation mechanism yields a polynomial-time solution and crucially sidesteps the combinatorial complexity of Gromov-Wasserstein optimization. Building on foundational results, we establish a theoretical guarantee: when the observed subgraph motifs sufficiently represent those of the true graphon (a condition met with sufficiently large or numerous graph samples), the estimated graphon achieves a provable upper bound in cut distance from the ground truth. Additionally, we introduce MomentMixup, a data augmentation technique that performs mixup in the moment space to enhance graphon-based learning. Our graphon estimation method achieves strong empirical performance, demonstrating high accuracy on small graphs and superior computational efficiency on large graphs, and it outperforms state-of-the-art scalable estimators in 75% of benchmark settings while matching them in the remaining cases. Furthermore, MomentMixup demonstrated improved graph classification accuracy on the majority of our benchmarks.
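To make the idea of matching subgraph-count moments concrete, here is a minimal sketch. It is not the paper's estimator: it uses only two moments (edge and triangle homomorphism densities, an assumed choice), represents the candidate graphon as a plain Python function rather than an implicit neural representation, and estimates the graphon's moments by Monte Carlo. All function names are hypothetical.

```python
import random


def graph_moments(adj):
    """Edge and triangle homomorphism densities of a graph.

    adj is a square 0/1 adjacency matrix given as a list of lists.
    """
    n = len(adj)
    edges = sum(adj[i][j] for i in range(n) for j in range(n))
    triangles = sum(
        adj[i][j] * adj[j][k] * adj[k][i]
        for i in range(n) for j in range(n) for k in range(n)
    )
    return edges / n**2, triangles / n**3


def graphon_moments(w, samples=20000, seed=0):
    """Monte Carlo estimate of the same two moments for w: [0,1]^2 -> [0,1]."""
    rng = random.Random(seed)
    edge = tri = 0.0
    for _ in range(samples):
        x, y, z = rng.random(), rng.random(), rng.random()
        edge += w(x, y)
        tri += w(x, y) * w(y, z) * w(z, x)
    return edge / samples, tri / samples


def moment_loss(adj, w):
    """Squared distance between observed and candidate moments."""
    g_edge, g_tri = graph_moments(adj)
    w_edge, w_tri = graphon_moments(w)
    return (g_edge - w_edge) ** 2 + (g_tri - w_tri) ** 2
```

A learner in this spirit would adjust the parameters of `w` to drive `moment_loss` toward zero; the sketch stops at evaluating the loss.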
[LG-1] EPiC: Towards Lossless Speedup for Reasoning Training through Edge-Preserving CoT Condensation
Link: https://arxiv.org/abs/2506.04205
Authors: Jinghan Jia, Hadi Reisizadeh, Chongyu Fan, Nathalie Baracaldo, Mingyi Hong, Sijia Liu
Categories: Machine Learning (cs.LG)
Comments: None
Abstract:Large language models (LLMs) have shown remarkable reasoning capabilities when trained with chain-of-thought (CoT) supervision. However, the long and verbose CoT traces, especially those distilled from large reasoning models (LRMs) such as DeepSeek-R1, significantly increase training costs during the distillation process, where a non-reasoning base model is taught to replicate the reasoning behavior of an LRM. In this work, we study the problem of CoT condensation for resource-efficient reasoning training, aimed at pruning intermediate reasoning steps (i.e., thoughts) in CoT traces, enabling supervised model training on length-reduced CoT data while preserving both answer accuracy and the model’s ability to generate coherent reasoning. Our rationale is that CoT traces typically follow a three-stage structure: problem understanding, exploration, and solution convergence. Through empirical analysis, we find that retaining the structure of the reasoning trace, especially the early stage of problem understanding (rich in reflective cues) and the final stage of solution convergence, is sufficient to achieve lossless reasoning supervision. To this end, we propose an Edge-Preserving Condensation method, EPiC, which selectively retains only the initial and final segments of each CoT trace while discarding the middle portion. This design draws an analogy to preserving the “edge” of a reasoning trajectory, capturing both the initial problem framing and the final answer synthesis, to maintain logical continuity. Experiments across multiple model families (Qwen and LLaMA) and benchmarks show that EPiC reduces training time by over 34% while achieving lossless reasoning accuracy on MATH500, comparable to full CoT supervision. To the best of our knowledge, this is the first study to explore thought-level CoT condensation for efficient reasoning model distillation.
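The edge-preserving idea (keep the opening and closing segments of a reasoning trace, drop the middle) can be sketched as follows. How the paper segments a trace into thoughts and how much it keeps are not stated in the abstract, so the blank-line splitting rule, the 25% ratios, and the function names here are all assumptions.

```python
def split_steps(trace):
    """Split a chain-of-thought trace into steps at blank lines (assumed rule)."""
    return [s.strip() for s in trace.split("\n\n") if s.strip()]


def condense_cot(steps, head_ratio=0.25, tail_ratio=0.25):
    """Keep the first and last portions of a list of steps, drop the middle.

    At least one step is kept on each side; traces too short to condense
    are returned unchanged.
    """
    n = len(steps)
    if n == 0:
        return []
    head = max(1, int(n * head_ratio))
    tail = max(1, int(n * tail_ratio))
    if head + tail >= n:
        return list(steps)
    return list(steps[:head]) + list(steps[n - tail:])
```

With eight steps and the default ratios, the condensed trace keeps steps 1, 2, 7, and 8.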
[LG-2] A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series
Link: https://arxiv.org/abs/2506.04204
Authors: Martin Beseda, Vittorio Cortellessa, Daniele Di Pompeo, Luca Traini, Michele Tucci
Categories: Performance (cs.PF); Machine Learning (cs.LG)
Comments: This manuscript is under review by Future Generation Computer Systems
Abstract:This paper addresses the challenge of accurately detecting the transition from the warmup phase to the steady state in performance metric time series, which is a critical step for effective benchmarking. The goal is to introduce a method that avoids premature or delayed detection, which can lead to inaccurate or inefficient performance analysis. The proposed approach adapts techniques from the chemical reactors domain, detecting steady states online through the combination of kernel-based step detection and statistical methods. By using a window-based approach, it provides detailed information and improves the accuracy of identifying phase transitions, even in noisy or irregular time series. Results show that the new approach reduces total error by 14.5% compared to the state-of-the-art method. It offers more reliable detection of the steady-state onset, delivering greater precision for benchmarking tasks. For users, the new approach enhances the accuracy and stability of performance benchmarking, efficiently handling diverse time series data. Its robustness and adaptability make it a valuable tool for real-world performance evaluation, ensuring consistent and reproducible results.
[LG-3] How to Use Graph Data in the Wild to Help Graph Anomaly Detection? KDD2025
Link: https://arxiv.org/abs/2506.04190
Authors: Yuxuan Cao, Jiarong Xu, Chen Zhao, Jiaan Wang, Carl Yang, Chunping Wang, Yang Yang
Categories: Machine Learning (cs.LG)
Comments: Accepted by SIGKDD2025
Abstract: In recent years, graph anomaly detection has found extensive applications in various domains such as social, financial, and communication networks. However, anomalies in graph-structured data present unique challenges, including label scarcity, ill-defined anomalies, and varying anomaly types, making supervised or semi-supervised methods unreliable. Researchers often adopt unsupervised approaches to address these challenges, assuming that anomalies deviate significantly from the normal data distribution. Yet, when the available data is insufficient, capturing the normal distribution accurately and comprehensively becomes difficult. To overcome this limitation, we propose to utilize external graph data (i.e., graph data in the wild) to help anomaly detection tasks. This naturally raises the question: How can we use external data to help graph anomaly detection tasks? To answer this question, we propose a framework called Wild-GAD. It is built upon a unified database, UniWildGraph, which comprises a large and diverse collection of graph data with broad domain coverage, ample data volume, and a unified feature space. Further, we develop selection criteria based on representativity and diversity to identify the most suitable external data for anomaly detection tasks. Extensive experiments on six real-world datasets demonstrate the effectiveness of Wild-GAD. On average, our framework improves AUCROC by 18% and AUCPR by 32% over the best-competing baseline methods.
[LG-4] OpenThoughts: Data Recipes for Reasoning Models
Link: https://arxiv.org/abs/2506.04178
Authors: Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
Categories: Machine Learning (cs.LG)
Comments: this https URL
Abstract:Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. All of our datasets and models are available on this https URL.
[LG-5] Does Prompt Design Impact Quality of Data Imputation by LLMs?
Link: https://arxiv.org/abs/2506.04172
Authors: Shreenidhi Srinivasan, Lydia Manikonda
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: 7 pages
Abstract: Generating realistic synthetic tabular data presents a critical challenge in machine learning. The challenge grows when the data suffer from class imbalance. This paper presents a novel token-aware data imputation method that leverages the in-context learning capabilities of large language models. This is achieved through the combination of a structured group-wise CSV-style prompting technique and the elimination of irrelevant contextual information in the input prompt. We test this approach with two class-imbalanced binary classification datasets and evaluate the effectiveness of imputation using classification-based evaluation metrics. The experimental results demonstrate that our approach significantly reduces the input prompt size while maintaining or improving imputation quality compared to our baseline prompt, especially for relatively small datasets. The contributions of this work are two-fold: 1) it sheds light on the importance of prompt design when leveraging LLMs for synthetic data generation and 2) it addresses a critical gap in LLM-based data imputation for class-imbalanced datasets with missing data by providing a practical solution within computational constraints. We hope that our work will foster further research and discussions about leveraging the incredible potential of LLMs and prompt engineering techniques for synthetic data generation.
[LG-6] N2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion
Link: https://arxiv.org/abs/2506.04166
Authors: Caleb Chin, Aashish Khubchandani, Harshvardhan Maskara, Kyuseong Choi, Jacob Feitelberg, Albert Gong, Manit Paul, Tathagata Sadhukhan, Anish Agarwal, Raaz Dwivedi
Categories: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 21 pages, 6 figures
Abstract: Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces $N^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, $N^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
[LG-7] Faster Approx. Top-K: Harnessing the Full Power of Two Stages
Link: https://arxiv.org/abs/2506.04165
Authors: Yashas Samaga, Varun Yerram, Spandana Raj Babbula, Prateek Jain, Praneeth Netrapalli
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: Rejected from MLSys 2025
Abstract: We consider the Top-$K$ selection problem, which aims to identify the largest $K$ elements from an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, Chern et al. (2022) proposed a fast two-stage approximate Top-$K$ algorithm: (i) partition the input array and select the top-$1$ element from each partition, (ii) sort this smaller subset and return the top $K$ elements. In this paper, we consider a generalized version of this algorithm, where the first stage selects top-$K'$ elements, for some $1 \leq K' \leq K$, from each partition. Our contributions are as follows: (i) we derive an expression for the expected recall of this generalized algorithm and show that choosing $K' > 1$ with fewer partitions in the first stage reduces the input size to the second stage more effectively while maintaining the same expected recall as the original algorithm, (ii) we derive a bound on the expected recall for the original algorithm in Chern et al. (2022) that is provably tighter by a factor of $2$ than the one in that paper, and (iii) we implement our algorithm on Cloud TPUv5e and achieve around an order of magnitude speedups over the original algorithm without sacrificing recall on real-world tasks.
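The generalized two-stage procedure described in the abstract can be sketched in plain Python. This is an illustration of the two stages only, not the authors' TPU implementation: the strided partitioning, the use of `heapq.nlargest`, and the names `approx_top_k` and `recall` are assumptions made for the example.

```python
import heapq


def approx_top_k(values, k, num_partitions, k_prime=1):
    """Two-stage approximate top-k.

    Stage 1: split the input into num_partitions strided buckets and keep
             the k_prime largest elements of each bucket.
    Stage 2: take the exact top-k of the survivors.
    k_prime=1 corresponds to the original two-stage scheme.
    """
    if num_partitions * k_prime < k:
        raise ValueError("need num_partitions * k_prime >= k")
    survivors = []
    for p in range(num_partitions):
        bucket = values[p::num_partitions]   # every num_partitions-th element
        survivors.extend(heapq.nlargest(k_prime, bucket))
    return heapq.nlargest(k, survivors)


def recall(approx, exact):
    """Fraction of the exact top-k recovered (assumes distinct values)."""
    return len(set(approx) & set(exact)) / len(exact)
```

On the toy input `[5, 1, 9, 3, 7, 2, 8, 6]` with `k=2`, four partitions with `k_prime=1` return `[9, 7]` and miss 8, because 9 and 8 share a bucket. Two partitions with `k_prime=2` pass the same number of survivors (four) to the second stage and return `[9, 8]`. This single case only illustrates the trade-off; it does not demonstrate the paper's recall analysis.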
[LG-8] Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems ICML2025
Link: https://arxiv.org/abs/2506.04126
Authors: Yujun Kim, Jaeyoung Cha, Chulhee Yun
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted to ICML 2025, 56 pages, 6 figures
Abstract: Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $\kappa$. In contrast, little is known when $K$ is smaller than $\kappa$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.
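For readers unfamiliar with Incremental Gradient Descent, the following minimal sketch shows the update rule: each epoch visits the component functions in the same fixed order and takes one gradient step per component. The one-dimensional quadratic components and the constant step size are assumptions chosen to keep the example checkable by hand; they are not taken from the paper.

```python
def incremental_gradient_descent(grads, x0, step_size, epochs):
    """Run IGD: one pass over the component gradients per epoch, fixed order."""
    x = x0
    for _ in range(epochs):
        for grad in grads:          # same order every epoch, no reshuffling
            x -= step_size * grad(x)
    return x


# Two components f_i(x) = 0.5 * (x - b_i)^2 with b = 0 and b = 2.
# Their average is minimized at x = 1.
targets = [0.0, 2.0]
component_grads = [lambda x, b=b: x - b for b in targets]

x_final = incremental_gradient_descent(component_grads, x0=0.0,
                                       step_size=0.01, epochs=1000)
```

For this toy problem one epoch maps x to (1 - η)² x + 2η, so with a constant step size η the end-of-epoch iterate settles at 2 / (2 - η) rather than at the true minimizer 1, a small bias caused by the fixed visiting order.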
[LG-9] Guided Speculative Inference for Efficient Test-Time Alignment of LLMs
Link: https://arxiv.org/abs/2506.04118
Authors: Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12 pages, 2 figures
Abstract: We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y \mid x)$. We provably approximate the optimal tilted policy $\pi_{\beta,B}(y \mid x) \propto \pi_B(y \mid x)\exp(\beta\, r(x,y))$ of soft best-of-$n$ under the primary model $\pi_B$. We derive a theoretical bound on the KL divergence between our induced distribution and the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math), our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$. The code is available at this https URL.
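The soft best-of-n building block named in the abstract can be sketched as below: among n candidate responses, one is drawn with probability proportional to exp(β · reward). This covers only that selection step, not GSI itself (the abstract does not give its procedure). The candidates are assumed to have already been sampled from some model, and the function name and arguments are hypothetical.

```python
import math
import random


def soft_best_of_n(candidates, reward, beta, rng=random):
    """Draw one candidate with probability proportional to exp(beta * reward).

    beta = 0 gives a uniform draw; large beta approaches hard best-of-n.
    """
    scores = [beta * reward(c) for c in candidates]
    top = max(scores)                                 # shift for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Subtracting the maximum score before exponentiating leaves the selection probabilities unchanged and avoids overflow when β times the reward is large.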
[LG-10] Crowd-SFT: Crowdsourcing for LLM Alignment
Link: https://arxiv.org/abs/2506.04063
Authors: Alex Sotiropoulos, Sulyab Thottungal Valapu, Linus Lei, Jared Coleman, Bhaskar Krishnamachari
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: None
Abstract: Large Language Models (LLMs) increasingly rely on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to align model responses with human preferences. While RLHF employs a reinforcement learning approach with a separate reward model, SFT uses human-curated datasets for supervised learning. Both approaches traditionally depend on small, vetted groups of annotators, making them costly, prone to bias, and limited in scalability. We propose an open, crowd-sourced fine-tuning framework that addresses these limitations by enabling broader feedback collection for SFT without extensive annotator training. Our framework promotes incentive fairness via a point-based reward system correlated with Shapley values and guides model convergence through iterative model updates. Our multi-model selection framework demonstrates up to a 55% reduction in target distance over single-model selection, enabling subsequent experiments that validate our point-based reward mechanism’s close alignment with Shapley values (a well-established method for attributing individual contributions), thereby supporting fair and scalable participation.
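Since the abstract measures its reward scheme against Shapley values, a short sketch of how exact Shapley values are computed may help. It averages each player's marginal contribution over all orderings, which is only feasible for a handful of players. The value functions in the usage lines are toy assumptions and are unrelated to the paper's point-based reward.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values for a coalition value function.

    players: list of hashable player identifiers
    value:   function mapping a frozenset of players to a number
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Toy additive game: each annotator contributes a fixed amount.
weights = {"a": 1, "b": 2, "c": 3}
additive = shapley_values(list(weights), lambda s: sum(weights[p] for p in s))

# Toy symmetric game: the value depends only on coalition size.
symmetric = shapley_values(["a", "b", "c"], lambda s: len(s) ** 2)
```

In the additive game each player's Shapley value equals their own weight, and in the symmetric game the total value of 9 is split equally.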
[LG-11] Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence
Link: https://arxiv.org/abs/2506.04053
Authors: Alexander Semenenko, Ivan Butakov, Alexey Frolov, Ivan Oseledets
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: None
Abstract:Sliced Mutual Information (SMI) is widely used as a scalable alternative to mutual information for measuring non-linear statistical dependence. Despite its advantages, such as faster convergence, robustness to high dimensionality, and nullification only under statistical independence, we demonstrate that SMI is highly susceptible to data manipulation and exhibits counterintuitive behavior. Through extensive benchmarking and theoretical analysis, we show that SMI saturates easily, fails to detect increases in statistical dependence (even under linear transformations designed to enhance the extraction of information), prioritizes redundancy over informative content, and in some cases, performs worse than simpler dependence measures like the correlation coefficient.
[LG-12] Autonomous Vehicle Lateral Control Using Deep Reinforcement Learning with MPC-PID Demonstration
Link: https://arxiv.org/abs/2506.04040
Authors: Chengdong Wu, Sven Kirchner, Nils Purschke, Alois C. Knoll
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages; Accepted for publication at the 36th IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania, June 22-25, 2025
Abstract: The controller is one of the most important modules in the autonomous driving pipeline, ensuring the vehicle reaches its desired position. This work presents a reinforcement learning based lateral control approach that remains effective despite imperfections in the vehicle models due to measurement errors and simplifications. Our approach ensures comfortable, efficient, and robust control performance considering the interface between controlling and other modules. The controller consists of the conventional Model Predictive Control (MPC)-PID part as the basis and the demonstrator, and the Deep Reinforcement Learning (DRL) part which leverages the online information from the MPC-PID part. The controller’s performance is evaluated in CARLA using the ground truth of the waypoints as inputs. Experimental results demonstrate the effectiveness of the controller when vehicle information is incomplete, and the training of DRL can be stabilized with the demonstration part. These findings highlight the potential to reduce development and integration efforts for autonomous driving pipelines in the future.
[LG-13] On the Usage of Gaussian Process for Efficient Data Valuation
Link: https://arxiv.org/abs/2506.04026
Authors: Clément Bénesse, Patrick Mesana, Athénaïs Gautier, Sébastien Gambs
Categories: Machine Learning (cs.LG)
Comments: None
Abstract: In machine learning, knowing the impact of a given datum on model training is a fundamental task referred to as Data Valuation. Building on previous works from the literature, we have designed a novel canonical decomposition allowing practitioners to analyze any data valuation method as the combination of two parts: a utility function that captures characteristics from a given model and an aggregation procedure that merges such information. We also propose to use Gaussian Processes as a means to easily access the utility function on “sub-models”, which are models trained on a subset of the training set. The strength of our approach stems from both its theoretical grounding in Bayesian theory, and its practical reach, by enabling fast estimation of valuations thanks to efficient update formulae.
[LG-14] Optimal Spiking Brain Compression: Improving One-Shot Post-Training Pruning and Quantization for Spiking Neural Networks
Link: https://arxiv.org/abs/2506.03996
Authors: Lianfeng Shi, Ao Li, Benjamin Ward-Cherrier
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: None
Abstract:Spiking Neural Networks (SNNs) have emerged as a new generation of energy-efficient neural networks suitable for implementation on neuromorphic hardware. As neuromorphic hardware has limited memory and computing resources, weight pruning and quantization have recently been explored to improve SNNs’ efficiency. State-of-the-art SNN pruning/quantization methods employ multiple compression and training iterations, increasing the cost for pre-trained or very large SNNs. In this paper, we propose a new one-shot post-training pruning/quantization framework, Optimal Spiking Brain Compression (OSBC), that adapts the Optimal Brain Compression (OBC) method of [Frantar, Singh, and Alistarh, 2023] for SNNs. Rather than minimizing the loss on neuron input current as OBC does, OSBC achieves more efficient and accurate SNN compression in one pass by minimizing the loss on spiking neuron membrane potential with a small sample dataset. Our experiments on neuromorphic datasets (N-MNIST, CIFAR10-DVS, DVS128-Gesture) demonstrate that OSBC can achieve 97% sparsity through pruning with 1.41%, 10.20%, and 1.74% accuracy loss, or 4-bit symmetric quantization with 0.17%, 1.54%, and 7.71% accuracy loss, respectively. Code will be available on GitHub.
[LG-15] Lower Ricci Curvature for Hypergraphs
Link: https://arxiv.org/abs/2506.03943
Authors: Shiyi Yang, Can Chen, Didong Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: None
Abstract:Networks with higher-order interactions, prevalent in biological, social, and information systems, are naturally represented as hypergraphs, yet their structural complexity poses fundamental challenges for geometric characterization. While curvature-based methods offer powerful insights in graph analysis, existing extensions to hypergraphs suffer from critical trade-offs: combinatorial approaches such as Forman-Ricci curvature capture only coarse features, whereas geometric methods like Ollivier-Ricci curvature offer richer expressivity but demand costly optimal transport computations. To address these challenges, we introduce hypergraph lower Ricci curvature (HLRC), a novel curvature metric defined in closed form that achieves a principled balance between interpretability and efficiency. Evaluated across diverse synthetic and real-world hypergraph datasets, HLRC consistently reveals meaningful higher-order organization, distinguishing intra- from inter-community hyperedges, uncovering latent semantic labels, tracking temporal dynamics, and supporting robust clustering of hypergraphs based on global structure. By unifying geometric sensitivity with algorithmic simplicity, HLRC provides a versatile foundation for hypergraph analytics, with broad implications for tasks including node classification, anomaly detection, and generative modeling in complex systems.
[LG-16] FPGA-Enabled Machine Learning Applications in Earth Observation: A Systematic Review
Link: https://arxiv.org/abs/2506.03938
Authors: Cédric Léonard (1 and 2), Dirk Stober (1), Martin Schulz (1) ((1) Technical University of Munich, Munich, Germany, (2) Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Weßling, Germany)
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 35 pages, 3 figures, 2 tables. Submitted to ACM Computing Surveys (ACM CSUR)
Abstract: New UAV technologies and the NewSpace era are transforming Earth Observation missions and data acquisition. Numerous small platforms generate large data volumes, straining bandwidth and requiring onboard decision-making to transmit high-quality information in time. While Machine Learning allows real-time autonomous processing, FPGAs balance performance with adaptability to mission-specific requirements, enabling onboard deployment. This review systematically analyzes 66 experiments deploying ML models on FPGAs for Remote Sensing applications. We introduce two distinct taxonomies to capture both efficient model architectures and FPGA implementation strategies. For transparency and reproducibility, we follow PRISMA 2020 guidelines and share all data and code at this https URL.
[LG-17] Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study
Link: https://arxiv.org/abs/2506.03931
Authors: Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: None
Abstract: Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation), a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
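The guess-and-check procedure described in the abstract (draw weight settings until one fits the training data) can be sketched generically. The toy model below is a depth-2 scalar factorization, predicting w2 · w1 · x, with standard normal weight draws and a small loss tolerance. These choices, and all names, are assumptions made for illustration and are far simpler than the matrix factorization setting the paper analyzes.

```python
import random


def guess_and_check(train, loss, sample_weights, tol, max_tries=100000, seed=0):
    """Draw random weights until the training loss is at most tol.

    Returns the accepted weights and the number of draws used.
    """
    rng = random.Random(seed)
    for tries in range(1, max_tries + 1):
        weights = sample_weights(rng)
        if loss(weights, train) <= tol:
            return weights, tries
    raise RuntimeError("no fitting weights found within max_tries")


def mse(weights, data):
    """Mean squared error of the scalar model y = w2 * w1 * x."""
    w1, w2 = weights
    return sum((w2 * w1 * x - y) ** 2 for x, y in data) / len(data)


def draw_weights(rng):
    """Independent standard normal draw for (w1, w2)."""
    return rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)


train_set = [(1.0, 2.0)]   # a single training pair: the product w1 * w2 should be near 2
found, num_tries = guess_and_check(train_set, mse, draw_weights, tol=0.01)
```

No gradient is ever computed: the only signal used is whether a random draw already fits the training set, which is what makes the procedure a useful contrast to gradient descent.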
[LG-18] Weisfeiler and Leman Go Gambling: Why Expressive Lottery Tickets Win ICML2025
Link: https://arxiv.org/abs/2506.03919
Authors: Lorenz Kummer,Samir Moustafa,Anatol Ehrlich,Franka Bause,Nikolaus Suess,Wilfried N. Gansterer,Nils M. Kriege
Categories: Machine Learning (cs.LG)
*Comments: Accepted at ICML 2025
Abstract:The lottery ticket hypothesis (LTH) is well-studied for convolutional neural networks but has been validated only empirically for graph neural networks (GNNs), for which theoretical findings are largely lacking. In this paper, we identify the expressivity of sparse subnetworks, i.e. their ability to distinguish non-isomorphic graphs, as crucial for finding winning tickets that preserve the predictive performance. We establish conditions under which the expressivity of a sparsely initialized GNN matches that of the full network, particularly when compared to the Weisfeiler-Leman test, and in that context put forward and prove a Strong Expressive Lottery Ticket Hypothesis. We subsequently show that an increased expressivity in the initialization potentially accelerates model convergence and improves generalization. Our findings establish novel theoretical foundations for both LTH and GNN research, highlighting the importance of maintaining expressivity in sparsely initialized GNNs. We illustrate our results using examples from drug discovery.
[LG-19] Learning equivariant models by discovering symmetries with learnable augmentations
Link: https://arxiv.org/abs/2506.03914
Authors: Eduardo Santos Escriche,Stefanie Jegelka
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Recently, a trend has emerged that favors learning relevant symmetries from data in geometric domains instead of designing constrained architectures. To do so, two popular options are (1) to modify the training protocol, e.g., with a specific loss and data augmentations (soft equivariance), or (2) to ignore equivariance and infer it only implicitly. However, both options have limitations: soft equivariance requires a priori knowledge about relevant symmetries, while inferring symmetries merely via the task and larger data lacks interpretability. To address both limitations, we propose SEMoLA, an end-to-end approach that jointly (1) discovers a priori unknown symmetries in the data via learnable data augmentations, and (2) softly encodes the respective approximate equivariance into an arbitrary unconstrained model. Hence, it does not need prior knowledge about symmetries, it offers interpretability, and it maintains robustness to distribution shifts. Empirically, we demonstrate the ability of SEMoLA to robustly discover relevant symmetries while achieving high prediction accuracy across various datasets, encompassing multiple data modalities and underlying symmetry groups.
[LG-20] Learning Fair And Effective Points-Based Rewards Programs
Link: https://arxiv.org/abs/2506.03911
Authors: Chamsi Hssaine,Yichun Hu,Ciara Pike-Burke
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:
Abstract:Points-based rewards programs are a prevalent way to incentivize customer loyalty; in these programs, customers who make repeated purchases from a seller accumulate points, working toward eventual redemption of a free reward. These programs have recently come under scrutiny due to accusations of unfair practices in their implementation. Motivated by these concerns, we study the problem of fairly designing points-based rewards programs, with a focus on two obstacles that put fairness at odds with their effectiveness. First, due to customer heterogeneity, the seller should set different redemption thresholds for different customers to generate high revenue. Second, the relationship between customer behavior and the number of accumulated points is typically unknown; this requires experimentation which may unfairly devalue customers’ previously earned points. We first show that an individually fair rewards program that uses the same redemption threshold for all customers suffers a loss in revenue of at most a factor of $1+\ln 2$, compared to the optimal personalized strategy that differentiates between customers. We then tackle the problem of designing temporally fair learning algorithms in the presence of demand uncertainty. Toward this goal, we design a learning algorithm that limits the risk of point devaluation due to experimentation by only changing the redemption threshold $O(\log T)$ times, over a horizon of length $T$. This algorithm achieves the optimal (up to polylogarithmic factors) $\widetilde{O}(\sqrt{T})$ regret in expectation. We then modify this algorithm to only ever decrease redemption thresholds, leading to improved fairness at a cost of only a constant factor in regret. Extensive numerical experiments show the limited value of personalization in average-case settings, in addition to demonstrating the strong practical performance of our proposed learning algorithms.
[LG-21] Enhancing Experimental Efficiency in Materials Design: A Comparative Study of Taguchi and Machine Learning Methods
Link: https://arxiv.org/abs/2506.03910
Authors: Shyam Prabhu,P Akshay Kumar,Antov Selwinston,Pavan Taduvai,Shreya Bairi,Rohit Batra
Categories: Machine Learning (cs.LG)
*Comments: 7 pages, 3 figures
Abstract:Materials design problems often require optimizing multiple variables, rendering full factorial exploration impractical. Design of experiment (DOE) methods, such as the Taguchi technique, are commonly used to efficiently sample the design space, but they inherently lack the ability to capture non-linear dependencies among process variables. In this work, we demonstrate how machine learning (ML) methods can be used to overcome these limitations. We compare the performance of the Taguchi method against an active learning based Gaussian process regression (GPR) model in a wire arc additive manufacturing (WAAM) process to accurately predict aspects of bead geometry, including penetration depth, bead width, and height. While the Taguchi method utilized a three-factor, five-level L25 orthogonal array to suggest weld parameters, the GPR model used an uncertainty-based exploration acquisition function coupled with Latin hypercube sampling for initial training data. Accuracy and efficiency of both models were evaluated on 15 test cases, with GPR outperforming Taguchi in both metrics. This work applies to the broader materials processing domain, which requires efficient exploration of complex parameters.
[LG-22] A kernel conditional two-sample test
Link: https://arxiv.org/abs/2506.03898
Authors: Pierre-François Massiani,Christian Fiedler,Lukas Haverbeck,Friedrich Solowjow,Sebastian Trimpe
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 40 pages, 8 figures, 8 tables. Under review
Abstract:We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct conditional two-sample statistical tests. These tests identify the inputs – called covariates in this context – where two conditional expectations differ with high probability. Our key idea is to transform confidence bounds of a learning method into a conditional two-sample test, and we instantiate this principle for kernel ridge regression (KRR) and conditional kernel mean embeddings. We generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds enable circumventing the need for independent data in our statistical tests, since they allow online sampling. We also introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters, making our method readily applicable in practice. Such conditional two-sample tests are especially relevant in applications where data arrive sequentially or non-independently, or when output distributions vary with operational parameters. We demonstrate their utility through examples in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional two-sample testing, from theoretical guarantees to practical implementation, and advance the state-of-the-art on the concentration of vector-valued least squares estimation.
[LG-23] Temporal horizons in forecasting: a performance-learnability trade-off
Link: https://arxiv.org/abs/2506.03889
Authors: Pau Vilimelis Aceituno,Jack William Miller,Noah Marti,Youssef Farag,Victor Boussange
Categories: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*Comments: 33 pages, 12 figures
Abstract:When training autoregressive models for dynamical systems, a critical question arises: how far into the future should the model be trained to predict? Too short a horizon may miss long-term trends, while too long a horizon can impede convergence due to accumulating prediction errors. In this work, we formalize this trade-off by analyzing how the geometry of the loss landscape depends on the training horizon. We prove that for chaotic systems, the loss landscape’s roughness grows exponentially with the training horizon, while for limit cycles, it grows linearly, making long-horizon training inherently challenging. However, we also show that models trained on long horizons generalize well to short-term forecasts, whereas those trained on short horizons suffer exponentially (resp. linearly) worse long-term predictions in chaotic (resp. periodic) systems. We validate our theory through numerical experiments and discuss practical implications for selecting training horizons. Our results provide a principled foundation for hyperparameter optimization in autoregressive forecasting models.
[LG-24] Evaluating Apple Intelligence's Writing Tools for Privacy Against Large Language Model-Based Inference Attacks: Insights from Early Datasets
Link: https://arxiv.org/abs/2506.03870
Authors: Mohd. Farhan Israk Soumik,Syed Mhamudul Hasan,Abdur R. Shahid
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:
Abstract:The misuse of Large Language Models (LLMs) to infer emotions from text for malicious purposes, known as emotion inference attacks, poses a significant threat to user privacy. In this paper, we investigate the potential of Apple Intelligence’s writing tools, integrated across iPhone, iPad, and MacBook, to mitigate these risks through text modifications such as rewriting and tone adjustment. By developing early novel datasets specifically for this purpose, we empirically assess how different text modifications influence LLM-based detection. This capability suggests strong potential for Apple Intelligence’s writing tools as privacy-preserving mechanisms. Our findings lay the groundwork for future adaptive rewriting systems capable of dynamically neutralizing sensitive emotional content to enhance user privacy. To the best of our knowledge, this research provides the first empirical analysis of Apple Intelligence’s text-modification tools within a privacy-preservation context, with the broader goal of developing on-device, user-centric privacy-preserving mechanisms to protect against advanced LLM-based inference attacks on deployed systems.
[LG-25] STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization ICML2025
Link: https://arxiv.org/abs/2506.03863
Authors: Hao Li,Qi Lv,Rui Shao,Xiang Deng,Yinchuan Li,Jianye Hao,Liqiang Nie
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Accepted by ICML 2025 Spotlight
Abstract:Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationship between learned skills. To address these limitations, we present Skill Training with Augmented Rotation (STAR), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ). It encodes relative angles between encoder outputs into the gradient flow via a rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationship between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.
[LG-26] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning ICML2025
Link: https://arxiv.org/abs/2506.03850
Authors: Liang Chen,Xueting Han,Li Shen,Jing Bai,Kam-Fai Wong
Categories: Machine Learning (cs.LG)
*Comments: ICML 2025
Abstract:Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into “vulnerable” and “invulnerable” groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
[LG-27] Revisiting Unbiased Implicit Variational Inference ICML2025
Link: https://arxiv.org/abs/2506.03839
Authors: Tobias Pielok,Bernd Bischl,David Rügamer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Accepted to ICML 2025
Abstract:Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from complex distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. Although unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and show that UIVI’s MCMC loop can be effectively replaced via importance sampling and the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback-Leibler divergence without bias. Our refined approach demonstrates superior performance or parity with state-of-the-art methods on established SIVI benchmarks.
[LG-28] Learning task-specific predictive models for scientific computing
Link: https://arxiv.org/abs/2506.03835
Authors: Jianyuan Yin,Qianxiao Li
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:We consider learning a predictive model to be subsequently used for a given downstream task (described by an algorithm) that requires access to the model evaluation. This task need not be prediction, and this situation is frequently encountered in machine-learning-augmented scientific computing. We show that this setting differs from classical supervised learning, and in general it cannot be solved by minimizing the mean square error of the model predictions as is frequently performed in the literature. Instead, we find that the maximum prediction error on the support of the downstream task algorithm can serve as an effective estimate for the subsequent task performance. With this insight, we formulate a task-specific supervised learning problem based on the given sampling measure, whose solution serves as a reliable surrogate model for the downstream task. Then, we discretize the empirical risk based on training data, and develop an iterative algorithm to solve the task-specific supervised learning problem. Three illustrative numerical examples on trajectory prediction, optimal control and minimum energy path computation demonstrate the effectiveness of the approach.
[LG-29] Survey of Active Learning Hyperparameters: Insights from a Large-Scale Experimental Grid
Link: https://arxiv.org/abs/2506.03817
Authors: Julius Gonsior,Tim Rieß,Anja Reusch,Claudio Hartmann,Maik Thiele,Wolfgang Lehner
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Annotating data is a time-consuming and costly task, but it is inherently required for supervised machine learning. Active Learning (AL) is an established method that minimizes human labeling effort by iteratively selecting the most informative unlabeled samples for expert annotation, thereby improving the overall classification performance. Even though AL has been known for decades, AL is still rarely used in real-world applications. As indicated in the two community web surveys among the NLP community about AL, two main reasons continue to hold practitioners back from using AL: first, the complexity of setting AL up, and second, a lack of trust in its effectiveness. We hypothesize that both reasons share the same culprit: the large hyperparameter space of AL. This mostly unexplored hyperparameter space often leads to misleading and irreproducible AL experiment results. In this study, we first compiled a large hyperparameter grid of over 4.6 million hyperparameter combinations, second, recorded the performance of all combinations in the so-far biggest conducted AL study, and third, analyzed the impact of each hyperparameter in the experiment results. In the end, we give recommendations about the influence of each hyperparameter, demonstrate the surprising influence of the concrete AL strategy implementation, and outline an experimental study design for reproducible AL experiments with minimal computational effort, thus contributing to more reproducible and trustworthy AL research in the future.
[LG-30] Graph Neural Networks for Resource Allocation in Multi-Channel Wireless Networks
Link: https://arxiv.org/abs/2506.03813
Authors: Lili Chen,Changyang She,Jingge Zhu,Jamie Evans
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:
Abstract:As the number of mobile devices continues to grow, interference has become a major bottleneck in improving data rates in wireless networks. Efficient joint channel and power allocation (JCPA) is crucial for managing interference. In this paper, we first propose an enhanced WMMSE (eWMMSE) algorithm to solve the JCPA problem in multi-channel wireless networks. To reduce the computational complexity of iterative optimization, we further introduce JCPGNN-M, a graph neural network-based solution that enables simultaneous multi-channel allocation for each user. We reformulate the problem as a Lagrangian function, which allows us to enforce the total power constraints systematically. Our solution involves combining this Lagrangian framework with GNNs and iteratively updating the Lagrange multipliers and resource allocation scheme. Unlike existing GNN-based methods that limit each user to a single channel, JCPGNN-M supports efficient spectrum reuse and scales well in dense network scenarios. Simulation results show that JCPGNN-M achieves a better data rate than eWMMSE. Meanwhile, the inference time of JCPGNN-M is much lower than that of eWMMSE, and it generalizes well to larger networks.
[LG-31] Learning Equilibria in Matching Games with Bandit Feedback
Link: https://arxiv.org/abs/2506.03802
Authors: Andreas Athanasopoulos,Christos Dimitrakakis
Categories: Machine Learning (cs.LG)
*Comments: 21 pages, 2 figures
Abstract:We investigate the problem of learning an equilibrium in a generalized two-sided matching market, where agents can adaptively choose their actions based on their assigned matches. Specifically, we consider a setting in which matched agents engage in a zero-sum game with initially unknown payoff matrices, and we explore whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of matching equilibrium, where a pair consisting of a matching $\mathfrak{m}$ and a set of agent strategies $X$ forms an equilibrium if no agent has the incentive to deviate from $(\mathfrak{m}, X)$. To measure the deviation of a given pair $(\mathfrak{m}, X)$ from the equilibrium pair $(\mathfrak{m}^\star, X^\star)$, we introduce matching instability that can serve as a regret measure for the corresponding learning problem. We then propose a UCB algorithm in which agents form preferences and select actions based on optimistic estimates of the game payoffs, and prove that it achieves sublinear, instance-independent regret over a time horizon $T$.
[LG-32] From Theory to Practice: Real-World Use Cases on Trustworthy LLM-Driven Process Modeling, Prediction and Automation SIGMOD2025
Link: https://arxiv.org/abs/2506.03801
Authors: Peter Pfeiffer,Alexander Rombach,Maxim Majlatow,Nijat Mehdiyev
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments: Accepted to the Next Gen Data and Process Management: Large Language Models and Beyond workshop at SIGMOD 2025
Abstract:Traditional Business Process Management (BPM) struggles with rigidity, opacity, and scalability in dynamic environments while emerging Large Language Models (LLMs) present transformative opportunities alongside risks. This paper explores four real-world use cases that demonstrate how LLMs, augmented with trustworthy process intelligence, redefine process modeling, prediction, and automation. Grounded in early-stage research projects with industrial partners, the work spans manufacturing, modeling, life-science, and design processes, addressing domain-specific challenges through human-AI collaboration. In manufacturing, an LLM-driven framework integrates uncertainty-aware explainable Machine Learning (ML) with interactive dialogues, transforming opaque predictions into auditable workflows. For process modeling, conversational interfaces democratize BPMN design. Pharmacovigilance agents automate drug safety monitoring via knowledge-graph-augmented LLMs. Finally, sustainable textile design employs multi-agent systems to navigate regulatory and environmental trade-offs. We intend to examine tensions between transparency and efficiency, generalization and specialization, and human agency versus automation. By mapping these trade-offs, we advocate for context-sensitive integration prioritizing domain needs, stakeholder values, and iterative human-in-the-loop workflows over universal solutions. This work provides actionable insights for researchers and practitioners aiming to operationalize LLMs in critical BPM environments.
[LG-33] Attention-Only Transformers via Unrolled Subspace Denoising
Link: https://arxiv.org/abs/2506.03790
Authors: Peng Wang,Yifu Lu,Yaodong Yu,Druv Pai,Qing Qu,Yi Ma
Categories: Machine Learning (cs.LG)
*Comments: 28 pages, 7 figures, 5 tables
Abstract:Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
[LG-34] FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning
Link: https://arxiv.org/abs/2506.03777
Authors: Li Zhang,Zhongxuan Han,Chaochao chen,Xiaohua Feng,Jiaming Zhang,Yuyuan Li
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:With the emerging application of Federated Learning (FL) in decision-making scenarios, it is imperative to regulate model fairness to prevent disparities across sensitive groups (e.g., female, male). Current research predominantly focuses on two concepts of group fairness within FL: Global Fairness (overall model disparity across all clients) and Local Fairness (the disparity within each client). However, the non-decomposable, non-differentiable nature of fairness criteria poses two fundamental, unresolved challenges for fair FL: (i) Harmonizing global and local fairness in multi-class classification; (ii) Enabling a controllable, optimal accuracy-fairness trade-off. To tackle these challenges, we propose a novel controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints in the multi-class case, yielding models with minimal performance decline while guaranteeing fairness. To effectively realize an adjustable, optimal accuracy-fairness balance, we derive specific characterizations of the Bayes-optimal fair classifiers for reformulating fair FL as a personalized cost-sensitive learning problem for in-processing, and as a bi-level optimization problem for post-processing. Theoretically, we provide convergence and generalization guarantees for FedFACT to approach the near-optimal accuracy under given fairness levels. Extensive experiments on multiple datasets across various data heterogeneity settings demonstrate that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.
[LG-35] PPO in the Fisher-Rao geometry
Link: https://arxiv.org/abs/2506.03757
Authors: Razvan-Andrei Lascu,David Šiška,Łukasz Szpruch
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 17 pages
Abstract:Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning, offering a practical policy gradient method with strong empirical performance. Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence. PPO is motivated by Trust Region Policy Optimization (TRPO) that utilizes a surrogate loss with a KL divergence penalty, which arises from linearizing the value function within a flat geometric space. In this paper, we derive a tighter surrogate in the Fisher-Rao (FR) geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO). Our proposed scheme provides strong theoretical guarantees, including monotonic policy improvement. Furthermore, in the tabular setting, we demonstrate that FR-PPO achieves sub-linear convergence without any dependence on the dimensionality of the action or state spaces, marking a significant step toward establishing formal convergence results for PPO-based algorithms.
[LG-36] Dropout-Robust Mechanisms for Differentially Private and Fully Decentralized Mean Estimation
Link: https://arxiv.org/abs/2506.03746
Authors: César Sabater,Sonia Ben Mokhtar,Jan Ramon
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: 23 pages, 4 figures
Abstract:Achieving differentially private computations in decentralized settings poses significant challenges, particularly regarding accuracy, communication cost, and robustness against information leakage. While cryptographic solutions offer promise, they often suffer from high communication overhead or require centralization in the presence of network failures. Conversely, existing fully decentralized approaches typically rely on relaxed adversarial models or pairwise noise cancellation, the latter suffering from substantial accuracy degradation if parties unexpectedly disconnect. In this work, we propose IncA, a new protocol for fully decentralized mean estimation, a widely used primitive in data-intensive processing. Our protocol, which enforces differential privacy, requires no central orchestration and employs low-variance correlated noise, achieved by incrementally injecting sensitive information into the computation. First, we theoretically demonstrate that, when no parties permanently disconnect, our protocol achieves accuracy comparable to that of a centralized setting, which is already an improvement over most existing decentralized differentially private techniques. Second, we empirically show that our use of low-variance correlated noise significantly mitigates the accuracy loss experienced by existing techniques in the presence of dropouts.
[LG-37] Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Link: https://arxiv.org/abs/2506.03725
Authors: Daniil Medyakov,Sergey Stanko,Gleb Molodtsov,Philip Zmushko,Grigoriy Evseev,Egor Petrov,Aleksandr Beznosikov
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 58 pages, 5 figures, 5 tables
Abstract:Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
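For readers unfamiliar with the baseline, plain deterministic Sign-SGD replaces each gradient coordinate by its sign before taking a step. The sketch below is only that textbook baseline on a toy quadratic, with a hand-picked `stepsize`; it is not one of the parameter-free variants the paper proposes, and `sign_sgd`, `grad`, and `target` are hypothetical names.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_sgd(grad_fn, w, stepsize, steps):
    """Run `steps` iterations of w <- w - stepsize * sign(grad_fn(w))."""
    for _ in range(steps):
        g = grad_fn(w)
        # Only one of three values per coordinate is used, which is why the
        # update doubles as a gradient compression scheme.
        w = [wi - stepsize * sign(gi) for wi, gi in zip(w, g)]
    return w


# Toy objective f(w) = 0.5 * ||w - target||^2, whose gradient is w - target.
target = [1.0, -2.0]
grad = lambda w: [wi - ti for wi, ti in zip(w, target)]
w = sign_sgd(grad, [0.0, 0.0], stepsize=0.01, steps=500)
```

With a fixed stepsize the iterate ends up oscillating within roughly one stepsize of the minimizer, which illustrates why the choice of stepsize matters so much for this method.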
[LG-38] On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
Link: https://arxiv.org/abs/2506.03719
Authors: Quentin Bertrand,Anne Gagneux,Mathurin Massias,Rémi Emonet
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Modern deep generative models can now produce high-quality synthetic samples that are often indistinguishable from real training data. A growing body of research aims to understand why recent methods – such as diffusion and flow matching techniques – generalize so effectively. Among the proposed explanations are the inductive biases of deep learning architectures and the stochastic nature of the conditional flow matching loss. In this work, we rule out the latter – the noisy nature of the loss – as a primary contributor to generalization in flow matching. First, we empirically show that in high-dimensional settings, the stochastic and closed-form versions of the flow matching loss yield nearly equivalent losses. Then, using state-of-the-art flow matching models on standard image datasets, we demonstrate that both variants achieve comparable statistical performance, with the surprising observation that using the closed-form can even improve performance.
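To make "closed-form" concrete: when the target distribution is a finite training set, the velocity field that minimizes the flow matching loss can be written down exactly as a softmax-weighted average over training points. The sketch below assumes the common linear path x_t = (1 - t) * x0 + t * x1 with a standard Gaussian source x0 and x1 drawn uniformly from the data; the abstract does not state these choices, and `closed_form_velocity` is a hypothetical name.

```python
import math


def closed_form_velocity(x, t, data):
    """Exact marginal velocity at point x and time t in [0, 1).

    Assumes p_t(x | x1) = N(t * x1, (1 - t)^2 I), so the posterior weight of
    each training point is a softmax over scaled squared distances, and the
    conditional velocity toward training point d is (d - x) / (1 - t).
    """
    s = 1.0 - t
    logits = [
        -sum((xj - t * dj) ** 2 for xj, dj in zip(x, d)) / (2.0 * s * s)
        for d in data
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [
        sum(wi / z * (d[j] - x[j]) for wi, d in zip(weights, data)) / s
        for j in range(len(x))
    ]


# Two symmetric training points: at the midpoint the pulls cancel exactly.
v_mid = closed_form_velocity([0.0, 0.0], 0.5, [[1.0, 0.0], [-1.0, 0.0]])
# A single training point: the field points straight at it.
v_one = closed_form_velocity([0.0, 0.0], 0.0, [[1.0, 2.0]])
```

The stochastic loss regresses on one sampled training point at a time, whereas this expression averages over all of them, which is the distinction the abstract draws between the two variants.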
[LG-39] Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond
Link: https://arxiv.org/abs/2506.03703
Authors: Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, Kun Chen
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Strongly Correlated Electrons (cond-mat.str-el); Computational Physics (physics.comp-ph)
Abstract:Fundamental physics often confronts complex symbolic problems with few guiding exemplars or established principles. While artificial intelligence (AI) offers promise, its typical need for vast datasets to learn from hinders its use in these information-scarce frontiers. We introduce learning at criticality (LaC), a reinforcement learning (RL) scheme that tunes Large Language Models (LLMs) to a sharp learning transition, addressing this information scarcity. At this transition, LLMs achieve peak generalization from minimal data, exemplified by 7-digit base-7 addition – a test of nontrivial arithmetic reasoning. To elucidate this peak, we analyze a minimal concept-network model (CoNet) designed to capture the essence of how LLMs might link tokens. Trained on a single exemplar, this model also undergoes a sharp learning transition. This transition exhibits hallmarks of a second-order phase transition, notably power-law distributed solution path lengths. At this critical point, the system maximizes a "critical thinking pattern" crucial for generalization, enabled by the underlying scale-free exploration. This suggests LLMs reach peak performance by operating at criticality, where such explorative dynamics enable the extraction of underlying operational rules. We demonstrate LaC in quantum field theory: an 8B-parameter LLM, tuned to its critical point by LaC using a few exemplars of symbolic Matsubara sums, solves unseen, higher-order problems, significantly outperforming far larger models. LaC thus leverages critical phenomena, a physical principle, to empower AI for complex, data-sparse challenges in fundamental physics.
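To make the 7-digit base-7 addition task concrete, the sketch below performs digit-wise base-7 addition with carries. It only illustrates the arithmetic being tested; the function name and the string representation are assumptions, and nothing here reflects the paper's training setup.

```python
def add_base7(a: str, b: str) -> str:
    """Add two non-negative base-7 numbers given as digit strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    digits, carry = [], 0
    # Walk from the least significant digit to the most significant one.
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 7)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

result = add_base7("6543210", "1234566")  # "11111106"
```

The result can be cross-checked with Python's built-in base conversion: `int(result, 7)` equals `int("6543210", 7) + int("1234566", 7)`.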
[LG-40] Comprehensive Attribute Encoding and Dynamic LSTM HyperModels for Outcome Oriented Predictive Business Process Monitoring
Link: https://arxiv.org/abs/2506.03696
Authors: Fang Wang, Paolo Ceravolo, Ernesto Damiani
Subjects: Machine Learning (cs.LG)
Abstract:Predictive Business Process Monitoring (PBPM) aims to forecast future outcomes of ongoing business processes. However, existing methods often lack flexibility to handle real-world challenges such as simultaneous events, class imbalance, and multi-level attributes. While prior work has explored static encoding schemes and fixed LSTM architectures, they struggle to support adaptive representations and generalize across heterogeneous datasets. To address these limitations, we propose a suite of dynamic LSTM HyperModels that integrate two-level hierarchical encoding for event and sequence attributes, character-based decomposition of event labels, and novel pseudo-embedding techniques for durations and attribute correlations. We further introduce specialized LSTM variants for simultaneous event modeling, leveraging multidimensional embeddings and time-difference flag augmentation. Experimental validation on four public and real-world datasets demonstrates up to 100% accuracy on balanced datasets and F1 scores exceeding 86% on imbalanced ones. Our approach advances PBPM by offering modular and interpretable models better suited for deployment in complex settings. Beyond PBPM, it contributes to the broader AI community by improving temporal outcome prediction, supporting data heterogeneity, and promoting explainable process intelligence frameworks.
[LG-41] Out-of-Distribution Graph Models Merging
Link: https://arxiv.org/abs/2506.03674
Authors: Yidi Wang, Jiawei Gu, pei Xiaobing, Xubin Zheng, Xiao Luo, Pengyang Wang, Ziyue Qiao
Subjects: Machine Learning (cs.LG)
Abstract:This paper studies a novel problem of out-of-distribution graph models merging, which aims to construct a generalized model from multiple graph models pre-trained on different domains with distribution discrepancy. This problem is challenging because of the difficulty in learning domain-invariant knowledge implicitly in model parameters and consolidating expertise from potentially heterogeneous GNN backbones. In this work, we propose a graph generation strategy that instantiates the mixture distribution of multiple domains. Then, we merge and fine-tune the pre-trained graph models via a MoE module and a masking mechanism for generalized adaptation. Our framework is architecture-agnostic and can operate without any source/target domain data. Both theoretical analysis and experimental results demonstrate the effectiveness of our approach in addressing the model generalization problem.
[LG-42] VCDiag: Classifying Erroneous Waveforms for Failure Triage Acceleration
Link: https://arxiv.org/abs/2506.03590
Authors: Minh Luu, Surya Jasper, Khoi Le, Evan Pan, Michael Quinn, Aakash Tyagi, Jiang Hu
Subjects: Machine Learning (cs.LG)
Abstract:Failure triage in design functional verification is critical but time-intensive, relying on manual specification reviews, log inspections, and waveform analyses. While machine learning (ML) has improved areas like stimulus generation and coverage closure, its application to RTL-level simulation failure triage, particularly for large designs, remains limited. VCDiag offers an efficient, adaptable approach using VCD data to classify failing waveforms and pinpoint likely failure locations. In the largest experiment, VCDiag achieves over 94% accuracy in identifying the top three most likely modules. The framework introduces a novel signal selection and statistical compression approach, achieving over 120x reduction in raw data size while preserving features essential for classification. It can also be integrated into diverse Verilog/SystemVerilog designs and testbenches.
[LG-43] Optimizing FPGA and Wafer Test Coverage with Spatial Sampling and Machine Learning
Link: https://arxiv.org/abs/2506.03556
Authors: Wang WeiQuan, Riaz-ul-Haque Mian
Subjects: Machine Learning (cs.LG)
Abstract:In semiconductor manufacturing, testing costs remain significantly high, especially during wafer and FPGA testing. To reduce the number of required tests while maintaining predictive accuracy, this study investigates three baseline sampling strategies: Random Sampling, Stratified Sampling, and k-means Clustering Sampling. To further enhance these methods, this study proposes a novel algorithm that improves the sampling quality of each approach. This research is conducted using real industrial production data from wafer-level tests and silicon measurements from various FPGAs. This study introduces two hybrid strategies: Stratified with Short Distance Elimination (S-SDE) and k-means with Short Distance Elimination (K-SDE). Their performance is evaluated within the framework of Gaussian Process Regression (GPR) for predicting wafer and FPGA test data. At the core of our proposed approach is the Short Distance Elimination (SDE) algorithm, which excludes spatially proximate candidate points during sampling, thereby ensuring a more uniform distribution of training data across the physical domain. A parameter sweep was conducted over the (alpha, beta) thresholds, where alpha and beta each take values in {0, 1, 2, 3, 4} and are not both zero, to identify the optimal combination that minimizes RMSD. Experimental results on a randomly selected wafer file reveal that (alpha, beta) = (2, 2) yields the lowest RMSD. Accordingly, all subsequent experiments adopt this parameter configuration. The results demonstrate that the proposed SDE-based strategies enhance predictive accuracy: K-SDE improves upon k-means sampling by 16.26 percent (wafer) and 13.07 percent (FPGA), while S-SDE improves upon stratified sampling by 16.49 percent (wafer) and 8.84 percent (FPGA).
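The abstract does not spell out how the (alpha, beta) thresholds are applied, so the following is only a guess at the flavor of a short-distance elimination step. It assumes that a candidate is dropped when it lies within `alpha` of an already kept point along x and within `beta` along y. The function name, this closeness rule, and the coordinates are all hypothetical.

```python
def short_distance_elimination(candidates, alpha, beta):
    """Greedily keep candidates that are not close to any already kept point.

    Assumption: a candidate (x, y) is "close" to a kept point (px, py) when
    |x - px| <= alpha and |y - py| <= beta. The paper's exact rule may differ.
    """
    kept = []
    for x, y in candidates:
        too_close = any(
            abs(x - px) <= alpha and abs(y - py) <= beta for px, py in kept
        )
        if not too_close:
            kept.append((x, y))
    return kept

# Hypothetical die coordinates on a wafer grid, listed in sampling order.
candidates = [(0, 0), (1, 1), (2, 2), (3, 0), (6, 6), (5, 5)]
selected = short_distance_elimination(candidates, alpha=2, beta=2)
```

Under this assumed rule, `selected` keeps `(0, 0)`, `(3, 0)`, and `(6, 6)`, so the retained points are spread out rather than clustered.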
[LG-44] Learning Monotonic Probabilities with a Generative Cost Model
Link: https://arxiv.org/abs/2506.03542
Authors: Yongxiang Tang, Yanhua Cheng, Xiaocheng Liu, Chenchen Jiao, Yanxiang Zeng, Ning Luo, Pengjia Yuan, Xialong Liu, Peng Jiang
Subjects: Machine Learning (cs.LG)
Abstract:In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques, whereas this paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, termed the Generative Cost Model (GCM), which inherently addresses the strict monotonic problem, and propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at this https URL.
[LG-45] Conformal Mixed-Integer Constraint Learning with Feasibility Guarantees
Link: https://arxiv.org/abs/2506.03531
Authors: Daniel Ovalle, Lorenz T. Biegler, Ignacio E. Grossmann, Carl D. Laird, Mateo Dulce Rubio
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:We propose Conformal Mixed-Integer Constraint Learning (C-MICL), a novel framework that provides probabilistic feasibility guarantees for data-driven constraints in optimization problems. While standard Mixed-Integer Constraint Learning methods often violate the true constraints due to model error or data limitations, our C-MICL approach leverages conformal prediction to ensure feasible solutions are ground-truth feasible. This guarantee holds with probability at least $1-\alpha$, under a conditional independence assumption. The proposed framework supports both regression and classification tasks without requiring access to the true constraint function, while avoiding the scalability issues associated with ensemble-based heuristics. Experiments on real-world applications demonstrate that C-MICL consistently achieves target feasibility rates, maintains competitive objective performance, and significantly reduces computational cost compared to existing methods. Our work bridges mathematical optimization and machine learning, offering a principled approach to incorporate uncertainty-aware constraints into decision-making with rigorous statistical guarantees.
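For readers unfamiliar with conformal prediction, the sketch below shows the generic split-conformal calibration step: choosing a threshold on held-out nonconformity scores so that a new score falls below it with probability at least 1 − α under exchangeability. This is the textbook building block, not the C-MICL formulation itself; the function name and the toy scores are assumptions.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, no finite threshold gives the guarantee,
    so infinity is returned.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

# Hypothetical calibration scores, e.g. absolute residuals of a learned constraint model.
calibration_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = conformal_quantile(calibration_scores, alpha=0.2)
```

With ten calibration scores and `alpha=0.2` the rank is 9, so `threshold` is `0.9`. With `alpha=0.05` the required rank (11) exceeds the sample size and the function returns infinity, which signals that more calibration data is needed.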
[LG-46] Path Generation and Evaluation in Video Games: A Nonparametric Statistical Approach
Link: https://arxiv.org/abs/2506.03522
Authors: Daniel Campa, Mehdi Saeedi, Ian Colbert, Srinjoy Das
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 8 pages, 9 figures, Accepted at the IEEE Conference on Games 2025 (IEEE CoG)
Abstract:Navigation path traces play a crucial role in video game design, serving as a vital resource for both enhancing player engagement and fine-tuning non-playable character behavior. Generating such paths with human-like realism can enrich the overall gaming experience, and evaluating path traces can provide game designers insights into player interactions. Despite the impressive recent advancements in deep learning-based generative modeling, the video game industry hesitates to adopt such models for path generation, often citing their complex training requirements and interpretability challenges. To address these problems, we propose a novel path generation and evaluation approach that is grounded in principled nonparametric statistics and provides precise control while offering interpretable insights. Our path generation method fuses two statistical techniques: (1) nonparametric model-free transformations that capture statistical characteristics of path traces through time; and (2) copula models that capture statistical dependencies in space. For path evaluation, we adapt a nonparametric three-sample hypothesis test designed to determine if the generated paths are overfit (mimicking the original data too closely) or underfit (diverging too far from it). We demonstrate the precision and reliability of our proposed methods with empirical analysis on two existing gaming benchmarks to showcase controlled generation of diverse navigation paths. Notably, our novel path generator can be fine-tuned with user controllable parameters to create navigation paths that exhibit varying levels of human-likeness in contrast to those produced by neural network-based agents. The code is available at this https URL.
[LG-47] Directional Non-Commutative Monoidal Embeddings for MNIST
Link: https://arxiv.org/abs/2506.03472
Authors: Mahesh Godavarti
Subjects: Machine Learning (cs.LG)
Abstract:We present an empirical validation of the directional non-commutative monoidal embedding framework recently introduced in prior work \cite{Godavarti2025monoidal}. This framework defines learnable compositional embeddings using distinct non-commutative operators per dimension (axis) that satisfy an interchange law, generalizing classical one-dimensional transforms. Our primary goal is to verify that this framework can effectively model real data by applying it to a controlled, well-understood task: image classification on the MNIST dataset \cite{lecun1998gradient}. A central hypothesis for why the proposed monoidal embedding works well is that it generalizes the Discrete Fourier Transform (DFT) \cite{oppenheim1999discrete} by learning task-specific frequency components instead of using fixed basis frequencies. We test this hypothesis by comparing learned monoidal embeddings against fixed DFT-based embeddings on MNIST. The results show that as the embedding dimensionality decreases (e.g., from 32 to 8 to 2), the performance gap between the learned monoidal embeddings and fixed DFT-based embeddings on MNIST grows increasingly large. This comparison is used as an analytic tool to explain why the framework performs well: the learnable embeddings can capture the most discriminative spectral components for the task. Overall, our experiments confirm that directional non-commutative monoidal embeddings are highly effective for representing image data, offering a compact learned representation that retains high task performance. The code used in this work is available at this https URL.
[LG-48] Differentially Private Distribution Release of Gaussian Mixture Models via KL-Divergence Minimization
Link: https://arxiv.org/abs/2506.03467
Authors: Hang Liu, Anna Scaglione, Sean Peisert
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Gaussian Mixture Models (GMMs) are widely used statistical models for representing multi-modal data distributions, with numerous applications in data mining, pattern recognition, data simulation, and machine learning. However, recent research has shown that releasing GMM parameters poses significant privacy risks, potentially exposing sensitive information about the underlying data. In this paper, we address the challenge of releasing GMM parameters while ensuring differential privacy (DP) guarantees. Specifically, we focus on the privacy protection of mixture weights, component means, and covariance matrices. We propose to use Kullback-Leibler (KL) divergence as a utility metric to assess the accuracy of the released GMM, as it captures the joint impact of noise perturbation on all the model parameters. To achieve privacy, we introduce a DP mechanism that adds carefully calibrated random perturbations to the GMM parameters. Through theoretical analysis, we quantify the effects of privacy budget allocation and perturbation statistics on the DP guarantee, and derive a tractable expression for evaluating KL divergence. We formulate and solve an optimization problem to minimize the KL divergence between the released and original models, subject to a given $(\epsilon, \delta)$-DP constraint. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach achieves strong privacy guarantees while maintaining high utility.
[LG-49] From Average-Iterate to Last-Iterate Convergence in Games: A Reduction and Its Applications
Link: https://arxiv.org/abs/2506.03464
Authors: Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 21 pages
Abstract:The convergence of online learning algorithms in games under self-play is a fundamental question in game theory and machine learning. Among various notions of convergence, last-iterate convergence is particularly desirable, as it reflects the actual decisions made by the learners and captures the day-to-day behavior of the learning dynamics. While many algorithms are known to converge in the average-iterate, achieving last-iterate convergence typically requires considerably more effort in both the design and the analysis of the algorithm. Somewhat surprisingly, we show in this paper that for a large family of games, there exists a simple black-box reduction that transforms the average iterates of an uncoupled learning dynamics into the last iterates of a new uncoupled learning dynamics, thus also providing a reduction from last-iterate convergence to average-iterate convergence. Our reduction applies to games where each player’s utility is linear in both their own strategy and the joint strategy of all opponents. This family includes two-player bimatrix games and generalizations such as multi-player polymatrix games. By applying our reduction to the Optimistic Multiplicative Weights Update algorithm, we obtain new state-of-the-art last-iterate convergence rates for uncoupled learning dynamics in two-player zero-sum normal-form games: (1) an $O(\frac{\log d}{T})$ last-iterate convergence rate under gradient feedback, representing an exponential improvement in the dependence on the dimension $d$ (i.e., the maximum number of actions available to either player); and (2) an $\widetilde{O}(d^{\frac{1}{5}} T^{-\frac{1}{5}})$ last-iterate convergence rate under bandit feedback, improving upon the previous best rates of $\widetilde{O}(\sqrt{d} T^{-\frac{1}{8}})$ and $\widetilde{O}(\sqrt{d} T^{-\frac{1}{6}})$.
[LG-50] A Machine Learning Theory Perspective on Strategic Litigation
Link: https://arxiv.org/abs/2506.03411
Authors: Melissa Dutz, Han Shao, Avrim Blum, Aloni Cohen
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Abstract:Strategic litigation involves bringing a legal case to court with the goal of having a broader impact beyond resolving the case itself: for example, creating precedent which will influence future rulings. In this paper, we explore strategic litigation from the perspective of machine learning theory. We consider an abstract model of a common-law legal system where a lower court decides new cases by applying a decision rule learned from a higher court’s past rulings. In this model, we explore the power of a strategic litigator, who strategically brings cases to the higher court to influence the learned decision rule, thereby affecting future cases. We explore questions including: What impact can a strategic litigator have? Which cases should a strategic litigator bring to court? Does it ever make sense for a strategic litigator to bring a case when they are sure the court will rule against them?
[LG-51] Improving Performance of Spike-based Deep Q-Learning using Ternary Neurons
Link: https://arxiv.org/abs/2506.03392
Authors: Aref Ghoreishee, Abhishek Mishra, John Walsh, Anup Das, Nagarajan Kandasamy
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract:We propose a new ternary spiking neuron model to improve the representation capacity of binary spiking neurons in deep Q-learning. Although a ternary neuron model has recently been introduced to overcome the limited representation capacity offered by the binary spiking neurons, we show that its performance is worse than that of binary models in deep Q-learning tasks. We hypothesize gradient estimation bias during the training process as the underlying potential cause through mathematical and empirical analysis. We propose a novel ternary spiking neuron model to mitigate this issue by reducing the estimation bias. We use the proposed ternary spiking neuron as the fundamental computing unit in a deep spiking Q-learning network (DSQN) and evaluate the network’s performance in seven Atari games from the Gym environment. Results show that the proposed ternary spiking neuron mitigates the drastic performance degradation of ternary neurons in Q-learning tasks and improves the network performance compared to the existing binary neurons, making DSQN a more practical solution for on-board autonomous decision-making tasks.
[LG-52] Product Quantization for Surface Soil Similarity
Link: https://arxiv.org/abs/2506.03374
Authors: Haley Dozier, Althea Henslee, Ashley Abraham, Andrew Strelzoff, Mark Chappell
Subjects: Machine Learning (cs.LG)
Comments: To be published in the CSCE 2022 proceedings
Abstract:The use of machine learning (ML) techniques has allowed rapid advancements in many scientific and engineering fields. One such field is surface soil taxonomy, a research area previously hindered by the reliance on human-derived classifications, which mostly divide a dataset based on historical understandings of that data rather than data-driven, statistically observable similarities. Using an ML-based taxonomy allows soil researchers to move beyond the limitations of human visualization and create classifications of high-dimensional datasets with a much higher level of specificity than is possible with hand-drawn taxonomies. Furthermore, this pipeline allows for the possibility of producing both highly accurate and flexible soil taxonomies with classes built to fit a specific application. The machine learning pipeline outlined in this work combines product quantization with the systematic evaluation of parameters and output to get the best available results, rather than accepting sub-optimal results by using either default settings or best-guess settings.
[LG-53] Probabilistic Factorial Experimental Design for Combinatorial Interventions
Link: https://arxiv.org/abs/2506.03363
Authors: Divya Shyamal, Jiaqi Zhang, Caroline Uhler
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Abstract:A combinatorial intervention, consisting of multiple treatments applied to a single unit with potentially interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce probabilistic factorial experimental design, formalized from how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within an intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\tfrac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\tfrac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function that can be numerically optimized. We also explore several extensions of the design problem and finally validate our findings through simulations.
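The sampling step described in this abstract can be sketched in a few lines: each unit receives treatment j independently with probability equal to that treatment's dosage, which is a draw from a product Bernoulli distribution. The function name, the seed, and the number of units below are illustrative assumptions. The sketch covers only the sampling, not the estimation or the multi-round acquisition function.

```python
import random

def sample_design(dosages, n_units, seed=0):
    """Assign each unit a random treatment combination.

    Treatment j is applied to a unit independently with probability
    dosages[j], so each row is a draw from a product Bernoulli distribution.
    """
    rng = random.Random(seed)
    return [
        [1 if rng.random() < d else 0 for d in dosages]
        for _ in range(n_units)
    ]

# Three treatments at dosage 1/2, the passive-setting choice the abstract
# reports as near-optimal; 8 units is an arbitrary illustrative size.
design = sample_design([0.5, 0.5, 0.5], n_units=8)
```

Each row of `design` is a 0/1 vector marking which treatments a unit received. Dosages of 0 or 1 make the corresponding column deterministic.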
[LG-54] Optimization of Epsilon-Greedy Exploration
Link: https://arxiv.org/abs/2506.03324
Authors: Ethan Che, Hakan Ceylan, James McInerney, Nathan Kallus
Subjects: Machine Learning (cs.LG)
Abstract:Modern recommendation systems rely on exploration to learn user preferences for new items, typically implementing uniform exploration policies (e.g., epsilon-greedy) due to their simplicity and compatibility with machine learning (ML) personalization models. Within these systems, a crucial consideration is the rate of exploration - what fraction of user traffic should receive random item recommendations and how this should evolve over time. While various heuristics exist for navigating the resulting exploration-exploitation tradeoff, selecting optimal exploration rates is complicated by practical constraints including batched updates, time-varying user traffic, short time horizons, and minimum exploration requirements. In this work, we propose a principled framework for determining the exploration schedule based on directly minimizing Bayesian regret through stochastic gradient descent (SGD), allowing for dynamic exploration rate adjustment via Model-Predictive Control (MPC). Through extensive experiments with recommendation datasets, we demonstrate that variations in the batch size across periods significantly influence the optimal exploration strategy. Our optimization methods automatically calibrate exploration to the specific problem setting, consistently matching or outperforming the best heuristic for each setting.
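As background for the exploration-rate question raised here, the sketch below shows plain epsilon-greedy item selection driven by a per-batch exploration schedule. The schedule is set by hand in this example, whereas the paper optimizes it by minimizing Bayesian regret. The function names, scores, and schedule values are assumptions.

```python
import random

def epsilon_greedy(scores, epsilon, rng):
    """Pick an item index: uniformly at random with probability epsilon,
    otherwise the item with the highest score."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])

def run_schedule(scores, schedule, seed=0):
    """Make one recommendation per batch, using that batch's exploration rate.

    `schedule` is a hypothetical hand-set list of epsilon values.
    """
    rng = random.Random(seed)
    return [epsilon_greedy(scores, eps, rng) for eps in schedule]

# With epsilon = 0 in every batch the policy is purely greedy.
choices = run_schedule([0.1, 0.7, 0.2], schedule=[0.0, 0.0, 0.0])
```

A decaying schedule such as `[0.3, 0.1, 0.0]` would explore more in early batches. Choosing those values well under batching and traffic constraints is the problem the abstract addresses.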
[LG-55] Enhancing Automatic PT Tagging for MEDLINE Citations Using Transformer-Based Models
Link: https://arxiv.org/abs/2506.03321
Authors: Victor H. Cid, James Mork
Subjects: Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: 26 pages, 8 tables, 3 figures
Abstract:We investigated the feasibility of predicting Medical Subject Headings (MeSH) Publication Types (PTs) from MEDLINE citation metadata using pre-trained Transformer-based models BERT and DistilBERT. This study addresses limitations in the current automated indexing process, which relies on legacy NLP algorithms. We evaluated monolithic multi-label classifiers and binary classifier ensembles to enhance the retrieval of biomedical literature. Results demonstrate the potential of Transformer models to significantly improve PT tagging accuracy, paving the way for scalable, efficient biomedical indexing.
[LG-56] Budgeted Online Active Learning with Expert Advice and Episodic Priors
Link: https://arxiv.org/abs/2506.03307
Authors: Kristen Goebel, William Solow, Paola Pesantez-Cabrera, Markus Keller, Alan Fern
Subjects: Machine Learning (cs.LG)
Abstract:This paper introduces a novel approach to budgeted online active learning from finite-horizon data streams with extremely limited labeling budgets. In agricultural applications, such streams might include daily weather data over a growing season, and labels require costly measurements of weather-dependent plant characteristics. Our method integrates two key sources of prior information: a collection of preexisting expert predictors and episodic behavioral knowledge of the experts based on unlabeled data streams. Unlike previous research on online active learning with experts, our work simultaneously considers query budgets, finite horizons, and episodic knowledge, enabling effective learning in applications with severely limited labeling capacity. We demonstrate the utility of our approach through experiments on various prediction problems derived from both a realistic agricultural crop simulator and real-world data from multiple grape cultivars. The results show that our method significantly outperforms baseline expert predictions, uniform query selection, and existing approaches that consider budgets and limited horizons but neglect episodic knowledge, even under highly constrained labeling budgets.
[LG-57] Multi-Exit Kolmogorov-Arnold Networks: enhancing accuracy and parsimony
Link: https://arxiv.org/abs/2506.03302
Authors: James Bagrow, Josh Bongard
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments: 14 pages, 7 figures, 2 tables
Abstract:Kolmogorov-Arnold Networks (KANs) uniquely combine high accuracy with interpretability, making them valuable for scientific modeling. However, it is unclear a priori how deep a network needs to be for any given task, and deeper KANs can be difficult to optimize. Here we introduce multi-exit KANs, where each layer includes its own prediction branch, enabling the network to make accurate predictions at multiple depths simultaneously. This architecture provides deep supervision that improves training while discovering the right level of model complexity for each task. Multi-exit KANs consistently outperform standard, single-exit versions on synthetic functions, dynamical systems, and real-world datasets. Remarkably, the best predictions often come from earlier, simpler exits, revealing that these networks naturally identify smaller, more parsimonious and interpretable models without sacrificing accuracy. To automate this discovery, we develop a differentiable “learning to exit” algorithm that balances contributions from exits during training. Our approach offers scientists a practical way to achieve both high performance and interpretability, addressing a fundamental challenge in machine learning for scientific discovery.
[LG-58] On the Necessity of Multi-Domain Explanation: An Uncertainty Principle Approach for Deep Time Series Models
Link: https://arxiv.org/abs/2506.03267
Authors: Shahbaz Rezaei, Avishai Halev, Xin Liu
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Abstract:A prevailing approach to explaining time series models is to generate attributions in the time domain. A recent development in time series XAI is the concept of explanation spaces, where any model trained in the time domain can be interpreted with any existing XAI method in alternative domains, such as frequency. The prevailing approach is to present XAI attributions either in the time domain or in the domain where the attribution is most sparse. In this paper, we demonstrate that in certain cases, XAI methods can generate attributions that highlight fundamentally different features in the time and frequency domains that are not direct counterparts of one another. This suggests that both domains’ attributions should be presented to achieve a more comprehensive interpretation, which shows the necessity of multi-domain explanation. To quantify when such cases arise, we introduce the uncertainty principle (UP), originally developed in quantum mechanics and later studied in harmonic analysis and signal processing, to the XAI literature. This principle establishes a lower bound on how much a signal can be simultaneously localized in both the time and frequency domains. By leveraging this concept, we assess whether attributions in the time and frequency domains violate this bound, indicating that they emphasize distinct features. In other words, UP provides a sufficient condition that the time and frequency domain explanations do not match and, hence, should both be presented to the end user. We validate the effectiveness of this approach across various deep learning models, XAI methods, and a wide range of classification and forecasting datasets. The frequent occurrence of UP violations across various datasets and XAI methods highlights the limitations of existing approaches that focus solely on time-domain explanations. This underscores the need for multi-domain explanations as a new paradigm.
[LG-59] Out-of-Vocabulary Sampling Boosts Speculative Decoding
Link: https://arxiv.org/abs/2506.03206
Authors: Nadav Timor,Jonathan Mamou,Oren Pereg,Hongyang Zhang,David Harel
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Speculative decoding relies on fast and accurate drafters. Recent state-of-the-art language models employ larger and larger vocabularies, which significantly slows down drafters. One promising approach to boost the efficiency of speculative decoding is to use drafters with smaller vocabularies. However, existing sampling methods cannot draw out-of-vocabulary tokens, creating a tradeoff between drafters’ vocabulary size and acceptance rates. This paper introduces Redistributing Drafter Kernels (RDK), the first out-of-vocabulary sampler that effectively recovers acceptance rates by virtually restoring pruned target tokens. RDK leverages token-affinity priors to reallocate drafter mass towards high-overlap regions. We prove mathematically that RDK can achieve higher acceptance rates than vanilla and state-of-the-art samplers. We provide an efficient first-order approximation of RDK and prove that it reduces redistribution times from O(N^2) to O(N), enabling lightweight implementations for large vocabularies. Our experiments demonstrate that this linear-time RDK significantly boosts acceptance rates even after extreme pruning (removing more than 75% of the drafter’s vocabulary), where existing samplers fail. RDK opens the door to extremely pruned drafters, which were previously impractical.
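The tradeoff between drafter vocabulary size and acceptance rate can be illustrated with a small sketch. It assumes the standard speculative sampling rule, under which the expected acceptance probability for target distribution p and drafter distribution q is the sum over tokens of min(p(x), q(x)). It does not implement RDK; the distributions and the function names `acceptance_rate` and `prune_and_renormalize` are hypothetical and chosen only for illustration.

```python
def acceptance_rate(target, drafter):
    """Expected acceptance probability of standard speculative sampling:
    sum over tokens of min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, drafter))


def prune_and_renormalize(dist, keep):
    """Zero out tokens outside `keep` and renormalize the remaining mass."""
    kept_mass = sum(dist[i] for i in keep)
    return [dist[i] / kept_mass if i in keep else 0.0 for i in range(len(dist))]


# Toy 4-token vocabulary (illustrative numbers, not from the paper).
target = [0.4, 0.3, 0.2, 0.1]
full_drafter = [0.35, 0.35, 0.2, 0.1]

# A pruned drafter that can only emit tokens 0 and 1.
pruned_drafter = prune_and_renormalize(full_drafter, keep={0, 1})

full_rate = acceptance_rate(target, full_drafter)
pruned_rate = acceptance_rate(target, pruned_drafter)
print(full_rate, pruned_rate)
```

In this toy setting the pruned drafter's acceptance rate cannot exceed the target mass on the kept tokens, which is the gap an out-of-vocabulary sampler would aim to close.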
[LG-60] Graph Neural Networks for Jamming Source Localization
Link: https://arxiv.org/abs/2506.03196
Authors: Dania Herzalla,Willian T. Lunardi,Martin Andreoni
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Graph-based learning has emerged as a transformative approach for modeling complex relationships across diverse domains, yet its potential in wireless security remains largely unexplored. In this work, we introduce the first application of graph-based learning for jamming source localization, addressing the imminent threat of jamming attacks in wireless networks. Unlike geometric optimization techniques that struggle under environmental uncertainties and dense interference, we reformulate localization as an inductive graph regression task. Our approach integrates structured node representations that encode local and global signal aggregation, ensuring spatial coherence and adaptive signal fusion. To enhance robustness, we incorporate an attention-based graph neural network that adaptively refines neighborhood influence and introduces a confidence-guided estimation mechanism that dynamically balances learned predictions with domain-informed priors. We evaluate our approach under complex radio frequency environments with varying sampling densities and signal propagation conditions, conducting comprehensive ablation studies on graph construction, feature selection, and pooling strategies. Results demonstrate that our novel graph-based learning framework significantly outperforms established localization baselines, particularly in challenging scenarios with sparse and obfuscated signal information. Code is available at this https URL.
[LG-61] Non-collective Calibrating Strategy for Time Series Forecasting IJCAI2025
Link: https://arxiv.org/abs/2506.03176
Authors: Bin Wang,Yongqi Han,Minbo Ma,Tianrui Li,Junbo Zhang,Feng Hong,Yanwei Yu
Subjects: Machine Learning (cs.LG)
Comments: Accepted by IJCAI 2025
Abstract:Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish the rule of thumb for designing the golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to elaborating and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model’s learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting models, regardless of their specific architectures. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1).
[LG-62] Distributionally Robust Wireless Semantic Communication with Large AI Models
Link: https://arxiv.org/abs/2506.03167
Authors: Long Tan Le,Senura Hansaja Wanasekara,Zerun Niu,Yansong Shi,Nguyen H. Tran,Phuong Vo,Walid Saad,Dusit Niyato,Zhu Han,Choong Seon Hong,H. Vincent Poor
Subjects: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Under Review
Abstract:6G wireless systems are expected to support massive volumes of data with ultra-low latency. However, conventional bit-level transmission strategies cannot support the efficiency and adaptability required by modern, data-intensive applications. The concept of semantic communication (SemCom) addresses this limitation by focusing on transmitting task-relevant semantic information instead of raw data. While recent efforts incorporating deep learning and large-scale AI models have improved SemCom’s performance, existing systems remain vulnerable to both semantic-level and transmission-level noise because they often rely on domain-specific architectures that hinder generalizability. In this paper, a novel and generalized semantic communication framework called WaSeCom is proposed to systematically address uncertainty and enhance robustness. In particular, Wasserstein distributionally robust optimization is employed to provide resilience against semantic misinterpretation and channel perturbations. A rigorous theoretical analysis is performed to establish the robust generalization guarantees of the proposed framework. Experimental results on image and text transmission demonstrate that WaSeCom achieves improved robustness under noise and adversarial perturbations. These results highlight its effectiveness in preserving semantic fidelity across varying wireless conditions.
[LG-63] Test-Time Scaling of Diffusion Models via Noise Trajectory Search
Link: https://arxiv.org/abs/2506.03164
Authors: Vignav Ramesh,Morteza Mardani
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The iterative and stochastic nature of diffusion models enables test-time scaling, whereby spending additional compute during denoising generates higher-fidelity samples. Increasing the number of denoising steps is the primary scaling axis, but this yields quickly diminishing returns. Instead, optimizing the noise trajectory (the sequence of injected noise vectors) is promising, as the specific noise realizations critically affect sample quality; but this is challenging due to a high-dimensional search space, complex noise-outcome interactions, and costly trajectory evaluations. We address this by first casting diffusion as a Markov Decision Process (MDP) with a terminal reward, showing tree-search methods such as Monte Carlo tree search (MCTS) to be meaningful but impractical. To balance performance and efficiency, we then resort to a relaxation of MDP, where we view denoising as a sequence of independent contextual bandits. This allows us to introduce an ε-greedy search algorithm that globally explores at extreme timesteps and locally exploits during the intermediate steps where de-mixing occurs. Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation, exceeding baselines by up to 164% and matching/exceeding MCTS performance. To our knowledge, this is the first practical method for test-time noise trajectory optimization of arbitrary (non-differentiable) rewards.
[LG-64] Causal Discovery in Dynamic Fading Wireless Networks
Link: https://arxiv.org/abs/2506.03163
Authors: Oluwaseyi Giwa
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
Comments: 5 pages, 3 figures
Abstract:Dynamic causal discovery in wireless networks is essential due to evolving interference, fading, and mobility, which complicate traditional static causal models. This paper addresses causal inference challenges in dynamic fading wireless environments by proposing a sequential regression-based algorithm with a novel application of the NOTEARS acyclicity constraint, enabling efficient online updates. We derive theoretical lower and upper bounds on the detection delay required to identify structural changes, explicitly quantifying their dependence on network size, noise variance, and fading severity. Monte Carlo simulations validate these theoretical results, demonstrating linear increases in detection delay with network size, quadratic growth with noise variance, and inverse-square dependence on the magnitude of structural changes. Our findings provide rigorous theoretical insights and practical guidelines for designing robust online causal inference mechanisms to maintain network reliability under nonstationary wireless conditions.
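For readers unfamiliar with the NOTEARS acyclicity constraint mentioned above, the following minimal sketch evaluates its usual form, h(W) = tr(exp(W∘W)) − d, which is zero exactly when the weighted adjacency matrix W describes a directed acyclic graph. The truncated power series used for the matrix exponential, the function names, and the example matrices are assumptions made for illustration; the sketch does not reproduce the paper's sequential regression algorithm.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def notears_h(w, terms=30):
    """h(W) = tr(exp(W∘W)) - d, with exp computed by a truncated power series.

    W∘W is the elementwise square of W. The truncation is adequate for the
    small, moderately weighted matrices used in this example.
    """
    d = len(w)
    a = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace_sum = float(d)  # k = 0 term: tr(I) = d
    factorial = 1.0
    for k in range(1, terms + 1):
        power = matmul(power, a)  # power now holds A^k
        factorial *= k
        trace_sum += sum(power[i][i] for i in range(d)) / factorial
    return trace_sum - d


# Chain 0 -> 1 -> 2: acyclic, so h should be zero.
w_dag = [[0.0, 1.5, 0.0],
         [0.0, 0.0, -0.8],
         [0.0, 0.0, 0.0]]

# Adding the edge 2 -> 0 closes a cycle, so h becomes positive.
w_cyclic = [[0.0, 1.5, 0.0],
            [0.0, 0.0, -0.8],
            [0.5, 0.0, 0.0]]

h_dag = notears_h(w_dag)
h_cyclic = notears_h(w_cyclic)
print(h_dag, h_cyclic)
```

Because h is smooth in W, it can be used as a differentiable penalty or equality constraint in continuous optimization, which is what makes it attractive for online updates.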
[LG-65] Safety-Prioritized Reinforcement Learning-Enabled Traffic Flow Optimization in a 3D City-Wide Simulation Environment
Link: https://arxiv.org/abs/2506.03161
Authors: Mira Nuthakki
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, figures at end, methods at end. Format/order can be changed if necessary
Abstract:Traffic congestion and collisions represent significant economic, environmental, and social challenges worldwide. Traditional traffic management approaches have shown limited success in addressing these complex, dynamic problems. To address the current research gaps, three potential tools are developed: a comprehensive 3D city-wide simulation environment that integrates both macroscopic and microscopic traffic dynamics; a collision model; and a reinforcement learning framework with custom reward functions prioritizing safety over efficiency. Unity game engine-based simulation is used for direct collision modeling. A custom reward enabled reinforcement learning method, proximal policy optimization (PPO) model, yields substantial improvements over baseline results, reducing the number of serious collisions, number of vehicle-vehicle collisions, and total distance travelled by over 3 times the baseline values. The model also improves fuel efficiency by 39% and reduces carbon emissions by 88%. Results establish feasibility for city-wide 3D traffic simulation applications incorporating the vision-zero safety principles of the Department of Transportation, including physics-informed, adaptable, realistic collision modeling, as well as appropriate reward modeling for real-world traffic signal light control towards reducing collisions, optimizing traffic flow and reducing greenhouse emissions.
[LG-66] Applying MambaAttention, TabPFN, and TabTransformers to Classify SAE Automation Levels in Crashes
Link: https://arxiv.org/abs/2506.03160
Authors: Shriyank Somvanshi,Anannya Ghosh Tusti,Mahmuda Sultana Mimi,Md Monzurul Islam,Sazzad Bin Bashar Polock,Anandi Dutta,Subasish Das
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The increasing presence of automated vehicles (AVs) presents new challenges for crash classification and safety analysis. Accurately identifying the SAE automation level involved in each crash is essential to understanding crash dynamics and system accountability. However, existing approaches often overlook automation-specific factors and lack model sophistication to capture distinctions between different SAE levels. To address this gap, this study evaluates the performance of three advanced tabular deep learning models (MambaAttention, TabPFN, and TabTransformer) for classifying SAE automation levels using structured crash data from Texas (2024), covering 4,649 cases categorized as Assisted Driving (SAE Level 1), Partial Automation (SAE Level 2), and Advanced Automation (SAE Levels 3-5 combined). Following class balancing using SMOTEENN, the models were trained and evaluated on a unified dataset of 7,300 records. MambaAttention demonstrated the highest overall performance (F1-scores: 88% for SAE 1, 97% for SAE 2, and 99% for SAE 3-5), while TabPFN excelled in zero-shot inference with high robustness for rare crash categories. In contrast, TabTransformer underperformed, particularly in detecting Partial Automation crashes (F1-score: 55%), suggesting challenges in modeling shared human-system control dynamics. These results highlight the capability of deep learning models tailored for tabular data to enhance the accuracy and efficiency of automation-level classification. Integrating such models into crash analysis frameworks can support policy development, AV safety evaluation, and regulatory decisions, especially in distinguishing high-risk conditions for mid- and high-level automation technologies.
[LG-67] Bayes Error Rate Estimation in Difficult Situations
Link: https://arxiv.org/abs/2506.03159
Authors: Lesley Wheat,Martin v. Mohrenschildt,Saeid Habibi
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 21 pages, 13 figures, 20 tables
Abstract:The Bayes Error Rate (BER) is the fundamental limit on the achievable generalizable classification accuracy of any machine learning model due to inherent uncertainty within the data. BER estimators offer insight into the difficulty of any classification problem and set expectations for optimal classification performance. In order to be useful, the estimators must also be accurate with a limited number of samples on multivariate problems with unknown class distributions. To determine which estimators meet the minimum requirements for “usefulness”, an in-depth examination of their accuracy is conducted using Monte Carlo simulations with synthetic data in order to obtain their confidence bounds for binary classification. To examine the usability of the estimators on real-world applications, new test scenarios are introduced upon which 2500 Monte Carlo simulations per scenario are run over a wide range of BER values. In a comparison of k-Nearest Neighbor (kNN), Generalized Henze-Penrose (GHP) divergence and Kernel Density Estimation (KDE) techniques, results show that kNN is overwhelmingly the more accurate non-parametric estimator. In order to reach the target of an under 5 percent range for the 95 percent confidence bounds, the minimum number of required samples per class is 1000. As more features are added, more samples are needed, so that 2500 samples per class are required at only 4 features. Other estimators do become more accurate than kNN as more features are added, but continuously fail to meet the target range.
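To make the notion of a Bayes Error Rate concrete, here is a minimal sketch for one synthetic case where the BER is known in closed form: two equally likely classes with unit-variance Gaussian features whose means differ by `separation`. The optimal rule thresholds at the midpoint, giving BER = Φ(−separation/2). The setup, function names, and sample size are assumptions for illustration and are not the test scenarios or estimators (kNN, GHP, KDE) studied in the paper.

```python
import math
import random


def bayes_error_two_gaussians(separation):
    """Exact BER for equal priors and classes N(0, 1) vs N(separation, 1).

    Equals Phi(-separation / 2), written with the complementary error function.
    """
    return 0.5 * math.erfc(separation / (2.0 * math.sqrt(2.0)))


def monte_carlo_error(separation, n_samples, seed=0):
    """Empirical error of the optimal midpoint-threshold rule on synthetic data."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        label = rng.randrange(2)                 # equal class priors
        x = rng.gauss(label * separation, 1.0)   # class-conditional sample
        predicted = 1 if x > separation / 2.0 else 0
        errors += predicted != label
    return errors / n_samples


exact_ber = bayes_error_two_gaussians(2.0)
simulated_ber = monte_carlo_error(2.0, n_samples=20000)
print(exact_ber, simulated_ber)
```

A BER estimator would have to recover a value close to `exact_ber` from samples alone, without knowing the class distributions, which is what makes limited-sample accuracy the central question.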
[LG-68] Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL
Link: https://arxiv.org/abs/2506.03154
Authors: Zhaoyang Chen,Cody Fleming
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Classifier-free guidance has shown strong potential in diffusion-based reinforcement learning. However, existing methods rely on joint training of the guidance module and the diffusion model, which can be suboptimal during the early stages when the guidance is inaccurate and provides noisy learning signals. In offline RL, guidance depends solely on offline data: observations, actions, and rewards, and is independent of the policy module’s behavior, suggesting that joint training is not required. This paper proposes modular training methods that decouple the guidance module from the diffusion model, based on three key findings: (1) Guidance Necessity: We explore how the effectiveness of guidance varies with the training stage and algorithm choice, uncovering the roles of guidance and diffusion. A lack of good guidance in the early stage presents an opportunity for optimization. (2) Guidance-First Diffusion Training: We introduce a method where the guidance module is first trained independently as a value estimator, then frozen to guide the diffusion model using classifier-free reward guidance. This modularization reduces memory usage, improves computational efficiency, and enhances both sample efficiency and final performance. (3) Cross-Module Transferability: Applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized score variance (e.g., reducing IQR by 86%). We show that guidance modules trained with one algorithm (e.g., IDQL) can be directly reused with another (e.g., DQL), with no additional training required, demonstrating baseline-level performance as well as strong modularity and transferability. We provide theoretical justification and empirical validation on bullet D4RL benchmarks. Our findings suggest a new paradigm for offline RL: modular, reusable, and composable training pipelines.
[LG-69] What Makes Treatment Effects Identifiable? Characterizations and Estimators Beyond Unconfoundedness COLT
Link: https://arxiv.org/abs/2506.04194
Authors: Yang Cai,Alkis Kalavasis,Katerina Mamali,Anay Mehrotra,Manolis Zampetakis
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Accepted for presentation at the 38th Conference on Learning Theory (COLT) 2025
Abstract:Most of the widely used estimators of the average treatment effect (ATE) in causal inference rely on the assumptions of unconfoundedness and overlap. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, many types of studies frequently violate unconfoundedness or overlap, for instance, observational studies with deterministic treatment decisions – popularly known as Regression Discontinuity designs – violate overlap. In this paper, we initiate the study of general conditions that enable the identification of the average treatment effect, extending beyond unconfoundedness and overlap. In particular, following the paradigm of statistical learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the average treatment effect on the treated (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, under mild assumptions on the data distributions, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms. 
[LG-70] Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness
Link: https://arxiv.org/abs/2506.04193
Authors: Stephen R. Pfohl,Natalie Harris,Chirag Nagpal,David Madras,Vishwali Mhasawade,Olawale Salaudeen,Awa Dieng,Shannon Sequeira,Santiago Arciniegas,Lillian Sung,Nnamdi Ezeanochie,Heather Cole-Lewis,Katherine Heller,Sanmi Koyejo,Alexander D’Amour
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to predict metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation.
[LG-71] Estimation of the reduced density matrix and entanglement entropies using autoregressive networks
Link: https://arxiv.org/abs/2506.04170
Authors: Piotr Białas,Piotr Korcyl,Tomasz Stebel,Dawid Zapolski
Subjects: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat); High Energy Physics - Theory (hep-th)
Comments: 9 pages, 7 figures
Abstract:We present an application of autoregressive neural networks to Monte Carlo simulations of quantum spin chains using the correspondence with classical two-dimensional spin systems. We use a hierarchy of neural networks capable of estimating conditional probabilities of consecutive spins to evaluate elements of reduced density matrices directly. Using the Ising chain as an example, we calculate the continuum limit of the ground state’s von Neumann and Rényi bipartite entanglement entropies of an interval built of up to 5 spins. We demonstrate that our architecture is able to estimate all the needed matrix elements with just a single training for a fixed time discretization and lattice volume. Our method can be applied to other types of spin chains, possibly with defects, as well as to estimating entanglement entropies of thermal states at non-zero temperature.
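Once the reduced density matrix is available, both entropies named above follow from its eigenvalues λ_i: the von Neumann entropy is −Σ λ_i ln λ_i and the Rényi entropy of order α is ln(Σ λ_i^α)/(1 − α). The sketch below applies these standard definitions to hand-picked spectra; it does not reproduce the autoregressive network estimation itself, and the function names are hypothetical.

```python
import math


def von_neumann_entropy(eigenvalues):
    """S = -sum(p * ln p) over the nonzero eigenvalues of the reduced density matrix."""
    return -sum(p * math.log(p) for p in eigenvalues if p > 0.0)


def renyi_entropy(eigenvalues, alpha):
    """S_alpha = ln(sum(p ** alpha)) / (1 - alpha); alpha = 1 falls back to von Neumann."""
    if alpha == 1:
        return von_neumann_entropy(eigenvalues)
    return math.log(sum(p ** alpha for p in eigenvalues if p > 0.0)) / (1.0 - alpha)


# Spectrum of a maximally mixed single spin: both entropies equal ln 2.
mixed = [0.5, 0.5]
s_vn_mixed = von_neumann_entropy(mixed)
s_2_mixed = renyi_entropy(mixed, alpha=2)

# A weakly entangled spectrum: the Renyi-2 entropy lies below the von Neumann entropy.
weak = [0.9, 0.1]
s_vn_weak = von_neumann_entropy(weak)
s_2_weak = renyi_entropy(weak, alpha=2)
print(s_vn_mixed, s_2_mixed, s_vn_weak, s_2_weak)
```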
[LG-72] chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations
Link: https://arxiv.org/abs/2506.04055
Authors: Paul Fuchs,Weilong Chen,Stephan Thaler,Julija Zavadlav
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: Source code available at: this https URL
Abstract:Machine learning potentials (MLPs) have advanced rapidly and show great promise to transform molecular dynamics (MD) simulations. However, most existing software tools are tied to specific MLP architectures, lack integration with standard MD packages, or are not parallelizable across GPUs. To address these challenges, we present chemtrain-deploy, a framework that enables model-agnostic deployment of MLPs in LAMMPS. chemtrain-deploy supports any JAX-defined semi-local potential, allowing users to exploit the functionality of LAMMPS and perform large-scale MLP-based MD simulations on multiple GPUs. It achieves state-of-the-art efficiency and scales to systems containing millions of atoms. We validate its performance and scalability using graph neural network architectures, including MACE, Allegro, and PaiNN, applied to a variety of systems, such as liquid-vapor interfaces, crystalline materials, and solvated peptides. Our results highlight the practical utility of chemtrain-deploy for real-world, high-performance simulations and provide guidance for MLP architecture selection and future design.
[LG-73] Similarity-based fuzzy clustering scientific articles: potentials and challenges from mathematical and computational perspectives
Link: https://arxiv.org/abs/2506.04045
Authors: Vu Thi Huong,Ida Litzel,Thorsten Koch
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Fuzzy clustering, which allows an article to belong to multiple clusters with soft membership degrees, plays a vital role in analyzing publication data. This problem can be formulated as a constrained optimization model, where the goal is to minimize the discrepancy between the similarity observed from data and the similarity derived from a predicted distribution. While this approach benefits from leveraging state-of-the-art optimization algorithms, tailoring them to work with real, massive databases like OpenAlex or Web of Science - containing about 70 million articles and a billion citations - poses significant challenges. We analyze potentials and challenges of the approach from both mathematical and computational perspectives. Among other things, second-order optimality conditions are established, providing new theoretical insights, and practical solution methods are proposed by exploiting the structure of the problem. Specifically, we accelerate the gradient projection method using GPU-based parallel computing to efficiently handle large-scale data.
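A gradient projection iteration of the kind mentioned above can be sketched as follows, under the assumption that each article's membership vector must lie on the probability simplex (nonnegative entries summing to one). The projection uses the well-known sort-based algorithm. The quadratic objective, step size, and function names are placeholders for illustration; the paper's actual similarity-discrepancy objective and GPU implementation are not reproduced here.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1} (sort-based method)."""
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u in enumerate(sorted_desc, start=1):
        cumulative += u
        candidate = (cumulative - 1.0) / j
        if u - candidate > 0.0:
            theta = candidate  # keep the threshold from the largest valid j
    return [max(x - theta, 0.0) for x in v]


def projected_gradient_step(x, grad, step_size):
    """One iteration: gradient step followed by projection back onto the simplex."""
    return project_to_simplex([xi - step_size * gi for xi, gi in zip(x, grad)])


# Placeholder objective: squared distance to a fixed membership vector.
reference = [0.7, 0.2, 0.1]
membership = [1.0 / 3.0] * 3
for _ in range(50):
    gradient = [2.0 * (m - r) for m, r in zip(membership, reference)]
    membership = projected_gradient_step(membership, gradient, step_size=0.25)
print(membership)
```

Each article's projection is independent of the others, which is one reason this kind of method parallelizes well across many membership vectors.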
[LG-74] A Generic Branch-and-Bound Algorithm for $\ell_0$-Penalized Problems with Supplementary Material
Link: https://arxiv.org/abs/2506.03974
Authors: Clément Elvira,Théo Guyard,Cédric Herzet
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We present a generic Branch-and-Bound procedure designed to solve L0-penalized optimization problems. Existing approaches primarily focus on quadratic losses and construct relaxations using “Big-M” constraints and/or L2-norm penalties. In contrast, our method accommodates a broader class of loss functions and allows greater flexibility in relaxation design through a general penalty term, encompassing existing techniques as special cases. We establish theoretical results ensuring that all key quantities required for the Branch-and-Bound implementation admit closed-form expressions under the general blanket assumptions considered in our work. Leveraging this framework, we introduce El0ps, an open-source Python solver with a plug-and-play workflow that enables user-defined losses and penalties in L0-penalized problems. Through extensive numerical experiments, we demonstrate that El0ps achieves state-of-the-art performance on classical instances and extends computational feasibility to previously intractable ones.
[LG-75] Algorithm- and Data-Dependent Generalization Bounds for Score-Based Generative Models
Link: https://arxiv.org/abs/2506.03849
Authors: Benjamin Dupuis,Dario Shariatian,Maxime Haddouche,Alain Durmus,Umut Simsekli
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical performance. In the latter case, bounds have been derived, under various metrics, between the true data distribution and the distribution induced by the SGM, often demonstrating polynomial convergence rates with respect to the number of training samples. However, these approaches adopt a largely approximation theory viewpoint, which tends to be overly pessimistic and relatively coarse. In particular, they fail to fully explain the empirical success of SGMs or capture the role of the optimization algorithm used in practice to train the score network. To support this observation, we first present simple experiments illustrating the concrete impact of optimization hyperparameters on the generalization ability of the generated distribution. Then, this paper aims to bridge this theoretical gap by providing the first algorithmic- and data-dependent generalization analysis for SGMs. In particular, we establish bounds that explicitly account for the optimization dynamics of the learning algorithm, offering new insights into the generalization behavior of SGMs. Our theoretical findings are supported by empirical results on several datasets.
[LG-76] Spatially Resolved Meteorological and Ancillary Data in Central Europe for Rainfall Streamflow Modeling
Link: https://arxiv.org/abs/2506.03819
Authors: Marc Aurel Vischer,Noelia Otero,Jackie Ma
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 6 pages, 1 figure
Abstract:We present a dataset for rainfall streamflow modeling that is fully spatially resolved with the aim of taking neural network-driven hydrological modeling beyond lumped catchments. To this end, we compiled data covering five river basins in central Europe: upper Danube, Elbe, Oder, Rhine, and Weser. The dataset contains meteorological forcings, as well as ancillary information on soil, rock, land cover, and orography. The data is harmonized to a regular 9 km × 9 km grid and contains daily values that span from October 1981 to September 2011. We also provide code to further combine our dataset with publicly available river discharge data for end-to-end rainfall streamflow modeling.
[LG-77] Geoff: The Generic Optimization Framework Frontend for Particle Accelerator Controls
Link: https://arxiv.org/abs/2506.03796
Authors: Penelope Madysa,Sabrina Appel,Verena Kain,Michael Schenk
Subjects: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures. Submitted to SoftwareX
Abstract:Geoff is a collection of Python packages that form a framework for automation of particle accelerator controls. With particle accelerator laboratories around the world researching machine learning techniques to improve accelerator performance and uptime, a multitude of approaches and algorithms have emerged. The purpose of Geoff is to harmonize these approaches and to minimize friction when comparing or migrating between them. It provides standardized interfaces for optimization problems, utility functions to speed up development, and a reference GUI application that ties everything together. Geoff is an open-source library developed at CERN and maintained and updated in collaboration between CERN and GSI as part of the EURO-LABS project. This paper gives an overview of Geoff’s design, features, and current usage.
[LG-78] High-Dimensional Learning in Finance
Link: https://arxiv.org/abs/2506.03780
Authors: Hasan Fallahgoul
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
Abstract:Recent advances in machine learning have shown promising results for financial prediction using large, over-parameterized models. This paper provides theoretical foundations and empirical validation for understanding when and how these methods achieve predictive success. I examine three key aspects of high-dimensional learning in finance. First, I prove that within-sample standardization in Random Fourier Features implementations fundamentally alters the underlying Gaussian kernel approximation, replacing shift-invariant kernels with training-set dependent alternatives. Second, I derive sample complexity bounds showing when reliable learning becomes information-theoretically impossible under weak signal-to-noise ratios typical in finance. Third, VC-dimension analysis reveals that ridgeless regression’s effective complexity is bounded by sample size rather than nominal feature dimension. Comprehensive numerical validation confirms these theoretical predictions, revealing systematic breakdown of claimed theoretical properties across realistic parameter ranges. These results show that when sample size is small and features are high-dimensional, observed predictive success is necessarily driven by low-complexity artifacts, not genuine high-dimensional learning.
[LG-79] Towards Quantum Operator-Valued Kernels
Link: https://arxiv.org/abs/2506.03779
Authors: Hachem Kadri,Joachim Tomasi,Yuka Hashimoto,Sandrine Anthoine
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Quantum kernels are reproducing kernel functions built using quantum-mechanical principles and are studied with the aim of outperforming their classical counterparts. The enthusiasm for quantum kernel machines has been tempered by recent studies that have suggested that quantum kernels could not offer speed-ups when learning on classical data. However, most of the research in this area has been devoted to scalar-valued kernels in standard classification or regression settings for which classical kernel methods are efficient and effective, leaving very little room for improvement with quantum kernels. This position paper argues that quantum kernel research should focus on more expressive kernel classes. We build upon recent advances in operator-valued kernels, and propose guidelines for investigating quantum kernels. This should help to design a new generation of quantum kernel machines and fully explore their potentials.
[LG-80] Infinitesimal Higher-Order Spectral Variations in Rectangular Real Random Matrices
Link: https://arxiv.org/abs/2506.03764
Authors: Róisín Luo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:We present a theoretical framework for deriving the general n-th order Fréchet derivatives of singular values in real rectangular matrices, by leveraging reduced resolvent operators from Kato’s analytic perturbation theory for self-adjoint operators. Deriving closed-form expressions for higher-order derivatives of singular values is notoriously challenging through standard matrix-analysis techniques. To overcome this, we treat a real rectangular matrix as a compact operator on a finite-dimensional Hilbert space, and embed the rectangular matrix into a block self-adjoint operator so that non-symmetric perturbations are captured. Applying Kato’s asymptotic eigenvalue expansion to this construction, we obtain a general, closed-form expression for the infinitesimal n-th order spectral variations. Specializing to n = 2 and deploying on a Kronecker-product representation with matrix convention yield the Hessian of a singular value, not found in the literature. By bridging abstract operator-theoretic perturbation theory with matrices, our framework equips researchers with a practical toolkit for higher-order spectral sensitivity studies in random matrix applications (e.g., adversarial perturbation in deep learning).
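To make the "block self-adjoint operator" step concrete, the following is a sketch of the standard Jordan-Wielandt embedding together with the well-known first-order variation of a simple singular value. It assumes this is the kind of embedding the abstract refers to; the paper's own construction and notation may differ, and the symbols $H$, $E$, $u_i$, $v_i$ are introduced here for illustration only.

```latex
% Assumption: standard Jordan--Wielandt embedding of A into a symmetric block matrix.
\[
H(A) \;=\; \begin{pmatrix} 0 & A \\ A^{\top} & 0 \end{pmatrix}
\in \mathbb{R}^{(m+n)\times(m+n)}, \qquad A \in \mathbb{R}^{m\times n}.
\]
% With a compact SVD A = \sum_{i=1}^{r} \sigma_i u_i v_i^{\top}, r = rank(A):
\[
H(A)\,\frac{1}{\sqrt{2}}\begin{pmatrix} u_i \\ \pm v_i \end{pmatrix}
\;=\; \pm\,\sigma_i\,\frac{1}{\sqrt{2}}\begin{pmatrix} u_i \\ \pm v_i \end{pmatrix},
\qquad i = 1,\dots,r,
\]
% and the remaining m + n - 2r eigenvalues of H(A) are zero.
% First-order variation of a simple singular value \sigma_i > 0 in direction E:
\[
\sigma_i(A + tE) \;=\; \sigma_i(A) \;+\; t\,u_i^{\top} E\, v_i \;+\; O(t^{2}).
\]
```

The first-order term follows from applying the usual symmetric eigenvalue perturbation formula to $H(A + tE) = H(A) + tH(E)$ with the unit eigenvector $\tfrac{1}{\sqrt{2}}(u_i;\, v_i)$; the higher-order terms are what the paper derives.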
[LG-81] RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations
Link: https://arxiv.org/abs/2506.03697
Authors: Swagat Kumar,Jan-Nico Zaech,Colin Michael Wilmott,Luc Van Gool
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 24 pages, 16 figures
Abstract:Variational Quantum Algorithms (VQAs) are a promising approach for leveraging powerful Noisy Intermediate-Scale Quantum (NISQ) computers. When applied to machine learning tasks, VQAs give rise to NISQ-compatible Quantum Neural Networks (QNNs), which have been shown to outperform classical neural networks with a similar number of trainable parameters. While the quantum circuit structures of VQAs for physics simulations are determined by the physical properties of the systems, identifying effective QNN architectures for general machine learning tasks is a difficult challenge due to the lack of domain-specific priors. Indeed, existing Quantum Architecture Search (QAS) algorithms, adaptations of classical neural architecture search techniques, often overlook the inherent quantum nature of the circuits they produce. By approaching QAS from the ground-up and from a quantum perspective, we resolve this limitation by proposing ρDARTS, a differentiable QAS algorithm that models the search process as the evolution of a quantum mixed state, emerging from the search space of quantum architectures. We validate our method by finding circuits for state initialization, Hamiltonian optimization, and image classification. Further, we demonstrate better convergence against existing QAS techniques and show improved robustness levels to noise.
[LG-82] Latent Guided Sampling for Combinatorial Optimization
Link: https://arxiv.org/abs/2506.03672
Authors: Sobihan Surendran(LPSM (UMR_8001), SU),Adeline Fermanian,Sylvain Le Corff(LPSM (UMR_8001), SU)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Combinatorial Optimization problems are widespread in domains such as logistics, manufacturing, and drug discovery, yet their NP-hard nature makes them computationally challenging. Recent Neural Combinatorial Optimization methods leverage deep learning to learn solution strategies, trained via Supervised or Reinforcement Learning (RL). While promising, these approaches often rely on task-specific augmentations, perform poorly on out-of-distribution instances, and lack robust inference mechanisms. Moreover, existing latent space models either require labeled data or rely on pre-trained policies. In this work, we propose LGS-Net, a novel latent space model that conditions on problem instances, and introduce an efficient inference method, Latent Guided Sampling (LGS), based on Markov Chain Monte Carlo and Stochastic Approximation. We show that the iterations of our method form a time-inhomogeneous Markov Chain and provide rigorous theoretical convergence guarantees. Empirical results on benchmark routing tasks show that our method achieves state-of-the-art performance among RL-based approaches.
[LG-83] Position: There Is No Free Bayesian Uncertainty Quantification NEURIPS2025
Link: https://arxiv.org/abs/2506.03670
Authors: Ivan Melev,Goeran Kauermann
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: NeurIPS 2025 preprint, frequentist critique of Bayesian UQ
Abstract:Due to their intuitive appeal, Bayesian methods of modeling and uncertainty quantification have become popular in modern machine and deep learning. When providing a prior distribution over the parameter space, it is straightforward to obtain a distribution over the parameters that is conventionally interpreted as uncertainty quantification of the model. We challenge the validity of such Bayesian uncertainty quantification by discussing the equivalent optimization-based representation of Bayesian updating, provide an alternative interpretation that is coherent with the optimization-based perspective, propose measures of the quality of the Bayesian inferential stage, and suggest directions for future work.
[LG-84] SubSearch: Robust Estimation and Outlier Detection for Stochastic Block Models via Subgraph Search
Link: https://arxiv.org/abs/2506.03657
Authors: Leonardo Martins Bianco(LMO),Christine Keribin(LMO),Zacharie Naulet(INRAE, MaIAGE)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Community detection is a fundamental task in graph analysis, with methods often relying on fitting models like the Stochastic Block Model (SBM) to observed networks. While many algorithms can accurately estimate SBM parameters when the input graph is a perfect sample from the model, real-world graphs rarely conform to such idealized assumptions. Therefore, robust algorithms are crucial: ones that can recover model parameters even when the data deviates from the assumed distribution. In this work, we propose SubSearch, an algorithm for robustly estimating SBM parameters by exploring the space of subgraphs in search of one that closely aligns with the model’s assumptions. Our approach also functions as an outlier detection method, properly identifying nodes responsible for the graph’s deviation from the model and going beyond simple techniques like pruning high-degree nodes. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method.
[LG-85] BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing INTERSPEECH2025
Link: https://arxiv.org/abs/2506.03515
Authors: Masaya Kawamura,Takuya Hasumi,Yuma Shirahata,Ryuichi Yamamoto
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Comments: Accepted to INTERSPEECH 2025
Abstract:This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58-bit. In this case, most of the 32-bit model parameters are quantized to the ternary values {-1, 0, 1}. Second, we propose a method named weight indexing. In this method, we save a group of 1.58-bit weights as a single int8 index. This allows for efficient storage of model parameters, even on hardware that treats values in units of 8-bit. Experimental results demonstrate that the proposed method achieved an 83% reduction in model size, while outperforming the baseline of similar model size without quantization in synthesis quality.
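The weight-indexing idea can be illustrated with a small sketch. It assumes a group size of five ternary weights per byte (3^5 = 243 patterns fit in 8 bits) and an unsigned byte index; the abstract does not state the group size or the index layout, so `GROUP`, `pack_ternary`, and `unpack_ternary` are hypothetical names and choices, not the authors' implementation.

```python
GROUP = 5  # assumed group size: 3**5 = 243 distinct patterns fit in one byte


def pack_ternary(weights):
    """Pack ternary weights (-1, 0, 1) into bytes, GROUP weights per byte."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary")
    # Pad with zeros so the length is a multiple of GROUP.
    padded = list(weights) + [0] * (-len(weights) % GROUP)
    out = bytearray()
    for start in range(0, len(padded), GROUP):
        index = 0
        # The first weight of the group becomes the least significant base-3 digit.
        for w in reversed(padded[start:start + GROUP]):
            index = index * 3 + (w + 1)  # map {-1, 0, 1} to digits {0, 1, 2}
        out.append(index)
    return bytes(out)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from the packed bytes."""
    weights = []
    for index in packed:
        for _ in range(GROUP):
            index, digit = divmod(index, 3)
            weights.append(digit - 1)
    return weights[:count]


example = [1, -1, 0, 0, 1, -1, 1]
packed_example = pack_ternary(example)
restored_example = unpack_ternary(packed_example, len(example))
```

Under this assumed layout each weight costs 8/5 = 1.6 bits, close to the log2(3) ≈ 1.585 bits that a ternary value carries.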
[LG-86] Models of Heavy-Tailed Mechanistic Universality
Link: https://arxiv.org/abs/2506.03470
Authors: Liam Hodgkinson,Zhichao Wang,Michael W. Mahoney
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 40 pages, 4 figures
Abstract:Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of heavy-tailed mechanistic universality (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models – the high-temperature Marchenko-Pastur (HTMP) ensemble – to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power laws on (upper and lower) tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an “eigenvalue repulsion” parameter. Implications of our model on other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.
[LG-87] Investigating Quantum Feature Maps in Quantum Support Vector Machines for Lung Cancer Classification
Link: https://arxiv.org/abs/2506.03272
Authors: My Youssef El Hafidi,Achraf Toufah,Mohamed Achraf Kadim
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures
Abstract:In recent years, quantum machine learning has emerged as a promising intersection between quantum physics and artificial intelligence, particularly in domains requiring advanced pattern recognition such as healthcare. This study investigates the effectiveness of Quantum Support Vector Machines (QSVM), which leverage quantum mechanical phenomena like superposition and entanglement to construct high-dimensional Hilbert spaces for data classification. Focusing on lung cancer diagnosis, a concrete and critical healthcare application, we analyze how different quantum feature maps influence classification performance. Using a real-world dataset of 309 patient records with significant class imbalance (39 non-cancer vs. 270 cancer cases), we constructed six balanced subsets for robust evaluation. QSVM models were implemented using Qiskit and executed on the qasm simulator, employing three distinct quantum feature maps: ZFeatureMap, ZZFeatureMap, and PauliFeatureMap. Performance was assessed using accuracy, precision, recall, specificity, and F1-score. Results show that the PauliFeatureMap consistently outperformed the others, achieving perfect classification in three subsets and strong performance overall. These findings demonstrate how quantum computational principles can be harnessed to enhance diagnostic capabilities, reinforcing the importance of physics-based modeling in emerging AI applications within healthcare.
[LG-88] Quantum Cognition Machine Learning for Forecasting Chromosomal Instability
Link: https://arxiv.org/abs/2506.03199
Authors: Giuseppe Di Caro,Vahagn Kirakosyan,Alexander G. Abanov,Luca Candelori,Nadine Hartmann,Ernest T. Lam,Kharen Musaelian,Ryan Samson,Dario Villani,Martin T. Wells,Richard J. Wenstrup,Mengjia Xu
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Abstract:The accurate prediction of chromosomal instability from the morphology of circulating tumor cells (CTCs) enables real-time detection of CTCs with high metastatic potential in the context of liquid biopsy diagnostics. However, it presents a significant challenge due to the high dimensionality and complexity of single-cell digital pathology data. Here, we introduce the application of Quantum Cognition Machine Learning (QCML), a quantum-inspired computational framework, to estimate morphology-predicted chromosomal instability in CTCs from patients with metastatic breast cancer. QCML leverages quantum mechanical principles to represent data as state vectors in a Hilbert space, enabling context-aware feature modeling, dimensionality reduction, and enhanced generalization without requiring curated feature selection. QCML outperforms conventional machine learning methods when tested on out of sample verification CTCs, achieving higher accuracy in identifying predicted large-scale state transitions (pLST) status from CTC-derived morphology features. These preliminary findings support the application of QCML as a novel machine learning tool with superior performance in high-dimensional, low-sample-size biomedical contexts. QCML enables the simulation of cognition-like learning for the identification of biologically meaningful prediction of chromosomal instability from CTC morphology, offering a novel tool for CTC classification in liquid biopsy.
[LG-89] UniSim: A Unified Simulator for Time-Coarsened Dynamics of Biomolecules ICML2025
Link: https://arxiv.org/abs/2506.03157
Authors: Ziyang Yu,Wenbing Huang,Yang Liu
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: ICML 2025 poster
Abstract:Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar molecular systems. Therefore, we propose Unified Simulator (UniSim), which leverages cross-domain knowledge to enhance the understanding of atomic interactions. First, we employ a multi-head pretraining approach to learn a unified atomic representation model from a large and diverse set of molecular data. Then, based on the stochastic interpolant framework, we learn the state transition patterns over long timesteps from MD trajectories, and introduce a force guidance module for rapidly adapting to different chemical environments. Our experiments demonstrate that UniSim achieves highly competitive performance across small molecules, peptides, and proteins.
[LG-90] Why Regression? Binary Encoding Classification Brings Confidence to Stock Market Index Price Prediction
Link: https://arxiv.org/abs/2506.03153
Authors: Junzhe Jiang,Chang Yang,Xinrun Wang,Bo Li
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Abstract:Stock market indices serve as fundamental market measurements that quantify systematic market dynamics. However, accurate index price prediction remains challenging, primarily because existing approaches treat indices as isolated time series and frame the prediction as a simple regression task. These methods fail to capture indices’ inherent nature as aggregations of constituent stocks with complex, time-varying interdependencies. To address these limitations, we propose Cubic, a novel end-to-end framework that explicitly models the adaptive fusion of constituent stocks for index price prediction. Our main contributions are threefold. i) Fusion in the latent space: we introduce the fusion mechanism over the latent embedding of the stocks to extract the information from the vast number of stocks. ii) Binary encoding classification: since regression tasks are challenging due to continuous value estimation, we reformulate the regression into the classification task, where the target value is converted to binary and we optimize the prediction of the value of each digit with cross-entropy loss. iii) Confidence-guided prediction and trading: we introduce the regularization loss to address market prediction uncertainty for the index prediction and design the rule-based trading policies based on the confidence. Extensive experiments across multiple stock markets and indices demonstrate that Cubic consistently outperforms state-of-the-art baselines in stock index prediction tasks, achieving superior performance on both forecasting accuracy metrics and downstream trading profitability.
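The binary encoding step (contribution ii) can be sketched as follows. This is a minimal illustration that assumes the target has already been quantized to a non-negative integer and uses a fixed code length; `NUM_BITS`, `encode_bits`, `decode_bits`, and `bitwise_cross_entropy` are hypothetical names, and the paper's actual quantization and loss weighting are not given in the abstract.

```python
import math

NUM_BITS = 16  # assumed code length; not taken from the paper


def encode_bits(value, num_bits=NUM_BITS):
    """Return the bits of a non-negative integer, most significant first."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value out of range for the chosen code length")
    return [(value >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def decode_bits(probs):
    """Threshold per-bit probabilities at 0.5 and rebuild the integer."""
    value = 0
    for p in probs:
        value = (value << 1) | (1 if p >= 0.5 else 0)
    return value


def bitwise_cross_entropy(probs, bits, eps=1e-12):
    """Mean binary cross-entropy over the bit positions."""
    total = 0.0
    for p, b in zip(probs, bits):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= b * math.log(p) + (1 - b) * math.log(1.0 - p)
    return total / len(bits)


target_bits = encode_bits(5432)
decoded = decode_bits([float(b) for b in target_bits])
loss = bitwise_cross_entropy([0.9, 0.1], [1, 0])
```

Each bit becomes its own two-class problem, so a model would emit `NUM_BITS` probabilities per prediction instead of a single real value.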
Information Retrieval
[IR-0] Quantifying Query Fairness Under Unawareness
Link: https://arxiv.org/abs/2506.04140
Authors: Thomas Jaenich,Alejandro Moreo,Alessandro Fabris,Graham McDonald,Andrea Esuli,Iadh Ounis,Fabrizio Sebastiani
Subjects: Information Retrieval (cs.IR)
Abstract:Traditional ranking algorithms are designed to retrieve the most relevant items for a user’s query, but they often inherit biases from data that can unfairly disadvantage vulnerable groups. Fairness in information access systems (IAS) is typically assessed by comparing the distribution of groups in a ranking to a target distribution, such as the overall group distribution in the dataset. These fairness metrics depend on knowing the true group labels for each item. However, when groups are defined by demographic or sensitive attributes, these labels are often unknown, leading to a setting known as “fairness under unawareness”. To address this, group membership can be inferred using machine-learned classifiers, and group prevalence is estimated by counting the predicted labels. Unfortunately, such an estimation is known to be unreliable under dataset shift, compromising the accuracy of fairness evaluations. In this paper, we introduce a robust fairness estimator based on quantification that effectively handles multiple sensitive attributes beyond binary classifications. Our method outperforms existing baselines across various sensitive attributes and, to the best of our knowledge, is the first to establish a reliable protocol for measuring fairness under unawareness across multiple queries and groups.
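The gap between counting predicted labels and a quantification-based estimate can be shown with a textbook method, adjusted classify-and-count. This is only a sketch of the general idea: the paper's estimator may be a different quantification method, and the function names and the assumption that the classifier's true and false positive rates are known from held-out data are introduced here for illustration.

```python
def classify_and_count(predictions):
    """Fraction of items predicted to belong to the group (labels are 0/1)."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the raw count using the classifier's error rates.

    Solves  cc = tpr * p + fpr * (1 - p)  for the true prevalence p,
    then clips the result to [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("classifier must satisfy tpr > fpr")
    cc = classify_and_count(predictions)
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))


# 100 ranked items, 31 of them predicted to belong to the protected group.
predicted_labels = [1] * 31 + [0] * 69
raw_estimate = classify_and_count(predicted_labels)
corrected_estimate = adjusted_classify_and_count(predicted_labels, tpr=0.8, fpr=0.1)
```

With a true prevalence of 0.30, a classifier with these error rates is expected to flag 0.8 × 0.30 + 0.1 × 0.70 = 0.31 of the items, and the correction maps that raw count back to 0.30.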
[IR-1] A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning
Link: https://arxiv.org/abs/2506.04083
Authors: Zhiyu Zhang,Wei Chen,Youfang Lin,Huaiyu Wan
Subjects: Information Retrieval (cs.IR)
Abstract:Recent Continual Learning (CL)-based Temporal Knowledge Graph Reasoning (TKGR) methods focus on significantly reducing computational cost and mitigating catastrophic forgetting caused by fine-tuning models with new data. However, existing CL-based TKGR methods still face two key limitations: (1) They usually one-sidedly reorganize individual historical facts, while overlooking the historical context essential for accurately understanding the historical semantics of these facts; (2) They preserve historical knowledge by simply replaying historical facts, while ignoring the potential conflicts between historical and emerging facts. In this paper, we propose a Deep Generative Adaptive Replay (DGAR) method, which can generate and adaptively replay historical entity distribution representations from the whole historical context. To address the first challenge, historical context prompts as sampling units are built to preserve the whole historical context information. To overcome the second challenge, a pre-trained diffusion model is adopted to generate the historical distribution. During the generation process, the common features between the historical and current distributions are enhanced under the guidance of the TKGR model. In addition, a layer-by-layer adaptive replay mechanism is designed to effectively integrate historical and current distributions. Experimental results demonstrate that DGAR significantly outperforms baselines in reasoning and mitigating forgetting.
[IR-2] GORACS: Group-level Optimal Transport-guided Coreset Selection for LLM-based Recommender Systems KDD2025
Link: https://arxiv.org/abs/2506.04015
Authors: Tiehua Mei,Hengrui Chen,Peng Yu,Jiaqing Liang,Deqing Yang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by KDD 2025
Abstract:Although large language models (LLMs) have shown great potential in recommender systems, the prohibitive computational costs for fine-tuning LLMs on entire datasets hinder their successful deployment in real-world scenarios. To develop affordable and effective LLM-based recommender systems, we focus on the task of coreset selection which identifies a small subset of fine-tuning data to optimize the test loss, thereby facilitating efficient LLM fine-tuning. Although there exist some intuitive solutions of subset selection, including distribution-based and importance-based approaches, they often lead to suboptimal performance due to the misalignment with downstream fine-tuning objectives or weak generalization ability caused by individual-level sample selection. To overcome these challenges, we propose GORACS, which is a novel Group-level Optimal tRAnsport-guided Coreset Selection framework for LLM-based recommender systems. GORACS is designed based on two key principles for coreset selection: 1) selecting the subsets that minimize the test loss to align with fine-tuning objectives, and 2) enhancing model generalization through group-level data selection. Corresponding to these two principles, GORACS has two key components: 1) a Proxy Optimization Objective (POO) leveraging optimal transport and gradient information to bound the intractable test loss, thus reducing computational costs by avoiding repeated LLM retraining, and 2) a two-stage Initialization-Then-Refinement Algorithm (ITRA) for efficient group-level selection. Our extensive experiments across diverse recommendation datasets and tasks validate that GORACS significantly reduces fine-tuning costs of LLMs while achieving superior performance over the state-of-the-art baselines and full data training. The source code of GORACS is available at this https URL.
[IR-3] Graph-Embedding Empowered Entity Retrieval
Link: https://arxiv.org/abs/2506.03895
Authors: Emma J. Gerritse,Faegheh Hasibi,Arjen P. de Vries
Subjects: Information Retrieval (cs.IR)
Comments: arXiv admin note: text overlap with arXiv:2005.02843
Abstract:In this research, we investigate methods for entity retrieval using graph embeddings. While various methods have been proposed over the years, most utilize a single graph embedding and entity linking approach. This hinders our understanding of how different graph embedding and entity linking methods impact entity retrieval. To address this gap, we investigate the effects of three different categories of graph embedding techniques and five different entity linking methods. We perform a reranking of entities using the distance between the embeddings of annotated entities and the entities we wish to rerank. We conclude that the selection of both graph embeddings and entity linkers significantly impacts the effectiveness of entity retrieval. For graph embeddings, methods that incorporate both graph structure and textual descriptions of entities are the most effective. For entity linking, both precision and recall concerning concepts are important for optimal retrieval performance. Additionally, it is essential for the graph to encompass as many entities as possible.
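The reranking step described above can be sketched as follows. The abstract only says that the distance between embeddings of annotated (linked) entities and candidate entities is used; the cosine distance, the linear interpolation with the first-stage score, and the names `rerank` and `weight` are assumptions made for this illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rerank(candidates, query_entities, embeddings, weight=0.5):
    """Re-sort candidates by mixing the first-stage score with embedding closeness.

    candidates: list of (entity_id, first_stage_score) pairs.
    query_entities: non-empty list of entity ids an entity linker found in the query.
    embeddings: dict mapping entity id to its graph-embedding vector.
    """
    def closeness(entity_id):
        distances = [cosine_distance(embeddings[entity_id], embeddings[q])
                     for q in query_entities]
        return 1.0 - sum(distances) / len(distances)

    rescored = [(e, (1 - weight) * s + weight * closeness(e)) for e, s in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


toy_embeddings = {"q": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_ranking = rerank([("b", 0.6), ("a", 0.5)], ["q"], toy_embeddings)
```

In the toy data, candidate "a" starts below "b" but moves to the top because its embedding coincides with that of the linked query entity.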
[IR-4] Understanding Mental Models of Generative Conversational Search and The Effect of Interface Transparency
Link: https://arxiv.org/abs/2506.03807
Authors: Chadha Degachi,Samuel Kernan Freire,Evangelos Niforatos,Gerd Kortuem
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: Work in Progress
Abstract:The experience and adoption of conversational search is tied to the accuracy and completeness of users’ mental models – their internal frameworks for understanding and predicting system behaviour. Thus, understanding these models can reveal areas for design interventions. Transparency is one such intervention which can improve system interpretability and enable mental model alignment. While past research has explored mental models of search engines, those of generative conversational search remain underexplored, even while the popularity of these systems soars. To address this, we conducted a study with 16 participants, who performed 4 search tasks using 4 conversational interfaces of varying transparency levels. Our analysis revealed that most user mental models were too abstract to support users in explaining individual search instances. These results suggest that 1) mental models may pose a barrier to appropriate trust in conversational search, and 2) hybrid web-conversational search is a promising novel direction for future search interface design.
[IR-5] Scaling Transformers for Discriminative Recommendation via Generative Pretraining KDD’25
Link: https://arxiv.org/abs/2506.03699
Authors: Chunqi Wang,Bingchao Wu,Zheng Chen,Lei Shen,Bing Wang,Xiaoyi Zeng
Subjects: Information Retrieval (cs.IR)
Comments: KDD’25
Abstract:Discriminative recommendation tasks, such as CTR (click-through rate) and CVR (conversion rate) prediction, play critical roles in the ranking stage of large-scale industrial recommender systems. However, training a discriminative model encounters a significant overfitting issue induced by data sparsity. Moreover, this overfitting issue worsens with larger models, causing them to underperform smaller ones. To address the overfitting issue and enhance model scalability, we propose a framework named GPSD (Generative Pretraining for Scalable Discriminative Recommendation), drawing inspiration from generative training, which exhibits no evident signs of overfitting. GPSD leverages the parameters learned from a pretrained generative model to initialize a discriminative model, and subsequently applies a sparse parameter freezing strategy. Extensive experiments conducted on both industrial-scale and publicly available datasets demonstrate the superior performance of GPSD. Moreover, it delivers remarkable improvements in online A/B tests. GPSD offers two primary advantages: 1) it substantially narrows the generalization gap in model training, resulting in better test performance; and 2) it leverages the scalability of Transformers, delivering consistent performance gains as models are scaled up. Specifically, we observe consistent performance improvements as the model dense parameters scale from 13K to 0.3B, closely adhering to power laws. These findings pave the way for unifying the architectures of recommendation models and language models, enabling the direct application of techniques well-established in large language models to recommendation models.
[IR-6] Quake: Adaptive Indexing for Vector Search
Link: https://arxiv.org/abs/2506.03437
Authors: Jason Mohoney,Devesh Sarda,Mengze Tang,Shihabur Rahman Chowdhury,Anil Pacaci,Ihab F. Ilyas,Theodoros Rekatsinas,Shivaram Venkataraman
Subjects: Information Retrieval (cs.IR)
Abstract:Vector search, the task of finding the k-nearest neighbors of high-dimensional vectors, underpins many machine learning applications, including recommendation systems and information retrieval. However, existing approximate nearest neighbor (ANN) methods perform poorly under dynamic, skewed workloads where data distributions evolve. We introduce Quake, an adaptive indexing system that maintains low latency and high recall in such environments. Quake employs a hierarchical partitioning scheme that adjusts to updates and changing access patterns, guided by a cost model that predicts query latency based on partition sizes and access frequencies. Quake also dynamically optimizes query execution parameters to meet recall targets using a novel recall estimation model. Furthermore, Quake utilizes optimized query processing, leveraging NUMA-aware parallelism for improved memory bandwidth utilization. To evaluate Quake, we prepare a Wikipedia vector search workload and develop a workload generator to create vector search workloads with configurable access patterns. Our evaluation shows that on dynamic workloads, Quake achieves query latency reductions of 1.5-22x and update latency reductions of 6-83x compared to state-of-the-art indexes SVS, DiskANN, HNSW, and SCANN.
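A cost model driven by partition sizes and access frequencies might look like the toy sketch below. It assumes scan cost is linear in partition size and that access frequency is the fraction of queries touching a partition; the function names and the linear form are hypothetical and are not Quake's actual model, which the abstract does not specify.

```python
def predicted_latency(partitions, per_vector_cost=1.0, per_partition_overhead=0.0):
    """Expected per-query scan cost under a toy linear model.

    partitions: list of (size, access_frequency) pairs, where access_frequency
    is the fraction of queries that scan the partition.
    """
    return sum(freq * (per_partition_overhead + per_vector_cost * size)
               for size, freq in partitions)


def hottest_partition(partitions):
    """Index of the partition contributing most to the predicted cost."""
    return max(range(len(partitions)),
               key=lambda i: partitions[i][0] * partitions[i][1])


toy_partitions = [(1000, 0.9), (5000, 0.05), (200, 0.5)]
toy_cost = predicted_latency(toy_partitions)
split_candidate = hottest_partition(toy_partitions)
```

In the toy data the small but frequently scanned first partition dominates the predicted cost, which is the kind of signal an adaptive index could use to decide where to repartition.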
[IR-7] Impact of Rankings and Personalized Recommendations in Marketplaces
Link: https://arxiv.org/abs/2506.03369
Authors: Omar Besbes,Yash Kanoria,Akshit Kumar
Subjects: Theoretical Economics (econ.TH); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Abstract:Individuals often navigate several options with incomplete knowledge of their own preferences. Information provisioning tools such as public rankings and personalized recommendations have become central to helping individuals make choices, yet their value proposition under different marketplace environments remains unexplored. This paper studies a stylized model to explore the impact of these tools in two marketplace settings: uncapacitated supply, where items can be selected by any number of agents, and capacitated supply, where each item is constrained to be matched to a single agent. We model each agent’s utility as a weighted combination of a common term, which depends only on the item and reflects the item’s population-level quality, and an idiosyncratic term, which depends on the agent-item pair and captures individual-specific tastes. Public rankings reveal the common term, while personalized recommendations reveal both terms. In supply-unconstrained settings, both public rankings and personalized recommendations improve welfare, with their relative value determined by the degree of preference heterogeneity. Public rankings are effective when preferences are relatively homogeneous, while personalized recommendations become critical as heterogeneity increases. In contrast, in supply-constrained settings, revealing just the common term, as done by public rankings, provides limited benefit since the total common value available is limited by capacity constraints, whereas personalized recommendations, by revealing both common and idiosyncratic terms, significantly enhance welfare by enabling agents to match with items they idiosyncratically value highly. These results illustrate the interplay between supply constraints and preference heterogeneity in determining the effectiveness of information provisioning tools, offering insights for their design and deployment in diverse settings.
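The utility model in the abstract can be illustrated with a small simulation of the uncapacitated setting. This is a minimal sketch under assumptions that the abstract does not specify: the weights are taken to be `1 - w` on the common term and `w` on the idiosyncratic term, both terms are drawn uniformly from [0, 1], and the function name `simulate_uncapacitated` is hypothetical. The paper's own model may differ.

```python
import random


def simulate_uncapacitated(num_agents: int, num_items: int, w: float, seed: int = 0):
    """Return (average welfare under a public ranking,
               average welfare under personalized recommendations).

    Assumed utility: u[i][j] = (1 - w) * common[j] + w * idio[i][j],
    where w controls the degree of preference heterogeneity.
    """
    rng = random.Random(seed)
    common = [rng.random() for _ in range(num_items)]
    # A public ranking reveals only the common term, so every agent
    # picks the item with the highest common value.
    best_common = max(range(num_items), key=lambda j: common[j])

    ranking_total = 0.0
    personalized_total = 0.0
    for _ in range(num_agents):
        idio = [rng.random() for _ in range(num_items)]
        utility = [(1 - w) * common[j] + w * idio[j] for j in range(num_items)]
        ranking_total += utility[best_common]
        # Personalized recommendations reveal both terms, so the agent
        # picks the item with the highest total utility.
        personalized_total += max(utility)
    return ranking_total / num_agents, personalized_total / num_agents


if __name__ == "__main__":
    for w in (0.0, 0.2, 0.8):
        ranking, personalized = simulate_uncapacitated(200, 10, w)
        print(f"w={w}: ranking={ranking:.3f}, personalized={personalized:.3f}")
```

With `w = 0` the two tools coincide, since the common term is all that matters. As `w` grows, the gap in favor of personalized recommendations widens, in line with the abstract's claim about preference heterogeneity. The capacitated setting is left out here because it requires a matching rule that the abstract does not describe.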