This post contains the latest paper listing retrieved from arXiv.org on 2025-06-11, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-06-11)
582 new papers today, including:
- Natural Language Processing: 110 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 183 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 130 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 207 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Quick Read: This paper tackles the cross-modal performance gap of vision-language models (VLMs): accuracy on image-based tasks (e.g., counting objects in an image) is lower than on the analogous text-based tasks (e.g., counting words in a text). The key to the solution is identifying and comparing the task-specific computational sub-graphs (circuits) in each modality, which reveals that image representations only become aligned with their textual counterparts in later layers, too late to effectively influence subsequent processing. To overcome this, the authors patch the later-layer representations of visual data tokens back into earlier layers, improving the quality of the visual representations; experiments show this closes a third of the cross-modal performance gap on average.
Link: https://arxiv.org/abs/2506.09047
Authors: Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov
Affiliations: Technion – Israel Institute of Technology; UC Berkeley
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
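To make the intervention concrete, here is a toy sketch of the back-patching idea on a stand-in stack of layers: cache the later-layer hidden states of the visual-token positions from a clean forward pass, then overwrite an earlier layer's states at those positions in a second pass. The model, layer indices, and token positions are invented for illustration; the paper applies this to real VLM circuits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers, seq_len = 32, 8, 10
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_layers)]
)
visual_positions = [0, 1, 2, 3]  # hypothetical positions of image tokens

def forward(x, patch=None):
    """Run the stack; `patch` = (layer_idx, positions, cached_states) or None."""
    hidden = [x]
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and i == patch[0]:
            _, positions, cached = patch
            x = x.clone()
            x[:, positions] = cached  # overwrite early-layer visual states
        hidden.append(x)
    return x, hidden

x = torch.randn(1, seq_len, d_model)

# 1) Clean pass: cache the layer-6 states of the visual tokens.
_, hidden = forward(x)
cached = hidden[6][:, visual_positions]

# 2) Patched pass: write those "matured" visual states back into layer 2,
#    so later positions can use aligned visual features earlier in processing.
patched_out, _ = forward(x, patch=(2, visual_positions, cached))
print(patched_out.shape)  # torch.Size([1, 10, 32])
```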
[NLP-1] Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Quick Read: This paper targets a limitation of typical large vision-language models (LVLMs): the visual modality is not fully incorporated into the learning process, so such models cannot use images without captions, captions may omit critical visual details, and some vision-centric content is hard to convey in text. The key to the solution is Autoregressive Semantic Visual Reconstruction (ASVR), which jointly learns the visual and textual modalities within a unified autoregressive framework; crucially, it autoregressively reconstructs the semantic representation of images rather than their raw visual appearance, which substantially improves multimodal understanding.
Link: https://arxiv.org/abs/2506.09040
Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
Affiliations: Shanghai Innovation Institute; Fudan University; AutoLab, Westlake University; Zhejiang University; Shanghai AI Lab; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at this https URL.
[NLP-2] Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
Quick Read: This paper addresses a limitation of existing large language model (LLM) routers, which typically perform single-round, one-to-one query-to-model mapping and therefore cannot exploit the complementary strengths of multiple LLMs on complex tasks. The key to the solution is Router-R1, a reinforcement learning (RL) framework that formulates multi-LLM routing and aggregation as a sequential decision process: the router itself is instantiated as a reasoning-capable LLM that interleaves "think" and "route" actions and integrates each response into an evolving context. Router-R1 further uses a lightweight rule-based reward combining format rewards, final-outcome rewards, and a novel cost reward to optimize the performance-cost trade-off.
Link: https://arxiv.org/abs/2506.09033
Authors: Haozhen Zhang, Tao Feng, Jiaxuan You
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave “think” actions (internal deliberation) with “route” actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for performance and cost trade-off optimization, opening a pathway toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost efficiency. Code is available at this https URL.
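A minimal sketch of what a rule-based reward of this shape could look like; the tag format, weights, and price table below are assumptions, not the paper's exact settings.

```python
import re

MODEL_COST = {"small-llm": 0.2, "large-llm": 1.0}  # hypothetical relative prices

def compute_reward(trajectory: str, answer: str, gold: str,
                   routed_models: list[str],
                   w_format: float = 0.1, w_outcome: float = 1.0,
                   w_cost: float = 0.3) -> float:
    # Format reward: deliberation wrapped in <think>, answer in <answer>.
    well_formed = bool(re.search(r"<think>.*</think>", trajectory, re.S)
                       and re.search(r"<answer>.*</answer>", trajectory, re.S))
    r_format = 1.0 if well_formed else 0.0
    # Final outcome reward: exact match against the gold answer.
    r_outcome = 1.0 if answer.strip() == gold.strip() else 0.0
    # Cost reward: cheaper routing chains score higher.
    total_cost = sum(MODEL_COST.get(m, 1.0) for m in routed_models)
    r_cost = 1.0 / (1.0 + total_cost)
    return w_format * r_format + w_outcome * r_outcome + w_cost * r_cost

traj = "<think>needs a stronger model</think><answer>42</answer>"
print(compute_reward(traj, "42", "42", ["small-llm", "large-llm"]))
```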
[NLP-3] e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Quick Read: This paper tackles the difficulty large language models (LLMs) have with extrapolation on reasoning tasks, i.e., performance stops improving beyond the maximum token budget seen during training. The key to the solution is training the LLM to perform in-context exploration, so that at test time it spends its compute budget effectively by chaining operations (generation, verification, refinement, etc.), "thinking" for longer to solve harder problems. The recipe has three core ingredients: (1) chaining skills in which the base LLM has asymmetric competence; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL; and (3) coupling task difficulty with the training token budget via a specifically designed curriculum to structure in-context exploration.
Link: https://arxiv.org/abs/2506.09026
Authors: Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep “thinking” for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging “negative” gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME’25 and HMMT’25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
[NLP-4] Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features
Quick Read: This paper asks how proofreading interventions by humans and large language models (LLMs) improve the overall intelligibility of second-language writing. Its key contribution is an analysis comparing the lexical and syntactic effects of human versus LLM proofreading: both enhance bigram lexical features, improving coherence and contextual connectedness between adjacent words, while LLM proofreading takes a more generative approach, extensively reworking vocabulary and sentence structure, e.g., using more diverse and sophisticated vocabulary and adding more adjective modifiers in noun phrases.
Link: https://arxiv.org/abs/2506.09021
Authors: Hakyung Sung, Karla Csuros, Min-Chang Sung
Affiliations: LCR-ADS Lab, Linguistics, University of Oregon; West University of Timisoara; Gyeongin National University of Education
Categories: Computation and Language (cs.CL)
Comments:
Abstract:This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.
[NLP-5] Learning to Reason Across Parallel Samples for LLM Reasoning
Quick Read: This paper studies how to scale test-time compute for large language models (LLMs), in particular how to use multiple samples effectively to improve answer accuracy in math domains. The key to the solution is training a compact LLM, the Sample Set Aggregator (SSA), which takes the concatenated sequence of multiple samples and outputs the final answer, optimized for answer accuracy with reinforcement learning. SSA outperforms alternatives such as reward-model-based re-ranking on multiple reasoning datasets and generalizes well.
Link: https://arxiv.org/abs/2506.09014
Authors: Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi
Affiliations: CUNY Grad Center; Princeton University; BMCC; CCNY; New York University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such multiple-sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows a promising generalization ability across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLM that analyzes and aggregates sampled answers, our approach can work with the outputs from premier black box models easily and efficiently.
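As a rough illustration of the aggregation setup, the snippet below assembles K parallel samples into the single concatenated sequence an SSA-style aggregator would consume; the prompt template is an assumption.

```python
def build_ssa_input(question: str, samples: list[str]) -> str:
    """Concatenate K parallel samples into one aggregator input."""
    parts = [f"Question: {question}\n"]
    for i, s in enumerate(samples, 1):
        parts.append(f"[Sample {i}]\n{s.strip()}\n")
    parts.append("Given the candidate solutions above, output the final answer.")
    return "\n".join(parts)

samples = ["... so the answer is 12.", "... hence 12.", "... I get 15."]
print(build_ssa_input("What is 3 * 4?", samples))
# The aggregator LM is trained with RL so its output matches the gold
# answer; unlike majority voting, it can read and reconcile the samples.
```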
[NLP-6] UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags
Quick Read: This paper addresses annotation inconsistency and data scarcity in Universal Dependencies treebanks for second-language (L2) Korean. The key to the solution is a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns them with the corresponding UPOS categories, improving annotation consistency and model performance.
Link: https://arxiv.org/abs/2506.09009
Authors: Hakyung Sung, Gyu-Ho Shin, Chanyoung Lee, You Kyung Sung, Boo Kyung Jung
Affiliations: University of Oregon; University of Illinois Chicago; Konkuk University; Chung-Ang University; Yale University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The present study extends recent work on Universal Dependencies annotations for second-language (L2) Korean by introducing a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories. We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays. To evaluate the impact of XPOS-UPOS alignments, we fine-tune L2-Korean morphosyntactic analysis models on datasets both with and without these alignments, using two NLP toolkits. Our results indicate that the aligned dataset not only improves consistency across annotation layers but also enhances morphosyntactic tagging and dependency-parsing accuracy, particularly in cases of limited annotated data.
[NLP-7] SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner ICML2025
Quick Read: This paper addresses the limitation that traditional software-engineering data relies on human-submitted issues and thus fails to reflect the incremental requirements and code evolution of real development. The key to the solution is SWE-Flow, a framework grounded in Test-Driven Development (TDD) that automatically infers incremental development steps from unit tests and builds a Runtime Dependency Graph (RDG) that precisely captures function interactions, from which it generates a structured development schedule. This enables automatic generation of partial codebases and the corresponding modifications from unit tests, yielding fully verifiable TDD tasks.
Link: https://arxiv.org/abs/2506.09003
Authors: Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICML 2025
Abstract:We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at [Github](this https URL).
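The schedule-generation step can be pictured as a topological sort over the runtime dependency graph. The toy graph below is invented; SWE-Flow derives the real one from unit-test execution.

```python
from graphlib import TopologicalSorter

# Edges: each function depends on the functions it calls at runtime.
rdg = {
    "parse_config": set(),
    "load_data": {"parse_config"},
    "train": {"load_data", "parse_config"},
    "evaluate": {"train"},
}

schedule = list(TopologicalSorter(rdg).static_order())
for step, fn in enumerate(schedule, 1):
    # At each step the framework would expose a partial codebase plus the
    # unit tests for `fn`, asking the model to implement it.
    print(f"step {step}: implement {fn} until its unit tests pass")
```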
[NLP-8] Employing self-supervised learning models for cross-linguistic child speech maturity classification INTERSPEECH2025
Quick Read: This paper tackles the poor performance of speech systems on downstream tasks for child speech, caused mainly by small training corpora and the inherent difficulty of child speech. The key to the solution is SpeechMaturity, a new dataset of maximally ecologically valid child vocalizations across multiple language environments, containing 242,004 labeled vocalizations, far exceeding previous work. Applying this dataset to state-of-the-art transformer models, the authors classify vocalization types (cry, laughter, mature speech (consonant+vowel), and immature speech (consonant or vowel only)), reaching human-comparable accuracy and robustness across rural and urban settings.
Link: https://arxiv.org/abs/2506.08999
Authors: Theo Zhang, Madurya Suresh, Anne S. Warlaumont, Kasia Hitczenko, Alejandrina Cristia, Margaret Cychosz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in Interspeech 2025. 5 pages, 2 figures. For the associated GitHub repository, see this https URL
Abstract:Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically-valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. The dataset contains 242,004 labeled vocalizations, orders of magnitude larger than previous work. Models were trained to distinguish between cry, laughter, mature (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperform state-of-the-art models trained on previous datasets, achieve classification accuracy comparable to humans, and are robust across rural and urban settings.
[NLP-9] SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
Quick Read: This paper addresses a bottleneck in reinforcement learning (RL) training of large language models (LLMs): the scarcity of high-quality problems with verifiable answers and the inefficiency of existing problem-synthesis strategies. The key to the solution is the Self-aware Weakness-driven problem Synthesis framework (SwS), which systematically identifies problems the model persistently fails on during iterative sampling, extracts their core concepts, and synthesizes targeted problems to strengthen the model's weak areas in subsequent training; this lets the model identify and improve its own weaknesses without external knowledge distillation, yielding average performance gains across multiple reasoning benchmarks.
Link: https://arxiv.org/abs/2506.08989
Authors: Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen
Affiliations: University of California, Los Angeles; School of Artificial Intelligence, Chinese Academy of Sciences; Microsoft; Tsinghua University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Reinforcement Learning; Large Language Models; LLM Reasoning
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model’s capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model’s weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
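A minimal sketch of the weakness criterion described above: flag questions whose pass rate stays low across RL sampling rounds. The threshold and data layout are assumptions.

```python
from collections import defaultdict

def persistent_failures(rollouts, pass_threshold=0.2):
    """rollouts: iterable of (question_id, round_idx, solved) tuples."""
    stats = defaultdict(lambda: [0, 0])  # qid -> [n_solved, n_attempts]
    for qid, _rnd, solved in rollouts:
        stats[qid][0] += int(solved)
        stats[qid][1] += 1
    return [qid for qid, (s, n) in stats.items() if s / n <= pass_threshold]

rollouts = [("q1", 0, False), ("q1", 1, False), ("q1", 2, False),
            ("q2", 0, True), ("q2", 1, True)]
weak = persistent_failures(rollouts)
# Core concepts would next be extracted from the flagged questions and used
# to synthesize new, targeted training problems.
print(weak)  # ['q1']
```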
[NLP-10] Naturalistic Language-related Movie-Watching fMRI Task for Detecting Neurocognitive Decline and Disorder
Quick Read: This paper targets early detection of neurocognitive disorder (NCD) in older adults, which is crucial for timely intervention and slowing disease progression. The key to the solution is a novel naturalistic language-related fMRI task combined with machine-learning classifiers that use fMRI features extracted from the task plus demographics (age, gender, and years of education) to classify cognitive status, achieving high accuracy (an average area under the curve of 0.86) in distinguishing normal from declining cognition.
Link: https://arxiv.org/abs/2506.08986
Authors: Yuejiao Wang, Xianmin Gong, Xixin Wu, Patrick Wong, Hoi-lam Helene Fung, Man Wai Mak, Helen Meng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 3 figures, accepted by ISCSLP 2024
Abstract:Early detection is crucial for timely intervention aimed at preventing and slowing the progression of neurocognitive disorder (NCD), a common and significant health problem among the aging population. Recent evidence has suggested that language-related functional magnetic resonance imaging (fMRI) may be a promising approach for detecting cognitive decline and early NCD. In this paper, we proposed a novel, naturalistic language-related fMRI task for this purpose. We examined the effectiveness of this task among 97 non-demented Chinese older adults from Hong Kong. The results showed that machine-learning classification models based on fMRI features extracted from the task and demographics (age, gender, and education year) achieved an average area under the curve of 0.86 when classifying participants’ cognitive status (labeled as NORMAL vs DECLINE based on their scores on a standard neurocognitive test). Feature localization revealed that the fMRI features most frequently selected by the data-driven approach came primarily from brain regions associated with language processing, such as the superior temporal gyrus, middle temporal gyrus, and right cerebellum. The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.
[NLP-11] FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents INTERSPEECH2025
Quick Read: This paper studies the impact of language variability on speech research and speaker verification across bilingual speakers' native language (L1), second language (L2), and imitated L2 (fake foreign accent). The key contribution is the FROST-EMA corpus (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography), containing data from 18 bilingual speakers across these conditions, enabling research on language variability from both phonetic and technological perspectives. Two preliminary case studies assess the impact of L2 and imitated L2 on an automatic speaker verification system and illustrate one speaker's articulatory patterns across L1, L2, and a fake accent.
Link: https://arxiv.org/abs/2506.08981
Authors: Satu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi, Einar Meister
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2025
Abstract:We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. It consists of 18 bilingual speakers, who produced speech in their native language (L1), second language (L2), and imitated L2 (fake foreign accent). The new corpus enables research into language variability from phonetic and technological points of view. Accordingly, we include two preliminary case studies to demonstrate both perspectives. The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent.
[NLP-12] Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Quick Read: This paper addresses the limited generalization of existing multimodal-LLM-powered autonomous agents to compositional tasks, since prior work has focused mainly on atomic tasks. Alongside the UI-NEXUS benchmark for evaluating compositional mobile tasks, the key to the solution is AGENT-NEXUS, a lightweight and efficient scheduling system that dynamically decomposes long-horizon tasks into a series of self-contained atomic subtasks, improving existing mobile agents' success rates on compositional operation tasks.
Link: https://arxiv.org/abs/2506.08972
Authors: Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks – such as shot-chain execution tasks and single-screen grounding tasks – while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly sacrificing inference overhead. The demo video, dataset, and code are available on the project page at this https URL.
[NLP-13] Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Quick Read: This paper addresses a limitation of large audio-language models (LALMs): their reliance on text-based outputs prevents them from directly generating natural speech responses for seamless audio interaction. The key to the solution is Step-Audio-AQAA, a fully end-to-end LALM for Audio Query-Audio Answer (AQAA) tasks, which integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130B-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Post-training interleaves text and audio token outputs and combines Direct Preference Optimization (DPO) with model merging to improve semantic coherence and overall performance.
Link: https://arxiv.org/abs/2506.08967
Authors: Ailin Huang,Bingxin Li,Bruce Wang,Boyong Wu,Chao Yan,Chengli Feng,Heng Wang,Hongyu Zhou,Hongyuan Wang,Jingbei Li,Jianjian Sun,Joanna Wang,Mingrui Chen,Peng Liu,Ruihang Miao,Shilei Jiang,Tian Fei,Wang You,Xi Chen,Xuerui Yang,Yechang Huang,Yuxiang Zhang,Zheng Ge,Zheng Gong,Zhewei Huang,Zixin Zhang,Bin Wang,Bo Li,Buyun Ma,Changxin Miao,Changyi Wan,Chen Xu,Dapeng Shi,Dingyuan Hu,Enle Liu,Guanzhe Huang,Gulin Yan,Hanpeng Hu,Haonan Jia,Jiahao Gong,Jiaoren Wu,Jie Wu,Jie Yang,Junzhe Lin,Kaixiang Li,Lei Xia,Longlong Gu,Ming Li,Nie Hao,Ranchen Ming,Shaoliang Pang,Siqi Liu,Song Yuan,Tiancheng Cao,Wen Li,Wenqing He,Xu Zhao,Xuelin Zhang,Yanbo Yu,Yinmin Zhong,Yu Zhou,Yuanwei Liang,Yuanwei Lu,Yuxiang Yang,Zidong Yang,Zili Zhang,Binxing Jiao,Heung-Yeung Shum,Jiansheng Chen,Jing Li,Xiangyu Zhang,Xinhao Zhang,Yibo Zhu,Daxin Jiang,Shuchang Zhou,Chen Hu
Affiliations: Step-Audio Team; StepFun
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 12 pages, 3 figures
Abstract:Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks.
[NLP-14] Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Quick Read: This paper investigates why pretrained language models (LMs) are prone to arithmetic errors, often attributed to the inherent unreliability of distributionally learned embeddings for representing exact quantities. The key to the solution is a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy, showing that after pre-training alone LMs represent numbers with remarkable precision. Embedding precision as judged by the probe explains a large portion of LM errors in elementary arithmetic, and aligning the embeddings with the pattern the probe discovers mitigates these errors.
Link: https://arxiv.org/abs/2506.08966
Authors: Marek Kadlčík, Michal Štefánik, Timothee Mickus, Michal Spiegel, Josef Kuchař
Affiliations: TransformersClub @ Faculty of Informatics, Masaryk University; Language Technology, University of Helsinki; Kempelen Institute of Intelligent Technologies
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models’ representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after the sole pre-training, LMs represent numbers with remarkable precision. Finally, we find that the embeddings’ preciseness judged by our probe’s accuracy explains a large portion of LM’s errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
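As a toy illustration of the probing setup (not the paper's probe, and with synthetic sinusoidal embeddings standing in for real LM input embeddings), the snippet below shows how a nonlinear probe can decode values from a sinusoidal code that a plain linear read-out struggles with.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

numbers = np.arange(1000)
periods = np.array([10.0, 100.0, 1000.0])          # assumed frequency bands
angles = 2 * np.pi * numbers[:, None] / periods
emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (1000, 6)
y = numbers / 1000.0                                # scaled targets

rng = np.random.default_rng(0)
idx = rng.permutation(len(numbers))
tr, te = idx[:800], idx[800:]

linear = LinearRegression().fit(emb[tr], y[tr])
mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=5000,
                   random_state=0).fit(emb[tr], y[tr])

# Held-out R^2: the nonlinear probe can invert the sinusoidal code,
# while a linear read-out of the same features falls short.
print("linear R^2:", round(linear.score(emb[te], y[te]), 3))
print("MLP    R^2:", round(mlp.score(emb[te], y[te]), 3))
```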
[NLP-15] Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions ACL
Quick Read: This paper investigates how large language models (LLMs) manage common ground in the political domain and whether they can identify and correct false user beliefs. It examines LLMs' ability to answer direct knowledge questions and loaded questions that presuppose misinformation, evaluating whether loaded questions lead LLMs to engage in active conversational grounding and reject false information, in connection to their level of knowledge and political bias; the findings reveal significant limitations in LLMs' capacity to curb misinformation in political discourse.
Link: https://arxiv.org/abs/2506.08952
Authors: Clara Lachenmaier, Judith Sieker, Sina Zarrieß
Affiliations: Bielefeld University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint accepted at ACL Main Conference 2025
Abstract:Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine the ability of LLMs to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.
[NLP-16] FaithfulRAG : Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation
Quick Read: This paper addresses the unfaithfulness of retrieval-augmented large language models (LLMs) on knowledge-intensive tasks, where outputs may ignore the retrieved context or inconsistently blend it with the model's parametric knowledge. Existing methods enforce strict context adherence through prompts or modified decoding, but a key limitation is that they forcibly suppress the model's parametric knowledge, damaging its internal knowledge structure and increasing the risk of misreading the context. The key to the proposed FaithfulRAG is resolving knowledge conflicts by explicitly modeling discrepancies between parametric knowledge and retrieved content: it identifies conflicting knowledge at the fact level and designs a self-thinking process that lets LLMs reason about and integrate conflicting facts before generating a response.
Link: https://arxiv.org/abs/2506.08938
Authors: Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su
Affiliations: Xiamen University; Institute of Artificial Intelligence, Xiamen University; The Hong Kong Polytechnic University; Migu Meland Co., Ltd; Soochow University; Singapore Management University; Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments: Qinggang Zhang and Zhishang Xiang contributed equally to this work. Corresponding author: Jinsong Su
Abstract:Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and the retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at this https URL
[NLP-17] Can A Gamer Train A Mathematical Reasoning Model?
Quick Read: This paper addresses the prohibitive compute typically required to train large language models (LLMs) for mathematical reasoning, which usually relies on high-end hardware clusters. The key to the solution is integrating reinforcement learning with memory-optimization techniques so that a single consumer gaming GPU (an RTX 3080 Ti) suffices to train a strong mathematical reasoning model; experiments show performance comparable to or better than models several times larger under resource constraints, challenging the view that state-of-the-art mathematical reasoning requires massive infrastructure.
Link: https://arxiv.org/abs/2506.08935
Authors: Andrew Shin
Affiliations: Youron Artificial Intelligence; Keio University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B parameter mathematical reasoning model on RTX 3080 Ti of 16GB memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. this https URL.
[NLP-18] Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
Quick Read: This paper asks whether already-deployed, non-reasoning vision-language models (VLMs) can be induced to produce implicit long chain-of-thought reasoning without any additional training or supervision. The key to the solution is a Monte Carlo Tree Search (MCTS)-inspired algorithm that injects subquestion-subanswer pairs into the model's output stream, framing reasoning as a search process that helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces.
Link: https://arxiv.org/abs/2506.08927
Authors: David Acuna, Ximing Lu, Jaehun Jung, Hyunwoo Kim, Amlan Kar, Sanja Fidler, Yejin Choi
Affiliations: NVIDIA; University of Washington; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning – akin to the success observed in language models – via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces – without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model’s output stream. We show that framing reasoning as a search process – where subquestions act as latent decisions within a broader inference trajectory – helps the model “connect the dots” between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.
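The injection mechanism can be sketched as a simple control loop; the search and scoring machinery of the MCTS-inspired algorithm is omitted, and `llm` is a stub standing in for any deployed non-reasoning model.

```python
import random

def llm(prompt: str) -> str:  # stub so the sketch runs end to end
    return random.choice(["A short sub-answer.", "Another partial finding."])

def socratic_rollout(question: str, max_steps: int = 3) -> str:
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        sub_q = llm(trace + "Propose one helpful subquestion:")
        sub_a = llm(trace + f"Subquestion: {sub_q}\nAnswer it briefly:")
        # Inject the pair directly into the output stream.
        trace += f"Subquestion: {sub_q}\nSubanswer: {sub_a}\n"
    trace += "Final answer: " + llm(trace + "Now give the final answer:")
    return trace

random.seed(0)
print(socratic_rollout("How many objects are in the image?"))
```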
[NLP-19] PropMEND: Hypernetworks for Knowledge Propagation in LLM s
Quick Read: This paper addresses the failure of knowledge-editing methods for large language models (LLMs) to propagate injected knowledge: edited models cannot answer multi-hop questions that require reasoning with the injected facts. The key to the solution is PropMEND, a hypernetwork-based knowledge-propagation method that meta-learns how to modify the gradients of a language-modeling loss so that injected information propagates, improving performance on multi-hop reasoning tasks.
Link: https://arxiv.org/abs/2506.08920
Authors: Zeyu Leo Liu, Greg Durrett, Eunsol Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review
Abstract:Knowledge editing techniques for large language models (LLMs) can inject knowledge that is later reproducible verbatim, but they fall short on propagating that knowledge: models cannot answer questions that require reasoning with the injected knowledge. We present a hypernetwork-based approach for knowledge propagation, named PropMEND, where we meta-learn how to modify gradients of a language modeling loss to encourage injected information to propagate. Our approach extends the meta-objective of MEND [29] so that gradient updates on knowledge are transformed to enable answering multi-hop questions involving that knowledge. We show improved performance on the RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We further introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork, testing knowledge propagation along relations and entities unseen during hypernetwork training. PropMEND still outperforms existing approaches in unseen entity-relation pairs, yet the performance gap decreases substantially, suggesting future work in propagating knowledge to a wide range of relations.
[NLP-20] Dialect Normalization using Large Language Models and Morphological Rules
Quick Read: This paper tackles the difficulty that natural language understanding systems have with low-resource languages, including dialects of high-resource ones, via dialect-to-standard normalization. The key to the solution is a new normalization method that combines rule-based, linguistically informed transformations with large language models (LLMs) using targeted few-shot prompting, without requiring any parallel data.
Link: https://arxiv.org/abs/2506.08907
Authors: Antonios Dimakis (1 and 2), John Pavlopoulos (1 and 3), Antonios Anastasopoulos (1 and 4)
Affiliations: Archimedes, Athena Research Center, Greece; Department of Informatics and Telecommunications, NKUA; Department of Informatics, Athens University of Economics and Business, Greece; Department of Computer Science, George Mason University
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 18 figures, to be published in the Findings of the Association for Computational Linguistics 2025
Abstract:Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it on a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.
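A small sketch of the two-stage recipe: apply rule-based transformations first, then hand the partially normalized sentence to an LLM with targeted few-shot exemplars. The regex rules and Greek-like strings are invented placeholders, not the paper's actual rules.

```python
import re

RULES = [
    (re.compile(r"\btsi\b"), "tis"),   # hypothetical dialect-to-standard rule
    (re.compile(r"\bein'"), "einai"),
]

def apply_rules(text: str) -> str:
    for pat, repl in RULES:
        text = pat.sub(repl, text)
    return text

def build_prompt(dialect_sentence: str, exemplars: list[tuple[str, str]]) -> str:
    shots = "\n".join(f"Dialect: {d}\nStandard: {s}" for d, s in exemplars)
    partially_normalized = apply_rules(dialect_sentence)
    return (f"Normalize the dialectal sentence into the standard language.\n"
            f"{shots}\nDialect: {partially_normalized}\nStandard:")

print(build_prompt("ein' tsi oraia mera", [("ein' kala", "einai kala")]))
```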
[NLP-21] From Legal Texts to Defeasible Deontic Logic via LLM s: A Study in Automated Semantic Analysis
Quick Read: This paper addresses automated semantic analysis of legal texts, aiming to transform them into formal representations in Defeasible Deontic Logic (DDL). The key to the solution is a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates their syntactic and semantic coherence. Experiments across LLM configurations show that effective prompt engineering substantially improves model performance on formalizing legal norms.
Link: https://arxiv.org/abs/2506.08899
Authors: Elias Horner, Cristinel Mateis, Guido Governatori, Agata Ciabattoni
Affiliations: TU Wien (Vienna University of Technology); AIT Austrian Institute of Technology; Central Queensland University; Charles Sturt University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments:
Abstract:We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs - particularly when prompted effectively - can significantly contribute to scalable legal informatics.
[NLP-22] PlantBert: An Open Source Language Model for Plant Science
Quick Read: This paper addresses the marked gap between the progress of biomedical and clinical NLP and the scarcity of such tools for plant science, particularly structured knowledge extraction from plant stress-response literature. The key to the solution is PlantBert, a high-performance, open-source language model built on the DeBERTa architecture, fine-tuned on an expert-annotated corpus of abstracts and combined with rule-enhanced linguistic post-processing and ontology-grounded entity normalization to capture biologically meaningful relationships with precision and semantic fidelity.
Link: https://arxiv.org/abs/2506.08897
Authors: Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri
Affiliations: Mohammed VI Polytechnic University; Sidi Mohamed Ben Abdellah University; Cadi Ayyad University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantBert, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture-known for its disentangled attention and robust contextual encoding-PlantBert is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantBert to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantBert exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantBert bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
[NLP-23] AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)
Quick Read: This paper exposes a geometric blind spot in alignment: adversarial prompts hide unsafe intent by embedding close to the safe representation manifold, evading surface-level defenses such as Direct Preference Optimization (DPO). The key to the solution is the ALKALI adversarial benchmark together with the GRACE alignment framework, which couples preference learning with latent-space regularization: it enforces latent separation between safe and adversarial completions and adversarial cohesion among unsafe and jailbreak behaviors, reducing the Attack Success Rate (ASR) by up to 39%.
Link: https://arxiv.org/abs/2506.08885
Authors: Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at this https URL.
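A toy rendering of the two GRACE-style latent constraints as a regularizer over pooled embeddings; the margin, weighting, and pooling are assumptions, and in practice the term would be added to the preference-learning loss rather than used alone.

```python
import torch
import torch.nn.functional as F

def grace_regularizer(safe_emb, adv_emb, margin=1.0, cohesion_weight=0.1):
    """safe_emb, adv_emb: (batch, dim) layerwise-pooled embeddings."""
    safe_c, adv_c = safe_emb.mean(0), adv_emb.mean(0)
    # Latent separation: keep safe and adversarial centroids >= margin apart.
    separation = F.relu(margin - torch.norm(safe_c - adv_c))
    # Adversarial cohesion: unsafe completions stay close to their centroid.
    cohesion = ((adv_emb - adv_c) ** 2).sum(dim=1).mean()
    return separation + cohesion_weight * cohesion

torch.manual_seed(0)
safe = torch.randn(16, 64)          # stand-in pooled embeddings
adv = torch.randn(16, 64) + 0.5
reg = grace_regularizer(safe, adv)  # added to the DPO-style preference loss
print(float(reg))
```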
[NLP-24] Advancing STT for Low-Resource Real-World Speech
Quick Read: This paper addresses the poor real-world speech-to-text (STT) performance for Swiss German, a low-resource language with diverse dialects and no standardized written form, where existing controlled, sentence-level corpora fail on spontaneous conversational speech. The key to the solution is the SRB-300 dataset, a 300-hour annotated corpus of real-world long-audio recordings covering all major Swiss German dialects in realistic environments, together with fine-tuning OpenAI Whisper models on it, which markedly improves word error rate (WER) and BLEU; the best fine-tuned model, large-v3, reaches a WER of 17.1% and a BLEU score of 74.8.
Link: https://arxiv.org/abs/2506.08836
Authors: Flavio D'Intino, Hans-Peter Hutter
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Conference: HCI International 2025, 20 pages, 4 figures
Abstract:Swiss German is a low-resource language represented by diverse dialects that differ significantly from Standard German and from each other, lacking a standardized written form. As a result, transcribing Swiss German involves translating into Standard German. Existing datasets have been collected in controlled environments, yielding effective speech-to-text (STT) models, but these models struggle with spontaneous conversational speech. This paper, therefore, introduces the new SRB-300 dataset, a 300-hour annotated speech corpus featuring real-world long-audio recordings from 39 Swiss German radio and TV stations. It captures spontaneous speech across all major Swiss dialects recorded in various realistic environments and overcomes the limitation of prior sentence-level corpora. We fine-tuned multiple OpenAI Whisper models on the SRB-300 dataset, achieving notable enhancements over previous zero-shot performance metrics. Improvements in word error rate (WER) ranged from 19% to 33%, while BLEU scores increased between 8% and 40%. The best fine-tuned model, large-v3, achieved a WER of 17.1% and a BLEU score of 74.8. This advancement is crucial for developing effective and robust STT systems for Swiss German and other low-resource languages in real-world contexts.
[NLP-25] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Quick Read: This paper addresses the inaccurate representation of diverse cultural contexts by text-to-image (T2I) models, measuring their alignment with both explicit and implicit cultural expectations. The key to the solution is CulturalFrames, a new benchmark for rigorous human evaluation of cultural representation in visual generations, spanning 10 countries and 5 socio-cultural domains with 983 prompts, 3,637 images from 4 state-of-the-art T2I models, and over 10k detailed human annotations; it systematically quantifies T2I models' deviations from cultural expectations and reveals that existing evaluation metrics correlate poorly with human judgments of cultural alignment.
Link: https://arxiv.org/abs/2506.08835
Authors: Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.
[NLP-26] The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation
Quick Read: This paper addresses the difficulty of extracting traffic-accident information (such as disability percentages and compensation amounts) from legal documents, which is crucial for quantifying insurance costs, given the subtle arguments and reasoning in court decisions. The key to the solution is a two-step pipeline: segment the document to identify the most relevant passages, then extract entities with a large language model (LLM). For segmentation, a classic regular-expression method is compared with dividing the document into n-token blocks vectorized with multilingual embedding models (text-embedding-ada-002 / MiniLM-L12-v2) for semantic search, with the latter performing better; prompted LLMs (LLaMA-2 7b/70b, LLaMA-3 8b, and GPT-4 Turbo) then extract entities, and LoRA fine-tuning of the LLaMA models substantially reduces hallucinations and improves extraction accuracy.
Link: https://arxiv.org/abs/2506.08827
Authors: Francisco Vargas, Alejandro González Coene, Gaston Escalante, Exequiel Lobón, Manuel Pulido
Affiliations: FaCENA, Universidad Nacional del Nordeste, Corrientes; Legalhub S.A., Buenos Aires; Instituto de Modelado e Innovación Tecnológica, CONICET
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The extraction of information about traffic accidents from legal documents is crucial for quantifying insurance company costs. Extracting entities such as percentages of physical and/or psychological disability and the involved compensation amounts is a challenging process, even for experts, due to the subtle arguments and reasoning in the court decision. A two-step procedure is proposed: first, segmenting the document to identify the most relevant segments, and then extracting the entities. For text segmentation, two methodologies are compared: a classic method based on regular expressions and a second approach that divides the document into blocks of n-tokens, which are then vectorized using multilingual models for semantic searches (text-embedding-ada-002 / MiniLM-L12-v2). Subsequently, large language models (LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to the selected segments for entity extraction. For the LLaMA models, fine-tuning is performed using LoRA. LLaMA-2 7b, even with zero temperature, shows a significant number of hallucinations in extractions, which are an important point of contention for named entity extraction. This work shows that these hallucinations are substantially reduced after fine-tuning the model. The performance of the methodology based on segment vectorization and subsequent use of LLMs significantly surpasses the classic method, which achieves an accuracy of 39.5%. Among open-source models, LLaMA-2 70B with fine-tuning achieves the highest accuracy (79.4%), surpassing its base version (61.7%). Notably, the base LLaMA-3 8B model already performs comparably to the fine-tuned LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model development. Meanwhile, GPT-4 Turbo achieves the highest accuracy at 86.1%.
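The block-vectorization step might look like the following sketch: chunk the ruling into fixed-size token blocks, embed them with a multilingual MiniLM encoder, and rank blocks against the query; the chunk size, query, and toy document are illustrative, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

def chunk(tokens: list[str], n: int = 128) -> list[str]:
    """Split a token list into blocks of n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
document_tokens = "El tribunal fija una incapacidad del 15 por ciento ...".split()
blocks = chunk(document_tokens, n=8)

query_emb = model.encode("porcentaje de incapacidad", convert_to_tensor=True)
block_embs = model.encode(blocks, convert_to_tensor=True)
scores = util.cos_sim(query_emb, block_embs)[0]
best = scores.argmax().item()
# The top-ranked block(s) are then passed to the LLM prompt for entity extraction.
print(blocks[best], float(scores[best]))
```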
[NLP-27] Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
Quick Read: This paper addresses the lack of systematic evaluation of large language models (LLMs) in data science, surveying how LLM assistants and agents are evaluated across data-science tasks. The key findings expose limitations of current work: a dominant focus on a small subset of goal-oriented activities, a concentration on pure assistance or fully autonomous agents while ignoring intermediate levels of human-AI collaboration, and an emphasis on human substitution that neglects higher levels of automation via task transformation; together these point to directions for better integration and evaluation of LLMs in data science.
Link: https://arxiv.org/abs/2506.08800
Authors: Irene Testini, José Hernández-Orallo, Lorenzo Pacchiardi
Affiliations: Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK; Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Spain
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) are increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances–such as code execution and knowledge bases–that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.
[NLP-28] Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery KDD2025
Quick Read: This paper addresses the unreliability of knowledge-based causal discovery for inferring causal relationships between variable pairs in complex systems, where LLM-only methods produce unstable and inconsistent results. The key to the solution is integrating Knowledge Graphs (KGs) with large language models (LLMs): informative metapath-based subgraphs are identified within KGs, the subgraph selection is refined with Learning-to-Rank models, and the top-ranked subgraphs are incorporated into zero-shot prompts to improve the LLMs' causal inference.
Link: https://arxiv.org/abs/2506.08771
Authors: Yuni Susanti, Michael Färber
Affiliations: Fujitsu Limited; ScaDS.AI, TU Dresden
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at KDD 2025 (full research paper)
Abstract:Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery – which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) – offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: this https URL
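A sketch of the final prompting step: serialize the top-ranked metapath subgraphs into a zero-shot causal query. The paths and template are invented for illustration.

```python
def paths_to_prompt(var_a: str, var_b: str, ranked_paths: list[list[str]],
                    k: int = 3) -> str:
    """Format the k best metapaths as textual evidence for a zero-shot prompt."""
    lines = [" -> ".join(p) for p in ranked_paths[:k]]
    evidence = "\n".join(f"({i + 1}) {l}" for i, l in enumerate(lines))
    return (
        f"Knowledge-graph paths linking {var_a} and {var_b}:\n{evidence}\n\n"
        f"Question: Does {var_a} cause {var_b}? "
        "Answer 'yes', 'no', or 'uncertain', with a one-sentence justification."
    )

ranked = [["smoking", "ASSOCIATED_WITH", "tar exposure", "CAUSES", "lung cancer"],
          ["smoking", "TREATED_BY", "nicotine patch"]]
print(paths_to_prompt("smoking", "lung cancer", ranked))
```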
[NLP-29] AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
【Quick Read】: This paper addresses the underexplored performance of Large Language Models (LLMs) on Arabic data, whose rich morphology, diverse dialects, and complex script pose particular challenges. The key contribution is a systematic evaluation of zero-shot, few-shot, and fine-tuning strategies across fifteen Arabic NLP tasks, with a focus on the reasoning-oriented DeepSeek models on complex inference tasks. The study finds that carefully chosen in-context examples substantially boost classification performance, that reasoning-focused DeepSeek architectures outperform a strong baseline in the zero-shot setting, and that LoRA fine-tuning yields further gains compared to equivalent increases in model scale.
Link: https://arxiv.org/abs/2506.08768
Authors: Ahmed Hasanaath, Aisha Alansari, Ahmed Ashraf, Chafik Salmane, Hamzah Luqman, Saad Ezzini
Institutions: King Fahd University of Petroleum and Minerals; Mohammed VI Polytechnic University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks-boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at this https URL
[NLP-30] Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL
【Quick Read】: This paper targets the accuracy and transparency problems of retrieving information from nuclear power plant operational data, especially with natural-language-to-SQL (NL-to-SQL) approaches: end-users cannot easily validate generated SQL, and legacy plant databases are complex and poorly structured, which makes query generation error-prone and erodes trust. The key solution is a function-calling large language model (LLM): instead of generating SQL directly, a set of pre-approved, purpose-specific functions encapsulating validated SQL logic is defined, so that queries are reviewed and optimized by experts before deployment, mitigating the risks of direct NL-to-SQL translation.
Link: https://arxiv.org/abs/2506.08757
Authors: Mishca de Costa, Muhammad Anwar, Dave Mercier, Mark Randall, Issam Hammad
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 44th Annual CNS Conference and the 49th Annual CNS/CNA Student Conference, Westin Harbour Castle Hotel, Toronto, ON, Canada, June 8-11, 2025
Abstract:Retrieving operational data from nuclear power plants requires exceptional accuracy and transparency due to the criticality of the decisions it supports. Traditionally, natural language to SQL (NL-to-SQL) approaches have been explored for querying such data. While NL-to-SQL promises ease of use, it poses significant risks: end-users cannot easily validate generated SQL queries, and legacy nuclear plant databases – often complex and poorly structured – complicate query generation due to decades of incremental modifications. These challenges increase the likelihood of inaccuracies and reduce trust in the approach. In this work, we propose an alternative paradigm: leveraging function-calling large language models (LLMs) to address these challenges. Instead of directly generating SQL queries, we define a set of pre-approved, purpose-specific functions representing common use cases. Queries are processed by invoking these functions, which encapsulate validated SQL logic. This hybrid approach mitigates the risks associated with direct NL-to-SQL translations by ensuring that SQL queries are reviewed and optimized by experts before deployment. While this strategy introduces the upfront cost of developing and maintaining the function library, we demonstrate how NL-to-SQL tools can assist in the initial generation of function code, allowing experts to focus on validation rather than creation. Our study includes a performance comparison between direct NL-to-SQL generation and the proposed function-based approach, highlighting improvements in accuracy and maintainability. This work underscores the importance of balancing user accessibility with operational safety and provides a novel, actionable framework for robust data retrieval in critical systems.
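To make the function-calling paradigm concrete, here is a minimal Python sketch of the kind of pre-approved function registry the abstract describes. The table, function name, schema, and SQL below are hypothetical illustrations, not the paper's actual library: the point is that the LLM only chooses a registered function and its arguments, so no free-form SQL ever reaches the database.

```python
# Minimal sketch of the function-calling paradigm; all names and SQL are
# hypothetical illustrations, not the paper's actual library.
import json
import sqlite3

# Pre-approved functions wrapping expert-validated SQL.
FUNCTION_REGISTRY = {
    "get_sensor_readings": {
        "sql": "SELECT timestamp, value FROM sensor_readings "
               "WHERE sensor_id = ? AND timestamp >= ?",
        "params": ["sensor_id", "start_time"],
        "description": "Readings for one sensor since a start time.",
    },
}

def dispatch(call_json: str, conn: sqlite3.Connection):
    """Execute a function call chosen by the LLM.

    `call_json` is the model's structured output, e.g.
    {"name": "get_sensor_readings",
     "arguments": {"sensor_id": "PT-101", "start_time": "2025-01-01"}}
    Only registered functions can run, so no free-form SQL reaches the DB.
    """
    call = json.loads(call_json)
    spec = FUNCTION_REGISTRY[call["name"]]  # KeyError rejects unknown calls
    args = [call["arguments"][p] for p in spec["params"]]
    return conn.execute(spec["sql"], args).fetchall()
```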
[NLP-31] Factors affecting the in-context learning abilities of LLMs for dialogue state tracking INTERSPEECH2025
【Quick Read】: This paper studies how to apply in-context learning (ICL) to the dialogue state tracking (DST) problem and which factors determine its effectiveness. The key to the approach is a sentence-embedding-based k-nearest-neighbour method for retrieving suitable demonstrations for ICL; the selected demonstrations are structured with the test sample in a predefined template and fed to the large language model (LLM) to improve DST performance.
Link: https://arxiv.org/abs/2506.08753
Authors: Pradyoth Hegde, Santosh Kesiraju, Jan Švec, Šimon Sedláček, Bolaji Yusuf, Oldřich Plchot, Deepak K T, Jan Černocký
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2025
Abstract:This study explores the application of in-context learning (ICL) to the dialogue state tracking (DST) problem and investigates the factors that influence its effectiveness. We use a sentence embedding based k-nearest neighbour method to retrieve the suitable demonstrations for ICL. The selected demonstrations, along with the test samples, are structured within a template as input to the LLM. We then conduct a systematic study to analyse the impact of factors related to demonstration selection and prompt context on DST performance. This work is conducted using the MultiWoZ2.4 dataset and focuses primarily on the OLMo-7B-instruct, Mistral-7B-Instruct-v0.3, and Llama3.2-3B-Instruct models. Our findings provide several useful insights on in-context learning abilities of LLMs for dialogue state tracking.
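As a rough illustration of the retrieval step, the sketch below selects demonstrations by cosine similarity over sentence embeddings and fills a simple template. The `embed` function stands in for any sentence encoder, and the template is a guess at the general shape of a DST prompt rather than the paper's exact format.

```python
# Illustrative k-NN demonstration selection for ICL; `embed` is assumed to
# return a 1-D numpy embedding from any sentence encoder.
import numpy as np

def select_demonstrations(test_turn, pool_turns, embed, k=4):
    """Return the k pool dialogues most similar to the test turn."""
    q = embed(test_turn)
    pool = np.stack([embed(t["utterance"]) for t in pool_turns])
    # Cosine similarity between the test turn and every candidate.
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [pool_turns[i] for i in top]

def build_prompt(demos, test_turn):
    parts = ["Track the dialogue state as slot=value pairs.\n"]
    for d in demos:
        parts.append(f"Dialogue: {d['utterance']}\nState: {d['state']}\n")
    parts.append(f"Dialogue: {test_turn}\nState:")
    return "\n".join(parts)
```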
[NLP-32] Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data
【Quick Read】: This paper addresses the fact that much of the nuclear industry's valuable information is locked in unstructured text that cannot be used directly for advanced Large Language Model (LLM) applications. The core challenges are data scarcity and privacy concerns; the key to the solution is synthetic data generation, which transforms existing text into structured question-answer pairs usable for model training, fine-tuning, and evaluation. The approach leverages LLMs to analyze text, extract key information, generate relevant questions, and assess the quality of the resulting synthetic dataset, unlocking the potential of LLMs in the nuclear domain.
Link: https://arxiv.org/abs/2506.08750
Authors: Muhammad Anwar, Daniel Lau, Mishca de Costa, Issam Hammad
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The nuclear industry possesses a wealth of valuable information locked away in unstructured text data. This data, however, is not readily usable for advanced Large Language Model (LLM) applications that require clean, structured question-answer pairs for tasks like model training, fine-tuning, and evaluation. This paper explores how synthetic data generation can bridge this gap, enabling the development of robust LLMs for the nuclear domain. We discuss the challenges of data scarcity and privacy concerns inherent in the nuclear industry and how synthetic data provides a solution by transforming existing text data into usable QA pairs. This approach leverages LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. By unlocking the potential of LLMs in the nuclear industry, synthetic data can pave the way for improved information retrieval, enhanced knowledge sharing, and more informed decision-making in this critical sector.
[NLP-33] Towards Secure and Private Language Models for Nuclear Power Plants
【Quick Read】: This paper addresses the problem of building a domain-specific large language model for the nuclear sector that meets cybersecurity and data-confidentiality requirements. The key to the solution is to build the model from the publicly accessible Essential CANDU textbook, using a compact Transformer architecture trained on a single GPU to protect sensitive operational data. By focusing exclusively on nuclear content, the approach demonstrates the feasibility of in-house LLM deployment, while also revealing the need for richer corpora, more sophisticated preprocessing, and instruction fine-tuning to improve domain accuracy.
Link: https://arxiv.org/abs/2506.08746
Authors: Muhammad Anwar, Mishca de Costa, Issam Hammad, Daniel Lau
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces a domain-specific Large Language Model for nuclear applications, built from the publicly accessible Essential CANDU textbook. Drawing on a compact Transformer-based architecture, the model is trained on a single GPU to protect the sensitive data inherent in nuclear operations. Despite relying on a relatively small dataset, it shows encouraging signs of capturing specialized nuclear vocabulary, though the generated text sometimes lacks syntactic coherence. By focusing exclusively on nuclear content, this approach demonstrates the feasibility of in-house LLM solutions that align with rigorous cybersecurity and data confidentiality standards. Early successes in text generation underscore the model’s utility for specialized tasks, while also revealing the need for richer corpora, more sophisticated preprocessing, and instruction fine-tuning to enhance domain accuracy. Future directions include extending the dataset to cover diverse nuclear subtopics, refining tokenization to reduce noise, and systematically evaluating the model’s readiness for real-world applications in nuclear domain.
[NLP-34] Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
【Quick Read】: This paper aims to remove the dependence of reinforcement learning (RL) on external supervision for complex reasoning tasks, which limits its broader applicability. The key is a self-rewarding RL framework that strengthens Large Language Model (LLM) reasoning by exploiting the consistency of intermediate reasoning states across different reasoning trajectories. The core insight is that correct responses exhibit consistent trajectory patterns in model likelihood: their intermediate states converge toward their own final answer (high consistency) with little deviation toward other candidates (low volatility). Building on this, the authors design CoVo, which integrates consistency and volatility via a robust vector-space aggregation strategy and adds a curiosity bonus to encourage diverse exploration, enabling LLMs to perform self-rewarding RL without external supervision.
Link: https://arxiv.org/abs/2506.08745
Authors: Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, Dacheng Tao
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recent advances of Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits the broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at this https URL.
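One plausible numpy rendering of the consistency/volatility intuition is sketched below: given a per-step likelihood score over candidate final answers, a trajectory is rewarded when its steps converge on its own answer and waver little across rivals. CoVo's actual vector-space aggregation differs; this is only a schematic.

```python
# Schematic consistency/volatility reward; the exact CoVo aggregation differs.
import numpy as np

def covo_reward(step_scores: np.ndarray, own_answer: int,
                curiosity_bonus: float = 0.0) -> float:
    """step_scores: (T, C) likelihood of each of C candidate answers
    at each of T intermediate reasoning states."""
    toward_own = step_scores[:, own_answer]
    others = np.delete(step_scores, own_answer, axis=1)
    consistency = toward_own.mean()         # high: steps point at own answer
    volatility = others.std(axis=0).mean()  # high: steps waver across rivals
    return consistency - volatility + curiosity_bonus

# Example: 3 steps, 4 candidate answers, trajectory converging on answer 2.
scores = np.array([[0.10, 0.20, 0.60, 0.10],
                   [0.05, 0.10, 0.80, 0.05],
                   [0.02, 0.03, 0.90, 0.05]])
print(covo_reward(scores, own_answer=2))
```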
[NLP-35] Societal AI Research Has Become Less Interdisciplinary
【Quick Read】: This paper asks how ethical values and societal concerns are integrated into technical AI research as AI systems become embedded in everyday life, and how the contributions of research teams with different disciplinary backgrounds differ in practice. The key to the solution is a classifier that identifies societal content in papers and measures the extent to which they express such considerations, enabling an analysis of how interdisciplinary teams and computer-science-only teams compare, and how the field has shifted over time.
Link: https://arxiv.org/abs/2506.08738
Authors: Dror Kris Markus, Fabrizio Gilardi, Daria Stetsenko
Institutions: University of Zurich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:As artificial intelligence (AI) systems become deeply embedded in everyday life, calls to align AI development with ethical and societal values have intensified. Interdisciplinary collaboration is often championed as a key pathway for fostering such engagement. Yet it remains unclear whether interdisciplinary research teams are actually leading this shift in practice. This study analyzes over 100,000 AI-related papers published on ArXiv between 2014 and 2024 to examine how ethical values and societal concerns are integrated into technical AI research. We develop a classifier to identify societal content and measure the extent to which research papers express these considerations. We find a striking shift: while interdisciplinary teams remain more likely to produce societally-oriented research, computer science-only teams now account for a growing share of the field’s overall societal output. These teams are increasingly integrating societal concerns into their papers and tackling a wide range of domains - from fairness and safety to healthcare and misinformation. These findings challenge common assumptions about the drivers of societal AI and raise important questions. First, what are the implications for emerging understandings of AI safety and governance if most societally-oriented research is being undertaken by exclusively technical teams? Second, for scholars in the social sciences and humanities: in a technical field increasingly responsive to societal demands, what distinctive perspectives can we still offer to help shape the future of AI?
[NLP-36] Improved LLM Agents for Financial Document Question Answering
【Quick Read】: This paper addresses the weakness of large language models (LLMs) at numerical question answering over financial documents that mix tabular and textual data. The key to the solution is an improved critic agent together with a calculator agent, which outperforms the previous state-of-the-art approach (program-of-thought) without requiring oracle labels, while also being safer. The study further investigates how the agents interact with each other and how this interaction affects their performance.
Link: https://arxiv.org/abs/2506.08726
Authors: Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe
Institutions: American Express
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures
Abstract:Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have showed the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and show, through experiments, that this critic agent’s performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with the calculator agent which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
[NLP-37] Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition INTERSPEECH2025
【Quick Read】: This paper tackles the problem of building a single model for multilingual speech emotion recognition (SER) that generalizes across languages. The key to the solution is a novel language-aware multi-teacher knowledge distillation method that fuses the knowledge of several monolingual teacher models and transfers it into one unified multilingual student model, improving emotion recognition across languages.
Link: https://arxiv.org/abs/2506.08717
Authors: Mehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2025 conference
Abstract:Speech Emotion Recognition (SER) is crucial for improving human-computer interaction. Despite strides in monolingual SER, extending them to build a multilingual system remains challenging. Our goal is to train a single model capable of multilingual SER by distilling knowledge from multiple teacher models. To address this, we introduce a novel language-aware multi-teacher knowledge distillation method to advance SER in English, Finnish, and French. It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and then distills their knowledge into a single multilingual student model. The student model demonstrates state-of-the-art performance, with a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines. Our method excels in improving recall for sad and neutral emotions, although it still faces challenges in recognizing anger and happiness.
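A schematic version of the distillation objective might look as follows: the student matches each monolingual teacher's temperature-softened posteriors, but only on utterances in that teacher's language. The temperature-scaled KL term is standard knowledge distillation; the language gating is an assumption about what "language-aware" means here.

```python
# Schematic language-aware multi-teacher KD loss; the gating rule is an
# assumption, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def language_aware_kd_loss(student_logits, teacher_logits_by_lang,
                           lang_ids, tau=2.0):
    """student_logits: (B, C); teacher_logits_by_lang: {lang: (B, C)};
    lang_ids: one language tag per batch item."""
    loss = 0.0
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    for lang, t_logits in teacher_logits_by_lang.items():
        mask = torch.tensor([l == lang for l in lang_ids], dtype=torch.bool)
        if mask.any():
            p_t = F.softmax(t_logits[mask] / tau, dim=-1)
            # Each teacher supervises only its own language's utterances.
            loss = loss + F.kl_div(log_p_s[mask], p_t,
                                   reduction="batchmean") * tau ** 2
    return loss
```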
[NLP-38] Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
【Quick Read】: This paper addresses the challenges of validating assurance cases through a claim-argument-evidence framework for complex regulated systems, including the intricacy of legal and technical texts, the need for model explanations, and limited access to assurance-case data. The key to the solution is an NLI-based compliance detection approach, EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM), which formulates the claim-argument-evidence structure of an assurance case as a multi-hop inference task for explainable and traceable compliance detection, and mitigates data scarcity by generating assurance cases with large language models (LLMs).
Link: https://arxiv.org/abs/2506.08713
Authors: Fariz Ikhwantri, Dusica Marijan
Institutions: Simula Research Laboratory
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:
Abstract:Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
[NLP-39] ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Large Language Model Preference Optimization ICML2025
【Quick Read】: This paper addresses preference learning in Large Language Models (LLMs), i.e., aligning model outputs with human preferences. Traditional Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) uniformly adjust the probabilities of all tokens, which can lead to over-optimization (reward hacking) and degrade alignment quality. The key to the proposed ConfPO is to identify and optimize preference-critical tokens based on the training policy's confidence, using the KL-divergence budget more efficiently and improving alignment while remaining lightweight and model-free, with no auxiliary models or extra compute.
Link: https://arxiv.org/abs/2506.08712
Authors: Hee Suk Yoon, Eunseop Yoon, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy’s confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.
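The sketch below illustrates the general shape of confidence-based token selection: keep the tokens the policy is least confident about and apply a DPO-style loss only there. The keep-fraction threshold is a stand-in; ConfPO's actual selection criterion may differ.

```python
# Hedged sketch of confidence-based token selection plus a masked DPO-style
# loss; the thresholding rule is an assumption, not ConfPO's exact criterion.
import torch
import torch.nn.functional as F

def low_confidence_mask(logits, tokens, keep_frac=0.3):
    """Select the keep_frac least-confident tokens (lowest policy prob)."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, T)
    k = max(1, int(keep_frac * tok_logp.shape[-1]))
    thresh = tok_logp.kthvalue(k, dim=-1, keepdim=True).values
    return (tok_logp <= thresh).float()

def masked_dpo_loss(pi_lp_w, ref_lp_w, pi_lp_l, ref_lp_l,
                    mask_w, mask_l, beta=0.1):
    """Per-token log-probs (B, T) for chosen (w) / rejected (l) responses;
    only masked (preference-critical) tokens contribute to the objective."""
    adv_w = ((pi_lp_w - ref_lp_w) * mask_w).sum(-1)
    adv_l = ((pi_lp_l - ref_lp_l) * mask_l).sum(-1)
    return -F.logsigmoid(beta * (adv_w - adv_l)).mean()
```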
[NLP-40] ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts
【Quick Read】: This paper addresses the neglect of scientific charts in scientific fact-checking, even though charts are central to presenting quantitative evidence and statistical reasoning. The key to the solution is ClimateViz, the first large-scale benchmark built from expert-curated scientific charts, containing 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. Each example additionally comes with structured knowledge-graph explanations covering trends, comparisons, and causal relations to improve interpretability.
Link: https://arxiv.org/abs/2506.08700
Authors: Ruiran Su, Jiasheng Si, Zhijiang Guo, Janet B. Pierrehumbert
Institutions: University of Oxford; Qilu University of Technology (Shandong Academy of Sciences); Hong Kong University of Science and Technology; Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are key for presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking using expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes structured knowledge graph explanations covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, including both proprietary and open-source systems, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We released our dataset and code alongside the paper.
[NLP-41] Brevity is the soul of sustainability: Characterizing LLM response lengths ACL2025
【Quick Read】: This paper tackles the high energy consumption of Large Language Models (LLMs) at inference time by reducing redundancy in model outputs. The key to the solution is simple and intuitive prompt engineering: prompts designed to control output length and information content achieve significant energy savings of 25-60% by shortening responses while preserving their quality.
Link: https://arxiv.org/abs/2506.08686
Authors: Soham Poddar, Paramita Koley, Janardan Misra, Sanjay Podder, Navveen Balani, Niloy Ganguly, Saptarshi Ghosh
Institutions: Indian Institute of Technology, Kharagpur, India; Indian Statistical Institute, Kolkata, India; Accenture Labs, Bangalore, India
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted to appear at the ACL 2025 findings
Abstract:A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy optimization between 25-60% by reducing the response length while preserving the quality of LLM responses.
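As a toy illustration of the kind of length-targeting prompts studied here (the paper's exact wording is not reproduced), one might wrap a question with a brevity instruction and track output length as a rough proxy for inference cost:

```python
# Toy length-targeting prompts; wording is illustrative, not the paper's.
CONCISE_PREFIXES = [
    "Answer in at most two sentences.",
    "Give only the minimal answer, no background or caveats.",
    "Respond with a single short paragraph.",
]

def wrap(question: str, style: int = 0) -> str:
    return f"{CONCISE_PREFIXES[style]}\n\nQuestion: {question}\nAnswer:"

def whitespace_tokens(text: str) -> int:
    # Fewer output tokens roughly means less inference compute and energy.
    return len(text.split())
```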
[NLP-42] RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
【Quick Read】: This paper asks whether small reasoning models (SRMs) can learn rule-based reasoning effectively and generalize robustly across diverse tasks and domains. The key to the solution is RuleReasoner, which performs rule-based reasoning via reinforcement learning over a broad collection of curated tasks with a novel domain-aware dynamic sampling strategy: each training batch is resampled by updating per-domain sampling weights based on historical rewards, enabling domain augmentation and flexible online learning schedules without the pre-hoc human-engineered mix-training recipes used in prior methods.
Link: https://arxiv.org/abs/2506.08672
Authors: Yang Liu, Jiaqi Li, Zilong Zheng
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 10 figures, 8 tables
Abstract:Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
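The domain-aware dynamic sampling loop can be pictured as follows: domains with lower historical reward get higher sampling weight, so each batch keeps drilling the model's weak spots. The softmax-with-temperature form is an assumption, not the paper's formula.

```python
# Schematic domain-aware dynamic sampling; the softmax weighting is an
# assumption, not RuleReasoner's exact update rule.
import numpy as np

def domain_weights(mean_rewards: dict, temperature: float = 0.5) -> dict:
    names = list(mean_rewards)
    r = np.array([mean_rewards[n] for n in names])
    logits = -r / temperature        # low reward -> high sampling weight
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return dict(zip(names, w))

def sample_batch(pool: dict, mean_rewards: dict, batch_size: int, rng):
    """pool maps domain name -> list of training examples."""
    w = domain_weights(mean_rewards)
    names = list(w)
    picks = rng.choice(names, size=batch_size, p=[w[n] for n in names])
    return [pool[d][rng.integers(len(pool[d]))] for d in picks]

rng = np.random.default_rng(0)
pool = {"logic": ["ex1", "ex2"], "legal": ["ex3"], "math": ["ex4", "ex5"]}
rewards = {"logic": 0.9, "legal": 0.4, "math": 0.6}  # legal is weakest
print(sample_batch(pool, rewards, batch_size=6, rng=rng))
```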
[NLP-43] Summarization for Generative Relation Extraction in the Microbiome Domain
【Quick Read】: This paper addresses the extraction of microbial interaction relations in the gut microbiome, a complex and low-resource biomedical domain. The key to the solution is a generative relation extraction (RE) pipeline that first uses large language models (LLMs) for summarization to refine the context, then extracts relations with instruction-tuned generation, reducing noise and improving model performance. Nevertheless, BERT-based RE approaches still outperform the generative models for now.
Link: https://arxiv.org/abs/2506.08647
Authors: Oumaima El Khettari, Solen Quiniou, Samuel Chaffron
Institutions: Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resources setting.
[NLP-44] TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning ACL2025
【Quick Read】: This paper addresses two limitations of existing LLM-based data synthesis for table instruction tuning: insufficient exploration of the input space of table understanding tasks, which limits data diversity, and a blind pursuit of data quantity that ignores the target LLM's weaknesses in table understanding, which hurts data efficiency. The key to the solution is TableDreamer, a progressive and weakness-guided data synthesis framework: diverse tables and related instructions are first synthesized as seed data, and the input space is then explored iteratively under the guidance of newly identified weakness data, which ultimately serves as the training data for fine-tuning the target LLM.
Link: https://arxiv.org/abs/2506.08646
Authors: Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang
Institutions: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Baidu Inc, Beijing, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages, 19 figures, Findings of ACL 2025
Abstract:Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they can not thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data is available at this https URL
[NLP-45] MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models
【Quick Read】: This paper addresses the limited control over large language model (LLM) behavior at inference time: existing decoding strategies such as greedy search, sampling, or reranking do not explicitly optimize task-specific objectives. The key to the solution is the MEMETRON framework, which formulates LLM decoding as a discrete black-box optimization problem and searches the response space with the hybrid metaheuristics GENETRON and ANNETRON, guided by reward models and by contextual operations performed by the LLM itself, so that high-reward responses are discovered efficiently without model fine-tuning or gradient access.
Link: https://arxiv.org/abs/2506.08643
Authors: Son The Nguyen, Theja Tulabandhula
Institutions: University of Illinois Chicago
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly used for both open-ended and structured tasks, yet their inference-time behavior is still largely dictated by heuristic decoding strategies such as greedy search, sampling, or reranking. These methods provide limited control and do not explicitly optimize for task-specific objectives. We introduce MEMETRON, a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem. MEMETRON leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the response space, guided by reward models and contextual operations performed by the LLM itself. This approach enables efficient discovery of high-reward responses without requiring model retraining or gradient access. The framework is modular and generalizes across diverse tasks, requiring only a reward function and lightweight prompt templates. We evaluate our framework on the critical human preference alignment task and demonstrate that it significantly outperforms standard decoding and reranking methods, highlighting its potential to improve alignment without model retraining.
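In the spirit of GENETRON, a minimal genetic-search decoding loop might look like the sketch below, with `llm` and `reward` as placeholders for a text generator and a reward model; the LLM itself serves as the crossover/mutation operator via a contextual prompt.

```python
# Minimal genetic search over candidate responses; `llm` and `reward` are
# placeholders, and the loop is a sketch of the idea, not MEMETRON itself.
import random

def genetic_decode(prompt, llm, reward, pop_size=8, generations=5):
    population = [llm(prompt) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=reward, reverse=True)
        parents = scored[: pop_size // 2]          # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            # Crossover/mutation delegated to the LLM as a context operation.
            children.append(llm(
                f"Combine the best parts of the two answers below to "
                f"'{prompt}', then improve the result.\nA: {a}\nB: {b}"))
        population = parents + children
    return max(population, key=reward)
```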
[NLP-46] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval
【Quick Read】: This paper addresses the challenges of scientific reasoning, which requires long-chain reasoning, knowledge of domain-specific terminology, and adaptation to up-to-date findings. The key to the solution is RAISE, a step-by-step retrieval-augmented framework that retrieves logically relevant documents from an in-the-wild corpus in three steps: problem decomposition, logical query generation, and logical retrieval, effectively improving scientific reasoning performance.
Link: https://arxiv.org/abs/2506.08625
Authors: Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee
Institutions: Seoul National University; KAIST; Hanyang University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. We analyze that unlike other baselines, RAISE retrieves documents that are not only similar in terms of the domain knowledge, but also documents logically more relevant.
[NLP-47] Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models
【Quick Read】: This paper examines how personality traits affect Large Language Models (LLMs) on hate speech detection, in particular potential biases and inconsistencies induced by MBTI types. The key to the solution is to probe models with MBTI-based persona prompts and evaluate how different personality profiles change classification outcomes, revealing persona-driven variation including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. The findings underscore the need to define persona prompts carefully in LLM-based annotation workflows to ensure fairness and alignment with human values.
Link: https://arxiv.org/abs/2506.08593
Authors: Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, Michael Färber
Institutions: ScaDS.AI and TU Dresden; LMU Munich; Munich Center for Machine Learning (MCML)
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.
[NLP-48] Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings
【Quick Read】: This paper targets a limitation of text encoders in recognizing fine-grained entities or events, which causes dense retrieval to fail even on simple cases. The key to the solution is fine-tuning encoders with the proposed data generation strategies, which yields the best performance on the fine-grained matching task, while also identifying and addressing a granularity dilemma: the challenge for embeddings to express fine-grained salience while staying aligned with overall semantics.
Link: https://arxiv.org/abs/2506.08592
Authors: Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
Institutions: WeChat AI; South China University of Technology
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:This work focuses on an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within the semantics, resulting in failed dense retrieval on even simple cases. To examine such behaviors, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions, and queries are phrases inquiring entities or events in various forms. Zero-shot evaluation suggests that encoders may fail on these fine-grained matching, regardless of training sources or model sizes. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, which obtains the best performance on CapRetrieval. Within this process, we further identify an issue of granularity dilemma, a challenge for embeddings to express fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at this https URL.
[NLP-49] CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling
【Quick Read】: This paper addresses the untested behavior and safety of Large Language Models (LLMs) in realistic counseling scenarios, where their potential for mental health support remains unverified. The key to the solution is CounselBench, a large-scale benchmark developed with 100 mental health professionals, consisting of two components: CounselBench-EVAL for evaluating LLM responses in single-turn counseling, and CounselBench-Adv, an adversarial dataset for probing model-specific failure modes, providing a clinically grounded standard for improving LLM behavior in high-stakes mental health settings.
Link: https://arxiv.org/abs/2506.08584
Authors: Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, Ruishan Liu
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly proposed for use in mental health support, yet their behavior in realistic counseling scenarios remains largely untested. We introduce CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test LLMs in single-turn counseling. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of responses from GPT-4, LLaMA 3, Gemini, and online human therapists to real patient questions. Each response is rated along six clinically grounded dimensions, with written rationales and span-level annotations. We find that LLMs often outperform online human therapists in perceived quality, but experts frequently flag their outputs for safety concerns such as unauthorized medical advice. Follow-up experiments show that LLM judges consistently overrate model responses and overlook safety issues identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored counseling questions designed to trigger specific model issues. Evaluation across 2,880 responses from eight LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking and improving LLM behavior in high-stakes mental health settings.
[NLP-50] The Geometries of Truth Are Orthogonal Across Tasks
【Quick Read】: This paper examines the reliability of generative AI in practice, specifically whether the correctness of a Large Language Model's (LLM's) answer can be assessed by inspecting its inference-time activations. Prior work assumes a "geometry of truth" can be learned so that activations behind correct answers are linearly separable from those behind mistakes; this paper exposes a key limitation: such geometries are intrinsically task-dependent and fail to transfer across tasks. The crucial observation is that linear classifiers trained on distinct tasks share little similarity and, under sparsity-enforcing regularization, have almost disjoint supports, revealing an inherent flaw in classifying answers from activation vectors alone.
Link: https://arxiv.org/abs/2506.08572
Authors: Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein, Pierre Ablin, Marco Cuturi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still mired by concerns on their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a “geometry of truth” can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these “geometries of truth” are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.
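The style of analysis described here can be reproduced in miniature: fit one sparsity-regularized linear probe per task on (here, synthetic) activations and measure how little the probes' supports overlap across tasks.

```python
# Miniature cross-task probe comparison on synthetic data; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
acts = {t: rng.normal(size=(200, d)) for t in ("taskA", "taskB")}
labels = {t: rng.integers(0, 2, size=200) for t in ("taskA", "taskB")}

supports = {}
for t in acts:
    # L1 penalty enforces sparsity, so each probe has a small support set.
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(acts[t], labels[t])
    supports[t] = set(np.flatnonzero(probe.coef_[0]))

inter = supports["taskA"] & supports["taskB"]
union = supports["taskA"] | supports["taskB"]
print("support Jaccard overlap:", len(inter) / max(1, len(union)))
```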
[NLP-51] Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?
【Quick Read】: This paper addresses the analysis of language relationships on a global scale, in particular capturing the evolution of diverse linguistic features such as syntax, phonology, and prosody with machine learning. The key to the solution is to use the language embeddings produced by voxlingua107-xls-r-300m-wav2vec, a fine-tuned XLS-R self-supervised language identification model: 106 world languages are clustered and compared via linear discriminant analysis (LDA), effectively capturing genealogical, lexical, and geographical distances between languages. The method requires no expert analysis of specific linguistic features, enabling large-scale, data-driven analysis of linguistic variation directly from speech.
Link: https://arxiv.org/abs/2506.08564
Authors: Tuukka Törö, Antti Suni, Juraj Šimko
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: 27 pages, 11 figures (+5 supplementary), submitted to PLOS One
Abstract:Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology and prosody, which evolve at varying rates influenced by internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on expert labor in analyzing specific linguistic features, these new methods enable the exploration of linguistic variation through embeddings derived directly from speech, opening new avenues for large-scale, data-driven analyses. This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model voxlingua107-xls-r-300m-wav2vec, to analyze relationships between 106 world languages based on speech recordings. Using linear discriminant analysis (LDA), language embeddings are clustered and compared with genealogical, lexical, and geographical distances. The results demonstrate that embedding-based distances align closely with traditional measures, effectively capturing both global and local typological patterns. Challenges in visualizing relationships, particularly with hierarchical clustering and network-based methods, highlight the dynamic nature of language change. The findings show potential for scalable analyses of language variation based on speech embeddings, providing new perspectives on relationships among languages. By addressing methodological considerations such as corpus size and latent space dimensionality, this approach opens avenues for studying low-resource languages and bridging macro- and micro-level linguistic variation. Future work aims to extend these methods to underrepresented languages and integrate sociolinguistic variation for a more comprehensive understanding of linguistic diversity.
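A rough sketch of such a pipeline, on synthetic inputs: project per-recording embeddings with LDA, take inter-language centroid distances, and correlate them with an external distance matrix (lexical, genealogical, or geographic).

```python
# Sketch of the embedding-distance analysis on synthetic stand-in data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 128))       # one embedding per speech recording
langs = rng.integers(0, 6, size=300)  # 6 languages

Z = LinearDiscriminantAnalysis(n_components=5).fit_transform(X, langs)
centroids = np.stack([Z[langs == l].mean(0) for l in range(6)])
emb_dist = pdist(centroids)           # pairwise inter-language distances

ref_dist = rng.random(emb_dist.shape)  # stand-in for lexical distances
rho, p = spearmanr(emb_dist, ref_dist)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```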
[NLP-52] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models
【Quick Read】: This paper addresses two problems: the heavy token overhead and fixed reasoning trajectories of Chain-of-Thought prompting in large language models, and the key challenge of how to effectively update reasoning embeddings during post-training to steer the model toward more accurate solutions. The key to the solution is a lightweight post-training framework with two novel strategies: 1) contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions; and 2) residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence.
Link: https://arxiv.org/abs/2506.08552
Authors: Xinyuan Wang, Dongjie Wang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Sixun Dong, Kunpeng Liu, Yanjie Fu
Institutions: Arizona State University; University of Kansas; Portland State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from sufficient token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model’s latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, a 5% accuracy gain on MathQA without additional training.
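Purely as a schematic reading of the two strategies, the update rule below moves a reasoning embedding along a contrastive direction derived from strong and weak baselines, stabilized by a momentum-style accumulation of past updates; all coefficients and the exact form are assumptions, not the paper's equations.

```python
# Schematic latent-refinement update; coefficients and form are assumptions.
import numpy as np

def refine_embedding(e, e_strong, e_weak, history, alpha=0.1, beta=0.9):
    # 1) Contrastive reasoning feedback: direction from weak toward strong.
    g = e_strong - e_weak
    # 2) Residual embedding refinement: blend current and past directions.
    history = beta * history + (1.0 - beta) * g
    return e + alpha * history, history

e = np.zeros(16)
hist = np.zeros(16)
for _ in range(5):
    e, hist = refine_embedding(e, np.ones(16), -np.ones(16), hist)
print(np.round(e[:4], 3))  # embedding drifts steadily toward the strong side
```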
[NLP-53] CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations ACL
【Quick Read】: This paper addresses the challenges of discourse parsing in conversations over multi-domain, code-mixed corpora. Existing discourse parsing datasets are built from written English dialogues restricted to a single domain and do not reflect the complexity of real-world scenarios. To address this, the authors introduce CoMuMDR: a code-mixed (Hindi-English), multi-modal, multi-domain corpus for discourse parsing in conversations, containing both audio and transcribed text annotated with nine discourse relations. Experiments show that current state-of-the-art models perform poorly on this corpus, highlighting the need for better models for such realistic settings; the key contribution is this high-quality, diverse multi-modal multi-domain corpus.
Link: https://arxiv.org/abs/2506.08504
Authors: Divyaksh Shukla, Ritesh Baviskar, Dwijesh Gohil, Aniket Tiwari, Atul Shree, Ashutosh Modi
Institutions: Indian Institute of Technology Kanpur; Convin-AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ACL Findings 2025 (16 pages: 5 pages main content + 3 pages references + 8 pages appendix)
Abstract:Discourse parsing is an important task useful for NLU applications such as summarization, machine comprehension, and emotion recognition. The current discourse parsing datasets based on conversations consists of written English dialogues restricted to a single domain. In this resource paper, we introduce CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations. The corpus (code-mixed in Hindi and English) has both audio and transcribed text and is annotated with nine discourse relations. We experiment with various SoTA baseline models; the poor performance of SoTA models highlights the challenges of multi-domain code-mixed corpus, pointing towards the need for developing better models for such realistic settings.
[NLP-54] DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
【Quick Read】: This paper addresses the difficulty models have in handling conflicting information among retrieved sources in Retrieval Augmented Generation (RAG). The key to the solution is a novel taxonomy of knowledge conflict types together with CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. The benchmark enables, for the first time, systematic evaluation of how models handle a wide range of knowledge conflicts; experiments show that while prompting models to explicitly reason about potential conflicts in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement remains.
Link: https://arxiv.org/abs/2506.08500
Authors: Arie Cattan, Alon Jacovi, Ori Ram, Jonathan Herzig, Roee Aharoni, Sasha Goldshtein, Eran Ofek, Idan Szpektor, Avi Caciularu
Institutions: Google Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains unclear how models should address such discrepancies. In this work, we first propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type. We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. CONFLICTS is the first benchmark that enables tracking progress on how models address a wide range of knowledge conflicts. We conduct extensive experiments on this benchmark, showing that LLMs often struggle to appropriately resolve conflicts between sources. While prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement in future research remains.
[NLP-55] Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework IJCAI2025
【Quick Read】: This paper addresses the heavy reliance of supervised intent detection on labeled in-domain (IND) data and its difficulty with out-of-domain (OOD) intents, which limits practical applicability. The key to the solution is a consistency-driven prototype-prompting framework for Generalized Intent Discovery (GID), approached from the perspective of integrating old and new knowledge: a prototype-prompting mechanism transfers old knowledge from external sources, while a hierarchical consistency constraint learns new knowledge from the target domain, improving generalization on unlabeled OOD data.
Link: https://arxiv.org/abs/2506.08490
Authors: Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang
Institutions: Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; Huiyan Technology (Tianjin) Co., Ltd, Tianjin, China
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, 7 tables, IJCAI 2025
Abstract:Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at this https URL.
[NLP-56] EtiCor++: Towards Understanding Etiquettical Bias in LLMs ACL
【Quick Read】: This paper addresses the insufficient cultural sensitivity of Large Language Models (LLMs), in particular their understanding of, and bias toward, regional etiquettes. The key to the solution is EtiCor++, a corpus of etiquettes from around the world, together with tasks and metrics for evaluating LLMs' knowledge of etiquettes across regions and for measuring bias. Extensive experiments reveal inherent bias toward certain regions, and the proposed systematic evaluation framework lays the groundwork for improvement.
Link: https://arxiv.org/abs/2506.08488
Authors: Ashutosh Dwivedi, Siddhant Shivdutt Singh, Ashutosh Modi
Institutions: Indian Institute of Technology Kanpur
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at ACL Findings 2025, 22 pages (9 pages main content + 4 pages references + 9 pages appendix)
Abstract:In recent years, researchers have started analyzing the cultural sensitivity of LLMs. In this respect, Etiquettes have been an active area of research. Etiquettes are region-specific and are an essential part of the culture of a region; hence, it is imperative to make LLMs sensitive to etiquettes. However, there needs to be more resources in evaluating LLMs for their understanding and bias with regard to etiquettes. In this resource paper, we introduce EtiCor++, a corpus of etiquettes worldwide. We introduce different tasks for evaluating LLMs for knowledge about etiquettes across various regions. Further, we introduce various metrics for measuring bias in LLMs. Extensive experimentation with LLMs shows inherent bias towards certain regions.
[NLP-57] Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models
【Quick Read】: This paper addresses the ethical risks of the rapid deployment of Small Language Models (SLMs) in on-device and resource-constrained settings, where systematic evaluations of fairness and utility are lacking. The key to the solution is a large-scale audit of open-source models spanning 0.5 to 5 billion parameters, using the BBQ benchmark under zero-shot prompting to analyze performance and fairness in ambiguous and disambiguated contexts, revealing the complex relationship between capability and bias as well as the nuanced trade-offs introduced by compression.
Link: https://arxiv.org/abs/2506.08487
Authors: Sumanth Manduru, Carlotta Domeniconi
Institutions: George Mason University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters-an overlooked “middle tier” between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.
[NLP-58] Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
【Quick Read】: This paper addresses the mismatch between generated images and textual prompts in text-to-image models, as well as reliability deficits in existing evaluation frameworks. The key to the solution is to identify two core properties a reliable evaluation should satisfy, and to show empirically that current mainstream evaluation frameworks fail to fully satisfy them across a diverse range of metrics and models, leading to concrete recommendations for improving image-text alignment evaluation.
Link: https://arxiv.org/abs/2506.08480
Authors: Huixuan Zhang, Xiaojun Wan
Institutions: Wangxuan Institute of Computer Technology, Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.
[NLP-59] Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-k
【Quick Read】: This paper addresses the problem of optimal external-context retrieval for open-domain question answering (QA) with large language models (LLMs) under context limitations. Existing methods such as Self-RAG and Self-Route rely on iterative LLM prompting and do well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. The key to the solution is Adaptive-k retrieval, which adaptively selects the number of passages based on the distribution of similarity scores between the query and the candidate passages, requiring no model fine-tuning, extra LLM inference, or changes to existing retriever-reader pipelines. On factoid and aggregation QA benchmarks it matches or outperforms fixed-k baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of the relevant passages.
Link: https://arxiv.org/abs/2506.08479
Authors: Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani
Institutions: University of Notre Dame; Megagon Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 26 pages, 16 tables, 5 figures
Abstract:Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain question answering (QA). However, optimal external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-k retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It does not require model fine-tuning, extra LLM inferences or changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-k matches or outperforms fixed-k baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
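One plausible single-pass rule in the spirit of Adaptive-k is to sort query-passage similarities and cut at the largest drop in the score distribution; the paper's exact rule may differ, so treat this as an illustration:

```python
# Illustrative "largest gap" cutoff over similarity scores; the paper's
# actual selection rule may differ.
import numpy as np

def adaptive_k(similarities, k_min=1, k_max=50):
    s = np.sort(np.asarray(similarities))[::-1][:k_max]
    gaps = s[:-1] - s[1:]          # drop between consecutive ranks
    k = int(np.argmax(gaps)) + 1   # cut just before the biggest drop
    return max(k_min, k)

sims = [0.91, 0.89, 0.88, 0.52, 0.50, 0.49, 0.48]
print(adaptive_k(sims))  # -> 3: a large gap separates ranks 3 and 4
```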
[NLP-60] Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning
【Quick Read】: This paper addresses the shortcomings of harmful meme detection in resource efficiency, flexibility, and explainability, which limit practical deployment in content moderation systems. The key to the solution is U-CoT+, a framework that first builds a high-fidelity meme-to-text pipeline converting visual memes into detail-preserving textual descriptions, decoupling meme interpretation from classification and avoiding direct reasoning over complex raw visual content, which makes harmful meme detection with general large language models (LLMs) resource-efficient. On top of these descriptions, targeted, interpretable human-crafted guidelines steer the model's reasoning under zero-shot chain-of-thought (CoT) prompting, enabling flexible adaptation to harmfulness criteria that differ across platforms, regions, and time.
Link: https://arxiv.org/abs/2506.08477
Authors: Fengjun Pan, Anh Tuan Luu, Xiaobao Wu
Institutions: Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Detecting harmful memes is essential for maintaining the integrity of online environments. However, current approaches often struggle with resource efficiency, flexibility, or explainability, limiting their practical deployment in content moderation systems. To address these challenges, we introduce U-CoT+, a novel framework for harmful meme detection. Instead of relying solely on prompting or fine-tuning multimodal models, we first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content and enabling resource-efficient harmful meme detection with general large language models (LLMs). Building on these textual descriptions, we further incorporate targeted, interpretable human-crafted guidelines to guide models’ reasoning under zero-shot CoT prompting. As such, this framework allows for easy adaptation to different harmfulness detection criteria across platforms, regions, and over time, offering high flexibility and explainability. Extensive experiments on seven benchmark datasets validate the effectiveness of our framework, highlighting its potential for explainable and low-resource harmful meme detection using small-scale LLMs. Codes and data are available at: this https URL.
zh
[NLP-61] A Survey on Large Language Models for Mathematical Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在数学推理能力上的局限性,旨在提升其理解和生成数学答案的能力。其解决方案的关键在于通过两个高阶认知阶段——理解阶段和答案生成阶段——来增强模型的数学推理能力,其中理解阶段涉及多样化的预训练策略以获得数学理解,而答案生成阶段则从直接预测发展为基于分步链式思维(Chain-of-Thought, CoT)的推理方法。此外,论文还探讨了多种增强数学推理的方法,包括无需训练的提示技术、微调方法以及扩展CoT和“测试时缩放”等最新研究方向。
链接: https://arxiv.org/abs/2506.08446
作者: Peng-Yuan Wang,Tian-Shuo Liu,Chenyang Wang,Yi-Di Wang,Shu Yan,Cheng-Xing Jia,Xu-Hui Liu,Xin-Wei Chen,Jia-Cheng Xu,Ziniu Li,Yang Yu
机构: Nanjing University (南京大学); Polixir.ai; Nanyang Technological University (南洋理工大学); Skywork AI (天工AI); The Chinese University of Hong Kong (香港中文大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Mathematical reasoning has long represented one of the most fundamental and challenging frontiers in artificial intelligence research. In recent years, large language models (LLMs) have achieved significant advances in this area. This survey examines the development of mathematical reasoning abilities in LLMs through two high-level cognitive phases: comprehension, where models gain mathematical understanding via diverse pretraining strategies, and answer generation, which has progressed from direct prediction to step-by-step Chain-of-Thought (CoT) reasoning. We review methods for enhancing mathematical reasoning, ranging from training-free prompting to fine-tuning approaches such as supervised fine-tuning and reinforcement learning, and discuss recent work on extended CoT and “test-time scaling”. Despite notable progress, fundamental challenges remain in terms of capacity, efficiency, and generalization. To address these issues, we highlight promising research directions, including advanced pretraining and knowledge augmentation techniques, formal reasoning frameworks, and meta-generalization through principled learning paradigms. This survey tries to provide some insights for researchers interested in enhancing reasoning capabilities of LLMs and for those seeking to apply these techniques to other domains.
zh
[NLP-62] Olica: Efficient Structured Pruning of Large Language Models without Retraining ICML2025
【速读】: 该论文旨在解决现有结构化剪枝方法在对大型语言模型(Large Language Models, LLMs)进行剪枝时需要大量计算和数据资源进行微调以恢复被破坏的相关性,从而导致成本过高的问题。其解决方案的关键在于提出一种名为正交分解与线性校准(Orthogonal decomposition and Linear Calibration, Olica)的剪枝框架,该框架通过将多头注意力(Multi-Head Attention, MHA)层中的矩阵乘积视为统一实体并应用主成分分析(PCA),提取关键信息实现模型压缩,同时保持模型精度和原始结构,从而无需重新训练。此外,针对前馈网络(Feed-Forward Network, FFN)层剪枝引起的误差累积问题,引入线性校准方法,利用最小二乘问题的奇异值分解(SVD)重构残差误差,进一步提升了剪枝效果。
链接: https://arxiv.org/abs/2506.08436
作者: Jiujun He,Huazhen Lin
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ICML 2025
点击查看摘要
Abstract:Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
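其中“线性校准”一步可以用几行NumPy勾勒:对剪枝层的输出残差做最小二乘拟合,再经SVD截断得到低秩补偿矩阵,全程不需要反向传播。以下为概念性示意(矩阵形状、秩的选取与校准数据均为假设):

```python
import numpy as np

def linear_calibration(X, residual, rank):
    """求解 min_W ||X @ W - residual||_F,再对W做SVD截断得到低秩因子。
    X: (n, d_in) 校准输入;residual: (n, d_out) 剪枝前后输出之差。"""
    W, *_ = np.linalg.lstsq(X, residual, rcond=None)   # 最小二乘解
    U, S, Vt = np.linalg.svd(W, full_matrices=False)   # SVD低秩分解
    A = U[:, :rank] * S[:rank]   # (d_in, rank)
    B = Vt[:rank]                # (rank, d_out)
    return A, B                  # 推理时用 X @ A @ B 补偿残差

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
residual = 0.1 * X @ rng.normal(size=(64, 64))
A, B = linear_calibration(X, residual, rank=8)
print(np.linalg.norm(residual - X @ A @ B))   # 低秩重构误差
```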
zh
[NLP-63] Low-resource domain adaptation while minimizing energy and hardware resource consumption
【速读】: 该论文试图解决在资源受限环境中进行领域适应(domain adaptation)时,由于训练大型语言模型(Large Language Models, LLMs)在能耗、硬件和标注数据上的高成本而导致的可访问性问题。其解决方案的关键在于评估不同数值精度和数据并行化策略对训练速度(作为能耗和硬件消耗的代理指标)及模型准确性的影响,从而在保证模型性能的同时降低计算资源的需求。
链接: https://arxiv.org/abs/2506.08433
作者: Hernán Maina,Nicolás Wolovick,Luciana Benotti
机构: FAMAF, Universidad Nacional de Córdoba; CONICET, Argentina
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: A shorter version of this work was accepted as a two-page abstract for presentation at the Widening Natural Language Processing (WiNLP) 2023 Workshop. That version was not publicly released, and this is the first public version of the work
点击查看摘要
Abstract:Training Large Language Models (LLMs) is costly in terms of energy, hardware, and annotated data, often resulting in a positionality rooted in predominant cultures and values (Santy et al., 2023). Domain adaptation has emerged as a promising strategy to better align models with diverse cultural and value contexts (Hershcovich et al., 2022), but its computational cost remains a significant barrier, particularly for research groups lacking access to large-scale infrastructure. In this paper, we evaluate how the use of different numerical precisions and data parallelization strategies impacts both training speed (as a proxy to energy and hardware consumption) and model accuracy, with the goal of facilitating domain adaptation in low-resource environments. Our findings are relevant to any setting where energy efficiency, accessibility, or limited hardware availability are key concerns.
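作为背景,下面给出一个PyTorch混合精度训练的最小片段,示意论文所考察的“数值精度”这一变量在实践中如何切换(与论文的具体实验设置无关,需在支持CUDA的环境下运行):

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # fp16下缩放loss,防止梯度下溢

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):  # 可换成bfloat16对比
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```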
zh
[NLP-64] CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models ICML2025
【速读】: 该论文旨在解决现有大型语言模型(Large Language Model, LLM)在讽刺检测中面临的挑战,包括单一视角限制、综合理解不足以及可解释性缺失。其解决方案的关键在于提出一种基于LLM的多智能体系统——讽刺协作代理框架(Collaborative Agent Framework for Irony, CAF-I),该框架通过Context、Semantics和Rhetoric三个专业代理进行多维分析,并通过交互式协同优化实现多角度信息融合,最终由Decision Agent整合结果,Refinement Evaluator Agent提供条件反馈以优化决策,从而提升检测准确性和可解释性。
链接: https://arxiv.org/abs/2506.08430
作者: Ziqi Liu,Ziyang Zhou,Mingxuan Hu
机构: 未知
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: ICML 2025 Workshop on Collaborative and Federated Agentic Workflows
点击查看摘要
Abstract:Large language models (LLMs) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.
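CAF-I“多代理各自分析、决策代理汇总”的流程可以用如下草图表达(`llm.complete` 为假设接口,各代理的角色提示为示意,非论文原文):

```python
ROLES = {
    "Context":   "Analyze the contextual cues of the text for irony.",
    "Semantics": "Contrast the literal meaning with the intended meaning.",
    "Rhetoric":  "Look for rhetorical devices such as hyperbole or contrast.",
}

def caf_i(text, llm):
    """三个专业代理先做多维分析,再由决策代理汇总给出判断。"""
    views = {name: llm.complete(f"{role}\nText: {text}\nAnalysis:")
             for name, role in ROLES.items()}
    summary = "\n".join(f"[{k}] {v}" for k, v in views.items())
    return llm.complete(
        "Given the three analyses below, decide whether the text is ironic "
        f"(answer 'ironic' or 'not ironic').\n{summary}")
```

论文中还包含 Refinement Evaluator Agent 提供的条件反馈回路,此处为保持简短未画出。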
zh
[NLP-65] Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models
【速读】: 该论文试图解决当前解释方法在输入数据格式和解释输出上存在差异,导致集成这些方法的工具仅能支持特定输入任务,从而显著限制实际应用的问题。解决方案的关键在于提出一个开源的知识机制揭示与解释工具Know-MRI(Knowledge Mechanisms Revealer&Interpreter),其核心模块能够自动匹配不同输入数据与解释方法,并整合解释输出,使用户可根据输入自由选择合适的解释方法,从而从多角度全面诊断模型的内部知识机制。
链接: https://arxiv.org/abs/2506.08427
作者: Jiaxiang Liu,Boxuan Xing,Chenhao Yuan,Chenxiang Zhang,Di Wu,Xiusheng Huang,Haida Yu,Chuhan Lang,Pengfei Cao,Jun Zhao,Kang Liu
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpreting outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpreting outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model’s internal knowledge mechanisms from multiple perspectives. Our code is available at this https URL. We also provide a demonstration video on this https URL.
zh
[NLP-66] Large Language Models Have Intrinsic Meta-Cognition but Need a Good Lens
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在元认知能力(meta-cognition)评估方面的不足,特别是其对推理过程中步骤错误的自我意识和评估能力。现有研究多关注模型的认知错误检测能力,但缺乏对元认知能力的深入分析与改进。论文提出了一种自动化元认知评估框架AutoMeco,用于基准测试现有的元认知评估方法,并引入了一种无需训练的马尔可夫内在奖励调整策略MIRA,以提升元认知评估的效果。解决方案的关键在于通过AutoMeco框架系统性地评估元认知能力,并利用MIRA策略增强模型对自身推理过程的自我评估能力。
链接: https://arxiv.org/abs/2506.08410
作者: Ziyang Ma,Qingyue Yuan,Zhenglin Wang,Deyu Zhou
机构: Southeast University (东南大学); Nanjing Medical University (南京医科大学)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.
zh
[NLP-67] TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration
【速读】: 该论文旨在解决如何充分挖掘大规模语言模型(LLMs)在机器翻译中的潜在能力,特别是在现有多智能体翻译框架中忽视认知翻译研究基础洞察的问题。其解决方案的关键在于提出一种基于认知理论的多智能体框架TACTIC,该框架包含六个功能各异的智能体,分别对应人类翻译行为中的关键认知过程,如初稿生成、内容优化、评估、评分、上下文推理和外部知识获取,通过模拟交互式且理论支撑的翻译流程,有效提升翻译质量。
链接: https://arxiv.org/abs/2506.08403
作者: Weiya Li,Junjie Chen,Bei Li,Boyang Liu,Zichen Wen,Nuanqiao Shan,Xiaoqian Liu,Anping Liu,Huajie Liu,Youyan Wang,Wujiuge Yin,Hu Song,Bing Huang,Zhiyuan Xia,Jialiang Chen,Linfeng Zhang
机构: Big Data&AI Lab, ICBC (大数据与人工智能实验室,工商银行); Shanghai Jiao Tong University (上海交通大学); Meituan Inc. (美团公司); Tongji University (同济大学); Fudan University (复旦大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures, Under review. Code: this https URL
点击查看摘要
Abstract:Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at this https URL.
zh
[NLP-68] mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
【速读】: 该论文试图解决低资源语言在大型语言模型(Large Language Models, LLMs)评估中的不足问题,特别是针对非洲及美洲/大洋洲等地区的语言缺乏标准化的评估基准。其解决方案的关键是引入mSTEB,一个在语音和文本两种模态上涵盖语言识别、文本分类、问答和翻译任务的评估基准,旨在全面评估LLMs在不同语言上的性能。
链接: https://arxiv.org/abs/2506.08400
作者: Luel Hagos Beyene,Vivek Verma,Min Ma,Jesujoba O. Alabi,Fabian David Schmidt,Joyce Nakatumba-Nabende,David Ifeoluwa Adelani
机构: AIMS RIC Rwanda; Université de Montréal; Mila - Quebec AI Institute; Google DeepMind; Saarland University; University of Würzburg; Makerere University; McGill University; Canada CIFAR AI Chair
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: working paper
点击查看摘要
Abstract:Large Language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLMs coverage.
zh
[NLP-69] Reinforcement Learning Teachers of Test Time Scaling
【速读】: 该论文试图解决在训练推理语言模型(Reasoning Language Models, LMs)时,依赖强化学习(Reinforcement Learning, RL)进行一对一正确性训练所面临的探索挑战问题。其解决方案的关键在于引入一种新的框架,即强化学习教师(Reinforcement-Learned Teachers, RLTs),这些教师通过接收问题和对应的解答,并以详尽的解释“连接线索”来指导学生,从而实现高效的下游知识蒸馏。RLTs通过将解释输入学生并测试其对问题解决方案的理解来获得密集奖励,从而避免了传统RL方法中的探索难题。
链接: https://arxiv.org/abs/2506.08388
作者: Edoardo Cetin,Tianyu Zhao,Yujin Tang
机构: Sakana AI(サカナAI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL’s exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply “connect-the-dots” with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem’s solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
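RLT 的密集奖励思想可以概括为:学生在读过讲解后,对标准解答的对数似然越高,说明讲解越有效。以下为概念性示意,假设 student 是 HuggingFace 风格的因果语言模型,具体的奖励构造以论文为准:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rlt_reward(student, tokenizer, question, explanation, solution):
    """把教师生成的讲解喂给学生,以学生对solution的平均对数似然作奖励。"""
    prompt = f"Question: {question}\nExplanation: {explanation}\nSolution:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(" " + solution, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = student(input_ids).logits
    # 位置 P+i 处的token由位置 P+i-1 的logits预测,故取 [P-1, P+T-1) 区间
    sol_logits = logits[0, prompt_ids.size(1) - 1:-1]
    logp = F.log_softmax(sol_logits, dim=-1)
    token_logp = logp.gather(1, target_ids[0].unsqueeze(1)).squeeze(1)
    return token_logp.mean().item()   # 越高说明学生越“听懂”了讲解
```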
zh
[NLP-70] Reinforce LLM Reasoning through Multi-Agent Reflection ICML
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理能力提升过程中,由于反馈空间受限和多方协同训练不足导致的性能欠优问题。其解决方案的关键在于将多轮优化过程建模为马尔可夫决策过程,并引入DPSDP(Direct Policy Search by Dynamic Programming)算法,通过强化学习训练一个策略-评论家框架的LLM系统,利用自生成数据进行直接偏好学习,从而迭代优化答案。
链接: https://arxiv.org/abs/2506.08379
作者: Yurun Yuan,Tengyang Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: International Conference on Machine Learning (ICML), 2025
点击查看摘要
Abstract:Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization.
zh
[NLP-71] EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models
【速读】: 该论文试图解决现有基准测试在单任务环境中缺乏复杂性,无法充分反映现实世界场景的问题,从而难以评估大型语言模型(Large Language Models, LLMs)处理复杂用户需求的能力。解决方案的关键在于提出一个名为 Extremely Complex Instruction Following Benchmark (EIFBENCH) 的基准测试,该基准不仅包含多任务场景以实现对多种任务类型的综合评估,还整合了多种约束条件,以模拟复杂的操作环境。此外,论文还提出了 Segment Policy Optimization (SegPO) 算法,以提升模型准确执行多任务工作流的能力。
链接: https://arxiv.org/abs/2506.08375
作者: Tao Zou,Xinghua Zhang,Haiyang Yu,Minzheng Wang,Fei Huang,Yongbin Li
机构: Tongyi Lab (通义实验室); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: 24 pages
点击查看摘要
Abstract:With the development and widespread application of large language models (LLMs), the new paradigm of “Model as Product” is rapidly evolving, and demands higher capabilities to address complex user needs, often requiring precise workflow execution which involves the accurate understanding of multiple tasks. However, existing benchmarks focusing on single-task environments with limited constraints lack the complexity required to fully reflect real-world scenarios. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM’s ability to accurately fulfill multi-task workflow. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity for ongoing optimization to navigate the intricate challenges posed by LLM applications.
zh
[NLP-72] Draft-based Approximate Inference for LLMs
【速读】: 该论文旨在解决长上下文大语言模型(Large Language Models, LLMs)推理优化问题,其核心挑战在于Transformer架构的二次计算复杂度和线性内存复杂度。现有近似方法(如键值缓存丢弃、稀疏注意力和提示压缩)通常依赖于对token或键值对重要性的粗略预测,难以实现高效且准确的近似推理。该论文提出的解决方案的关键在于引入小型草稿模型(draft model),以更精确地预测token和键值对的重要性,从而提升近似推理的准确性。具体而言,论文提出了两种实例化方法:SpecKV通过草稿模型输出评估每个键值对的重要性以实现更有效的键值缓存丢弃,而SpecPC则利用草稿模型的注意力激活来识别并丢弃不重要的提示token。这是首次将草稿模型应用于近似LLM推理加速,扩展了其在传统无损推测解码之外的应用场景。
链接: https://arxiv.org/abs/2506.08373
作者: Kevin Galim,Ethan Ewer,Wonjun Kang,Minjae Lee,Hyung Il Koo,Kangwook Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model’s attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at this https URL.
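以 SpecKV 为例,“用草稿模型的注意力为KV对打分、只保留高分位置”的思路可示意如下(打分与保留策略均为假设性简化):

```python
import torch

def speckv_keep_mask(draft_attn, keep_ratio=0.25):
    """draft_attn: 草稿模型的注意力权重,形状 (num_heads, q_len, kv_len)。
    将各头、各查询位置的注意力累加作为每个KV位置的重要性,保留top部分。"""
    importance = draft_attn.sum(dim=(0, 1))        # 每个KV位置的累计注意力
    k = max(1, int(importance.numel() * keep_ratio))
    keep = torch.zeros_like(importance, dtype=torch.bool)
    keep[importance.topk(k).indices] = True
    return keep   # True的位置保留在KV缓存中,其余丢弃

mask = speckv_keep_mask(torch.rand(8, 16, 128))
print(f"{int(mask.sum())} / {mask.numel()} KV positions kept")
```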
zh
[NLP-73] Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长上下文处理中性能退化的问题;现有解决方案训练成本高昂,而相关统计行为与低成本方法仍缺乏系统研究。其关键解决方案是提出一种无需训练的位置对比解码(Positional Contrastive Decoding, PCD),通过对比长程感知注意力与设计的局部感知注意力生成的logits,使模型聚焦于大规模短到长训练带来的增益,从而有效缓解注意力分数退化问题。
链接: https://arxiv.org/abs/2506.08371
作者: Zikai Xiao,Ziyang Wang,Wen Ma,Yan Zhang,Wei Shen,Yan Wang,Luqi Gong,Zuozhu Liu
机构: Zhejiang University (浙江大学); University of Science and Technology of China (中国科学技术大学); Bytedance (字节跳动); Zhejiang Lab (浙江省实验室); Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence (浙江省医学影像人工智能重点实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by it, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.
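PCD 的对比式解码本质上是一次logits变换。以下为概念性示意(对比形式与 alpha 取值均为假设,具体公式以论文为准):

```python
import torch

def positional_contrastive_decoding(logits_long, logits_local, alpha=1.0):
    """对比“长程感知”与“局部感知”两种注意力得到的logits,
    放大前者相对后者的增益后再归一化,无需任何训练。"""
    return torch.log_softmax(
        logits_long + alpha * (logits_long - logits_local), dim=-1)

# 用法示意:同一解码位置用两种注意力各前向一次,取对比后的分布
logits_long, logits_local = torch.randn(32000), torch.randn(32000)
next_token = positional_contrastive_decoding(logits_long, logits_local).argmax()
```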
zh
[NLP-74] CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在特定领域中难以理解因果关系的问题,尤其是在需要深层次推理而非表面相关性的场景下。标准的检索增强生成(Retrieval-Augmented Generation, RAG)方法将证据视为扁平上下文,缺乏建模真实因果依赖关系所需的结构。论文提出的解决方案是Causal-Chain RAG(CC-RAG),其关键在于将零样本三元组提取与主题感知图链结合到RAG流程中,从而实现结构化的多跳推理。通过构建由(原因, 关系, 结果)三元组组成的有向无环图(Directed Acyclic Graph, DAG),并利用前向/后向链式推理引导结构化答案生成,CC-RAG在两个实际领域(比特币价格波动和戈谢病)的实验中表现出更高的链相似性、信息密度和词汇多样性。
链接: https://arxiv.org/abs/2506.08364
作者: Jash Rajesh Parekh,Pengcheng Jiang,Jiawei Han
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Understanding cause and effect relationships remains a formidable challenge for Large Language Models (LLMs), particularly in specialized domains where reasoning requires more than surface-level correlations. Retrieval-Augmented Generation (RAG) improves factual accuracy, but standard RAG pipelines treat evidence as flat context, lacking the structure required to model true causal dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that integrates zero-shot triple extraction and theme-aware graph chaining into the RAG pipeline, enabling structured multi-hop inference. Given a domain specific corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of (cause, relation, effect) triples and uses forward/backward chaining to guide structured answer generation. Experiments on two real-world domains: Bitcoin price fluctuations and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity. Both LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results demonstrate that explicitly modeling causal structure enables LLMs to generate more accurate and interpretable responses, especially in specialized domains where flat retrieval fails.
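CC-RAG 的“三元组建图 + 前向链式枚举”可以用纯Python勾勒(图结构与链式策略为简化示意,示例三元组为虚构):

```python
from collections import defaultdict

def build_dag(triples):
    """把 (cause, relation, effect) 三元组组织成邻接表形式的有向图。"""
    graph = defaultdict(list)
    for cause, relation, effect in triples:
        graph[cause].append((relation, effect))
    return graph

def forward_chain(graph, start, max_depth=3):
    """前向链式推理:从start出发枚举因果链,供答案生成时引用。"""
    chains, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if len(path) // 2 >= max_depth:        # path交替存放节点与关系
            continue
        for relation, effect in graph.get(node, []):
            if effect in path:                 # 回避环路(DAG假设的兜底)
                continue
            new_path = path + [f"--{relation}-->", effect]
            chains.append(new_path)
            stack.append((effect, new_path))
    return chains

triples = [("rate hike", "reduces", "liquidity"),
           ("liquidity", "drives", "bitcoin price")]
for chain in forward_chain(build_dag(triples), "rate hike"):
    print(" ".join(chain))
```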
zh
[NLP-75] DEAL: Disentangling Transformer Head Activations for LLM Steering
【速读】: 该论文试图解决在不修改大型语言模型(Large Language Models, LLMs)底层参数的情况下,通过推理阶段的引导(inference-time steering)改变模型响应特性的问题。当前模块选择方法通常依赖于表面线索或经验性启发式方法,可能导致次优或意外的结果。解决方案的关键在于提出一种基于因果归因(causal-attribution)的框架,用于识别变压器模型中与目标行为相关的注意力头(attention heads)。该框架通过对每个注意力头的注意力激活训练向量量化自编码器(VQ-AE),将潜在空间划分为与行为相关和无关的子空间,并通过二分类指标评估行为对齐与行为违背响应的可分性,从而获得行为相关性得分,指导注意力头的选择与重要性加权。
链接: https://arxiv.org/abs/2506.08359
作者: Li-Ming Zhan,Bo Liu,Zexin Lu,Chengqiang Xie,Jiannong Cao,Xiao-Ming Wu
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Inference-time steering aims to alter the response characteristics of large language models (LLMs) without modifying their underlying parameters. A critical step in this process is the identification of internal modules within LLMs that are associated with the target behavior. However, current approaches to module selection often depend on superficial cues or ad-hoc heuristics, which can result in suboptimal or unintended outcomes. In this work, we propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces, each quantized with a shared learnable codebook. We assess the behavioral relevance of each head by quantifying the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses using a binary classification metric. This yields a behavioral relevance score that reflects each head discriminative capacity with respect to the target behavior, guiding both selection and importance weighting. Experiments on seven LLMs from two model families and five behavioral steering datasets demonstrate that our method enables more accurate inference-time interventions, achieving superior performance on the truthfulness-steering task. Furthermore, the heads selected by our approach exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
zh
[NLP-76] Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning
【速读】: 该论文试图解决当前文本嵌入(text embedding)研究中过于关注表层语义(surface meaning)而忽视隐含语义(implicit semantics)的问题。现有嵌入模型在训练数据和评估基准上缺乏对语用、说话者意图及社会文化背景的深度考量,导致其在需要解释性推理、说话者立场或社会意义的任务中表现不佳。解决方案的关键在于推动范式转变:优先采用更多样化且语言学基础坚实的训练数据,设计评估深层语义理解的基准,并将隐含意义明确作为核心建模目标,从而更好地匹配现实语言的复杂性。
链接: https://arxiv.org/abs/2506.08354
作者: Yiqun Sun,Qiang Huang,Anthony K. H. Tung,Jun Yu
机构: National University of Singapore (新加坡国立大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:This position paper argues that the text embedding research community should move beyond surface meaning and embrace implicit semantics as a central modeling goal. Text embedding models have become foundational in modern NLP, powering a wide range of applications and drawing increasing research attention. Yet, much of this progress remains narrowly focused on surface-level semantics. In contrast, linguistic theory emphasizes that meaning is often implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current embedding models are typically trained on data that lacks such depth and evaluated on benchmarks that reward the capture of surface meaning. As a result, they struggle with tasks requiring interpretive reasoning, speaker stance, or social meaning. Our pilot study highlights this gap, showing that even state-of-the-art models perform only marginally better than simplistic baselines on implicit semantics tasks. To address this, we call for a paradigm shift: embedding research should prioritize more diverse and linguistically grounded training data, design benchmarks that evaluate deeper semantic understanding, and explicitly frame implicit meaning as a core modeling objective, better aligning embeddings with real-world language complexity.
zh
[NLP-77] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models
【速读】: 该论文试图解决无分类器引导(classifier-free guidance)在文本到视觉生成扩散模型中计算成本过高的问题。其关键解决方案是提出一种名为 Step AG 的简单且通用的自适应引导策略,通过将无分类器引导限制在去噪过程的前几步,能够在保持图像质量和图像-文本对齐度的前提下,实现平均 20% 至 30% 的加速效果。
链接: https://arxiv.org/abs/2506.08351
作者: Huixuan Zhang,Junzhe Zhang,Xiaojun Wan
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:With the rapid development of text-to-vision generation diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many steps for model forwarding compared to unconditional generation, resulting in significantly higher costs. While previous study has introduced the concept of adaptive guidance, it lacks solid analysis and empirical results, making previous method unable to be applied to general diffusion models. In this work, we present another perspective of applying adaptive guidance and propose Step AG, which is a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment. whose results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. Such improvement is consistent across different settings such as inference steps, and various models including video generation models, highlighting the superiority of our method.
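Step AG 对采样器的改动很小:只在去噪前期保留CFG的双前向,其余步骤退化为单次条件前向。以下为概念性示意(`model(x, t, cond)` 的接口与 `ag_fraction` 的取值均为假设):

```python
def guided_eps(model, x_t, t, cond, step_idx, num_steps,
               guidance_scale=7.5, ag_fraction=0.3):
    """前 ag_fraction 比例的去噪步执行CFG(两次前向),
    其余步骤只做条件前向,由此节省约20%-30%的总前向计算。"""
    if step_idx < int(num_steps * ag_fraction):
        eps_cond = model(x_t, t, cond)
        eps_uncond = model(x_t, t, None)
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return model(x_t, t, cond)   # 后期不再做无分类器引导
```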
zh
[NLP-78] Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医学领域中不同认知层次能力评估不足的问题。其解决方案的关键是提出一种基于布鲁姆分类法(Bloom’s Taxonomy)的多认知层次评估框架,该框架整合了现有的医学数据集,并设计了针对初步知识掌握、综合知识应用和基于情境的问题解决三个认知层次的任务,以系统评估LLMs在医学领域的表现。
链接: https://arxiv.org/abs/2506.08349
作者: Yuxuan Zhou,Xien Liu,Chenwei Yan,Chen Ning,Xiao Zhang,Boxun Li,Xiangling Fu,Shijin Wang,Guoping Hu,Yu Wang,Ji Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures. Accepted by ICML 2025
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom’s Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs’ medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.
zh
[NLP-79] SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models IJCNN2025
【速读】: 该论文旨在解决语音识别系统在深度语音分类任务中面临的后门攻击问题,此类攻击通过污染训练数据使模型变得脆弱。论文提出的解决方案的关键在于利用语音大语言模型(Speech Large Language Model, SLLM)生成多样化的触发器,特别是针对语音元素如音色和情感进行策略性攻击,从而增加触发器的数量和多样性。这一方法能够显著提升攻击效果,同时引入多重梯度下降算法(Multiple Gradient Descent Algorithm, MGDA)作为缓解策略,以应对因触发器数量增加导致的中毒率上升和攻击成本增高的问题。
链接: https://arxiv.org/abs/2506.08346
作者: Wenhan Yao,Fen Xiao,Xiarun Chen,Jia Liu,YongQiang He,Weiping Wen
机构: Xiangtan University(湘潭大学); Peking University(北京大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by IJCNN 2025
点击查看摘要
Abstract:Deep speech classification tasks, including keyword spotting and speaker verification, are vital in speech-based human-computer interaction. Recently, the security of these technologies has been revealed to be susceptible to backdoor attacks. Specifically, attackers use noisy disruption triggers and speech element triggers to produce poisoned speech samples that train models to become vulnerable. However, these methods typically create only a limited number of backdoors due to the inherent constraints of the trigger function. In this paper, we propose that speech backdoor attacks can strategically focus on speech elements such as timbre and emotion, leveraging the Speech Large Language Model (SLLM) to generate diverse triggers. Increasing the number of triggers may disproportionately elevate the poisoning rate, resulting in higher attack costs and a lower success rate per trigger. We introduce the Multiple Gradient Descent Algorithm (MGDA) as a mitigation strategy to address this challenge. The proposed attack is called the Speech Prompt Backdoor Attack (SPBA). Building on this foundation, we conducted attack experiments on two speech classification tasks, demonstrating that SPBA shows significant trigger effectiveness and achieves exceptional performance in attack metrics.
zh
[NLP-80] Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
【速读】: 该论文试图解决大型推理模型在复杂推理过程中产生的过度思考问题,其冗长重复的输出降低了推理效率。解决方案的关键在于提出NoWait方法,通过在推理过程中抑制“Wait”“Hmm”等显式自我反思标记,在不损害模型实用性的前提下有效缩短思维链轨迹长度。
链接: https://arxiv.org/abs/2506.08343
作者: Chenlong Wang,Yuanning Feng,Dongping Chen,Zhaoyang Chu,Ranjay Krishna,Tianyi Zhou
机构: University of Maryland(马里兰大学); University College London(伦敦大学学院); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
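NoWait 在工程上可落地为一个与 transformers 兼容的logits处理器:把自我反思标记的logit置为负无穷。以下为简化示意(被禁词表与多token词的处理为假设性简化):

```python
import torch

class NoWaitLogitsProcessor:
    """推理时屏蔽"Wait"/"Hmm"等自我反思标记,抑制显式自我反思。
    与 transformers 的 LogitsProcessor 调用约定 (input_ids, scores) 兼容。"""
    def __init__(self, tokenizer, banned_words=("Wait", " Wait", "Hmm", " Hmm")):
        ids = set()
        for word in banned_words:
            # 简化处理:禁掉组成这些词的全部子词id,实际可只禁首个子词
            ids.update(tokenizer.encode(word, add_special_tokens=False))
        self.banned_ids = sorted(ids)

    def __call__(self, input_ids, scores):
        scores[:, self.banned_ids] = float("-inf")
        return scores
```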
zh
[NLP-81] Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections refined for accuracy and usability
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在训练和推理过程中面临的数据集质量、多样性及可追溯性不足的问题。解决方案的关键在于构建一个具有明确来源链和广泛历史文本覆盖的高质量公共领域数据集,即Institutional Books 1.0,该数据集通过哈佛图书馆参与谷歌图书项目所数字化的书籍提取并处理而成,涵盖了约2500亿个token的多语言文本,旨在提升数据集的可持续性和可用性。
链接: https://arxiv.org/abs/2506.08300
作者: Matteo Cargnelutti,Catherine Brobston,John Hess,Jack Cushman,Kristi Mukk,Aristana Scourtas,Kyle Courtney,Greg Leppert,Amanda Watson,Martha Whitehead,Jonathan Zittrain
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library’s participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library’s collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project’s goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.
zh
[NLP-82] From Passive to Active Reasoning : Can Large Language Models Ask the Right Questions under Incomplete Information? ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在主动推理(active reasoning)能力上的不足问题,即当前主流基准测试主要评估模型的被动推理(passive reasoning)能力,而忽视了模型在与外部系统交互以获取缺失信息或数据时的表现。解决方案的关键在于提出AR-Bench,这是一个专门设计用于评估LLM主动推理技能的新基准,包含侦探案件、情境谜题和猜数字三类任务,旨在模拟真实世界的代理场景,并衡量模型在常识性、逻辑性和符号推理方面的表现。
链接: https://arxiv.org/abs/2506.08295
作者: Zhanke Zhou,Xiao Feng,Zhaocheng Zhu,Jiangchao Yao,Sanmi Koyejo,Bo Han
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted by ICML 2025
点击查看摘要
Abstract:While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning-where an LLM must interact with external systems to acquire missing evidence or data-has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM’s active reasoning skills. AR-Bench comprises three task families-detective cases, situation puzzles, and guessing numbers-that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: this https URL.
zh
[NLP-83] From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium ICML2025
【速读】: 该论文旨在解决多大语言模型(Large Language Models, LLMs)协作框架中计算成本高且缺乏收敛性保证的问题。其解决方案的关键在于将多LLM协作建模为一个不完全信息博弈,并寻求贝叶斯纳什均衡(Bayesian Nash Equilibrium, BNE),通过引入一种名为ECON的分层强化学习范式,实现分布式推理与集中式最终输出的结合,从而在无需高昂的代理间通信成本的情况下,使每个LLM基于对其他代理策略的概率信念独立选择最大化预期奖励的响应。
链接: https://arxiv.org/abs/2506.08292
作者: Xie Yi,Zhanke Zhou,Chentao Cao,Qiyu Niu,Tongliang Liu,Bo Han
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted by ICML 2025
点击查看摘要
Abstract:Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON’s ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: this https URL.
zh
[NLP-84] Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints
【速读】: 该论文旨在解决语言模型对齐过程中安全性和有用性之间的权衡问题,这一权衡可能导致在敏感领域产生不可接受的响应。其解决方案的关键在于提出一种高置信度安全强化学习从人类反馈(High-Confidence Safe Reinforcement Learning from Human Feedback, HC-RLHF)方法,该方法通过将人类偏好显式解耦为有用性和无害性(安全性),并分别训练奖励模型和成本模型,随后采用两步流程优化奖励函数并在安全测试中验证模型性能是否满足置信区间内的成本约束,从而在保证高置信度安全性的前提下最大化有用性。
链接: https://arxiv.org/abs/2506.08266
作者: Yaswanth Chittepu,Blossom Metevier,Will Schwarzer,Austin Hoag,Scott Niekum,Philip S. Thomas
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Sony AI (索尼人工智能)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
备注: 20 pages, 6 figures, 4 tables, Second Reinforcement Learning Conference (RLC 2025)
点击查看摘要
Abstract:Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable responses in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences into helpfulness and harmlessness (safety), which are learned by training a reward model and a cost model, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function under an intentionally pessimistic version of the cost constraint. In the second step, the trained model undergoes a safety test to verify whether its performance stays within an upper-confidence bound of the actual cost constraint. We provide a theoretical analysis of HC-RLHF, including proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability and can improve harmlessness and helpfulness compared to previous methods.
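HC-RLHF 第二步的“安全测试”可以用置信上界检验来示意:只有当平均成本的上置信界低于阈值时才放行模型。以下用 Hoeffding 界给出概念性草图(论文实际采用的置信界形式可能不同):

```python
import math

def passes_safety_test(costs, threshold, delta=0.05, cost_range=1.0):
    """以概率至少 1-delta 保证真实平均成本不超过阈值才返回True。
    costs: 安全评估集上逐样本的成本,取值范围宽度为 cost_range。"""
    n = len(costs)
    mean_cost = sum(costs) / n
    ucb = mean_cost + cost_range * math.sqrt(math.log(1 / delta) / (2 * n))
    return ucb <= threshold   # 不通过则拒绝部署,宁可保守

# 用法示意:样本均值约0.048,但上置信界约0.17,未能以高置信度通过阈值0.1
print(passes_safety_test([0.02] * 90 + [0.3] * 10, threshold=0.1))  # False
```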
zh
[NLP-85] Automatic Generation of Inference Making Questions for Reading Comprehension Assessments ACL2025
【速读】: 该论文试图解决阅读理解(Reading Comprehension, RC)中推理能力评估的复杂性问题,特别是如何有效生成符合诊断需求的RC题目以支持教育者提供针对性的教学干预。其解决方案的关键在于利用生成式AI(Generative AI)模型GPT-4o通过少量示例提示(few-shot prompting)生成桥梁推理(bridging-inference)类型的RC题目,并结合链式思维(chain-of-thought)提示以提升生成题目的质量与推理准确性。研究结果表明,尽管生成题目整体质量较高,但准确匹配目标推理类型的比例仍需通过人工判断进行优化。
链接: https://arxiv.org/abs/2506.08260
作者: Wanjing Anya Ma,Michael Flor,Zuowei Wang
机构: Stanford University (斯坦福大学); ETS Research Institute (ETS研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), co-located with the ACL 2025
点击查看摘要
Abstract:Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.
zh
[NLP-86] RADAR: Benchmarking Language Models on Imperfect Tabular Data
【速读】: 该论文试图解决语言模型(Language Models, LMs)在处理真实世界表格数据时缺乏数据意识的问题,即模型无法有效识别、推理并适当处理数据中的缺失值、异常值和逻辑不一致等数据缺陷。解决方案的关键在于提出RADAR基准,通过程序化扰动生成数据缺陷,从而系统性地评估模型在表格数据上的数据感知推理能力。该基准包含2980个表格查询对,覆盖9个领域和5种数据缺陷类型,并支持不同规模表格的测试,以研究模型在面对更大数据量时的推理表现。
链接: https://arxiv.org/abs/2506.08249
作者: Ken Gu,Zhihan Zhang,Kate Lin,Yuwei Zhang,Akshay Paruchuri,Hong Yu,Mehran Kazemi,Kumar Ayush,A. Ali Heydari,Maxwell A. Xu,Girish Narayanswamy,Yun Liu,Ming-Zher Poh,Yuzhe Yang,Mark Malhotra,Shwetak Patel,Hamid Palangi,Xuhai Xu,Daniel McDuff,Tim Althoff,Xin Liu
机构: 未知
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness – the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies – remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
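RADAR 的“程序化扰动”可以用 pandas 勾勒:向干净表格注入缺失值与异常值,再考察模型能否察觉。以下为最小示意(扰动类型、比例与示例数据均为假设):

```python
import numpy as np
import pandas as pd

def inject_artifacts(df, column, missing_frac=0.1, outlier_frac=0.05, seed=0):
    """向数值列注入两类数据缺陷:缺失值与放大100倍的异常值。"""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[column] = out[column].astype(float)       # 允许写入NaN
    n = len(out)
    miss = rng.choice(n, int(n * missing_frac), replace=False)
    out.loc[out.index[miss], column] = np.nan     # 缺失值
    outl = rng.choice(n, int(n * outlier_frac), replace=False)
    out.loc[out.index[outl], column] = out[column].mean() * 100  # 异常值
    return out

df = pd.DataFrame({"age": np.random.default_rng(1).integers(18, 65, 200)})
print(inject_artifacts(df, "age").describe())
```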
zh
[NLP-87] Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim → Evidence Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂科研论文中科学主张与证据之间逻辑关系的能力不足的问题,特别是其在科学论点-证据抽取与验证任务中的表现。解决方案的关键在于提出CLAIM-BENCH基准,用于系统评估LLMs在科学主张-证据关联任务中的能力,并通过分而治之的策略设计三种不同的提示方法,以提升模型对分散证据与主张之间准确关联的能力。
链接: https://arxiv.org/abs/2506.08235
作者: Shashidhar Reddy Javaji,Yupeng Cao,Haohang Li,Yangyang Yu,Nikhil Muralidhar,Zining Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures, Under review
点击查看摘要
Abstract:Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’ capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches which are inspired by divide and conquer approaches, across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs’ ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs’ abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.
zh
[NLP-88] Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
【速读】: 该论文试图解决复杂AI工作流中复合AI系统(compound AI system)的优化问题,特别是在系统组件及其交互日益复杂的情况下,如何有效提升系统性能。解决方案的关键在于整合数值优化方法与基于自然语言反馈的优化技术,通过系统性地分析和分类现有方法,探索非可微系统的优化路径,并提出未来研究方向。
链接: https://arxiv.org/abs/2506.08234
作者: Yu-Ang Lee,Guan-Ting Yi,Mei-Yi Liu,Jui-Chao Lu,Guan-Bo Yang,Yun-Nung Chen
机构: National Taiwan University (国立台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, 1 table
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at this https URL.
zh
[NLP-89] “I Wrote, I Paused, I Rewrote” Teaching LLMs to Read Between the Lines of Student Writing
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在支持学生写作时,其反馈仅基于最终作文内容而缺乏对写作过程上下文信息的了解的问题。解决方案的关键在于利用通过按键记录和周期性快照收集的写作过程数据,使LLMs能够更全面地理解学习者的思维和修改过程,从而生成更具针对性和个性化的反馈。
链接: https://arxiv.org/abs/2506.08221
作者: Samra Zafar,Shaheer Minhas,Syed Ali Hassan Zaidi,Arfa Naeem,Zahra Ali
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 6 figures, 2 tables
点击查看摘要
Abstract:Large language models (LLMs) like Gemini are becoming common tools for supporting student writing. But most of their feedback is based only on the final essay, missing important context about how that text was written. In this paper, we explore whether using writing process data, collected through keystroke logging and periodic snapshots, can help LLMs give feedback that better reflects how learners think and revise while writing. We built a digital writing tool that captures both what students type and how their essays evolve over time. Twenty students used this tool to write timed essays, which were then evaluated in two ways: (i) LLM-generated feedback using both the final essay and the full writing trace, and (ii) after the task, students completed surveys about how useful and relatable they found the feedback. Early results show that learners preferred the process-aware LLM feedback, finding it more in tune with their own thinking. We also found that certain types of edits, like adding new content or reorganizing paragraphs, aligned closely with higher scores in areas like coherence and elaboration. Our findings suggest that making LLMs more aware of the writing process can lead to feedback that feels more meaningful, personal, and supportive.
zh
[NLP-90] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation CVPR2025
【速读】: 该论文试图解决当前文本到图像生成模型中仍使用较为过时的T5和CLIP作为文本编码器的问题,旨在探索现代仅解码器(decoder-only)大型语言模型作为文本编码器在文本到图像扩散模型中的有效性。解决方案的关键在于构建一个标准化的训练与评估流程,以隔离并评估不同文本嵌入的效果,并通过分析不同层的嵌入提取方式、LLM变体及模型规模,发现采用全层归一化平均嵌入能够显著提升与复杂提示的对齐效果,优于传统的最后一层嵌入条件设置。
链接: https://arxiv.org/abs/2506.08210
作者: Andrew Z. Wang,Songwei Ge,Tero Karras,Ming-Yu Liu,Yogesh Balaji
机构: University of Maryland(马里兰大学); NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: CVPR 2025
点击查看摘要
Abstract:Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.
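The all-layer conditioning is easy to prototype. Below is a minimal PyTorch sketch, assuming a Hugging Face decoder-only checkpoint (GPT-2 here purely as a stand-in): hidden states from every layer are layer-normalized and averaged to form the conditioning sequence. The paper's exact models and pooling details may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder decoder-only LLM
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

@torch.no_grad()
def prompt_embedding(prompt: str) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each [1, seq, dim]
    layers = torch.stack(out.hidden_states, dim=0)   # [L+1, 1, seq, dim]
    # Layer-normalize each layer's states, then average across layers.
    normed = F.layer_norm(layers, layers.shape[-1:])
    return normed.mean(dim=0).squeeze(0)             # [seq, dim] conditioning
```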
zh
[NLP-91] GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors USENIX-SECURITY’25
Quick read: This paper addresses adversarial attacks on AI-generated text (AIGT) detectors, i.e., generating text that evades detection. The key to the solution is GradEscape, a gradient-based evader: it overcomes the non-differentiable computation caused by the discrete nature of text through a weighted-embedding construction technique, and updates the evader model's parameters using feedback from the victim detector, achieving a high attack success rate with minimal text modification. To handle tokenizer mismatch between the evader and the detector, the paper further introduces a warm-started evader method that lets GradEscape adapt to detectors built on any language-model architecture.
Link: https://arxiv.org/abs/2506.08188
Authors: Wenlong Meng,Shuguo Fan,Chengkun Wei,Min Chen,Yuwei Li,Yuanchao Zhang,Zhikun Zhang,Wenzhi Chen
Affiliations: Zhejiang University (浙江大学); Vrije Universiteit Amsterdam (阿姆斯特丹自由大学); National University of Defense Technology (国防科技大学); Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation (安徽省网络空间安全态势感知与评估重点实验室); Mybank, Ant Group (蚂蚁集团我的银行)
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted to USENIX Security’25
Abstract:In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the undifferentiable computation problem, caused by the discrete nature of text, by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the issue of tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors across any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, facilitating effective evasion even in query-only access. We evaluate GradEscape on four datasets and three widely-used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including with an 11B paraphrase model, while utilizing only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source our GradEscape for developing more robust AIGT detectors.
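The weighted-embedding construction is the core trick that makes the attack differentiable. Here is a sketch under our own assumptions about shapes: the evader's vocabulary distribution is relaxed into a softmax-weighted mixture of the detector's embedding rows, so gradients can flow from the detector back to the evader. The temperature and names are illustrative.

```python
import torch

def weighted_embeddings(evader_logits: torch.Tensor,
                        embedding_matrix: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """evader_logits: [batch, seq, vocab]; embedding_matrix: [vocab, dim]."""
    probs = torch.softmax(evader_logits / temperature, dim=-1)
    # Differentiable surrogate for a discrete embedding lookup: [batch, seq, dim]
    return probs @ embedding_matrix

# loss = detector(inputs_embeds=weighted_embeddings(logits, E)).loss
# loss.backward()  # gradients now reach the evader's parameters
```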
zh
[NLP-92] Unable to forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length
Quick read: This paper studies how intra-context interference affects information retrieval in large language models (LLMs): with long contexts, models struggle to retrieve the most recent information accurately. The key to the solution is adapting the proactive interference (PI) paradigm from cognitive science, sequentially streaming semantically related key-value updates and querying only the final values. Experiments show that LLM retrieval accuracy declines log-linearly as interference accumulates, with errors arising from retrieving previously overwritten values, revealing a fundamental limit on LLMs' ability to disentangle interference and flexibly manipulate information.
Link: https://arxiv.org/abs/2506.08184
Authors: Chupei Wang(University of Virginia),Jiaqiu Vince Sun(New York University)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs’ ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models’ ability to suppress irrelevant content during retrieval.
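A sketch of the PI-style probe, with made-up keys and values, illustrating how semantically related key-value updates are streamed before querying only the final values; the exact prompt wording and key sets used in PI-LLM are not reproduced here.

```python
import random

def build_pi_prompt(keys, n_updates, rng=random.Random(0)):
    """Stream overwriting key-value updates; query only the final values."""
    updates, final = [], {}
    for step in range(n_updates):
        for k in keys:
            v = f"value_{k}_{step}_{rng.randint(0, 9999)}"
            updates.append(f"{k} = {v}")
            final[k] = v                      # later updates overwrite earlier
    prompt = "\n".join(updates) + \
        "\n\nFor each key above, report only its most recent value."
    return prompt, final                      # `final` is the ground truth

prompt, answers = build_pi_prompt(["color", "city", "animal"], n_updates=5)
```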
zh
[NLP-93] LLM-BT: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding
Quick read: This paper addresses the difficulty that traditional expert-driven terminology standardization cannot keep up with the rapid growth of English technical terms in fast-evolving fields such as artificial intelligence (AI) and quantum computing, where manual methods struggle to ensure multilingual consistency. The key to the solution is LLM-BT, a back-translation framework powered by large language models (LLMs) that verifies and standardizes terminology via cross-lingual semantic alignment. Its core innovations are term-level consistency validation, a multi-path verification workflow for stronger cross-lingual robustness, and the conceptualization of back-translation as dynamic semantic embedding, together supporting multilingual terminology governance and human-AI collaboration.
Link: https://arxiv.org/abs/2506.08174
Authors: Li Weigang,Pedro Carvalho Brom
Affiliations: University of Brasilia (巴西利亚大学); Federal Institute of Brasilia (巴西利亚联邦理工学院)
Subjects: Computation and Language (cs.CL)
Comments: 23 pages
Abstract:The rapid growth of English technical terms challenges traditional expert-driven standardization, especially in fast-evolving fields like AI and quantum computing. Manual methods struggle to ensure multilingual consistency. We propose LLM-BT, a back-translation framework powered by large language models (LLMs) to automate terminology verification and standardization via cross-lingual semantic alignment. Our contributions are: (1) Term-Level Consistency Validation: using English → intermediate language → English back-translation, LLM-BT achieves high term consistency across models (e.g., GPT-4, DeepSeek, Grok), with case studies showing over 90% exact or semantic matches. (2) Multi-Path Verification Workflow: a novel “Retrieve–Generate–Verify–Optimize” pipeline integrates serial (e.g., EN → ZH-cn → ZH-tw → EN) and parallel (e.g., EN → Chinese/Portuguese → EN) BT routes. BLEU and term accuracy indicate strong cross-lingual robustness (BLEU 0.45; Portuguese accuracy 100%). (3) Back-Translation as Semantic Embedding: BT is conceptualized as dynamic semantic embedding, revealing latent meaning trajectories. Unlike static embeddings, LLM-BT provides transparent path-based embeddings shaped by model evolution. LLM-BT transforms back-translation into an active engine for multilingual terminology standardization, enabling human-AI collaboration: machines ensure semantic fidelity, humans guide cultural interpretation. This infrastructure supports terminology governance across scientific and technological fields worldwide.
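A toy version of the term-level round-trip check follows. The tiny dictionary stands in for whatever LLM or MT backend performs the translation, and the normalized exact-match comparison is far cruder than the paper's semantic matching; both are assumptions for illustration only.

```python
# Toy MT table standing in for an LLM/MT service.
TOY_MT = {("en", "zh"): {"transformer": "变换器"},
          ("zh", "en"): {"变换器": "transformer"}}

def translate(term: str, src: str, tgt: str) -> str:
    return TOY_MT[(src, tgt)].get(term, term)

def back_translation_consistent(term_en: str, pivot: str = "zh") -> bool:
    """EN -> pivot -> EN round trip; flag terms whose translation drifts."""
    round_trip = translate(translate(term_en, "en", pivot), pivot, "en")
    return round_trip.strip().lower() == term_en.strip().lower()

assert back_translation_consistent("transformer")
```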
zh
[NLP-94] Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction
Quick read: This paper addresses the lack of literary-value assessment for AI-generated microfiction, in particular the absence of systematic evaluation of literary criteria such as aesthetic quality. The key to the solution is GrAImes, an evaluation protocol grounded in literary theory that builds an objective framework around thematic coherence, textual clarity, interpretive depth, and aesthetic quality, enabling effective assessment of the literary value of AI-generated microfiction.
Link: https://arxiv.org/abs/2506.08172
Authors: Gerardo Aleman Manzanarez,Nora de la Cruz Arana,Jorge Garcia Flores,Yobany Garcia Medina,Raul Monroy,Nathalie Pernelle
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 28 pages, 16 figures. Submitted to Applied Sciences
Abstract:Automated story writing has been a subject of study for over 60 years. Large language models can generate narratively consistent and linguistically coherent short fiction texts. Despite these advancements, rigorous assessment of such outputs for literary merit - especially concerning aesthetic qualities - has received scant attention. In this paper, we address the challenge of evaluating AI-generated microfictions and argue that this task requires consideration of literary criteria across various aspects of the text, such as thematic coherence, textual clarity, interpretive depth, and aesthetic quality. To facilitate this, we present GrAImes: an evaluation protocol grounded in literary theory, specifically drawing from a literary perspective, to offer an objective framework for assessing AI-generated microfiction. Furthermore, we report the results of our validation of the evaluation protocol, as answered by both literature experts and literary enthusiasts. This protocol will serve as a foundation for evaluating automatically generated microfictions and assessing their literary value.
zh
[NLP-95] ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding
Quick read: This paper targets the limited efficiency and scalability of knowledge preservation in Continual Knowledge Graph Embedding (CKGE), specifically (1) suboptimal knowledge preservation between snapshots, because manually designed node/relation importance scores ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for importance calculation, leading to slow training and high memory overhead. The key to the solution, ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), is a set of learnable task-driven tokens that directly capture task-relevant signals, removing the need for explicit node scoring or graph traversal; embedding alignment and knowledge transfer between snapshots are achieved through simple matrix operations, substantially improving training efficiency and scalability.
Link: https://arxiv.org/abs/2506.08158
Authors: Lijing Zhu,Qizhen Lan,Qing Tian,Wenbo Sun,Li Yang,Lu Xia,Yixin Xie,Xi Xiao,Tiehang Duan,Cui Tao,Shuteng Niu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), a novel task-guided CKGE method that leverages efficient task-driven tokens for efficient and effective knowledge transfer between snapshots. Our method introduces a set of learnable tokens that directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal. These tokens serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance, while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at: this https URL
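To illustrate what "knowledge transfer through simple matrix operations" could look like, here is a PyTorch sketch under our own shape assumptions: learnable tokens induce soft weights over entities, and alignment between snapshots is enforced on the token-weighted embeddings, with no graph traversal or per-node scoring. This is one plausible instantiation, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TaskTokens(nn.Module):
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def importance(self, entity_emb: torch.Tensor) -> torch.Tensor:
        # [num_tokens, dim] @ [dim, num_entities] -> soft weights over entities
        scores = self.tokens @ entity_emb.t()
        return torch.softmax(scores, dim=-1)     # [num_tokens, num_entities]

def alignment_loss(tokens: TaskTokens,
                   old_emb: torch.Tensor,
                   new_emb: torch.Tensor) -> torch.Tensor:
    """Align only what the tokens deem task-relevant: plain matrix products."""
    w = tokens.importance(old_emb.detach())      # token-derived weights
    return ((w @ new_emb - w @ old_emb.detach()) ** 2).mean()
```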
zh
[NLP-96] Multilingual Hate Speech Detection in Social Media Using Translation-Based Approaches with Large Language Models
Quick read: This paper addresses the shortage of multilingual hate speech detection research, especially for under-studied languages such as Urdu. The key to the solution is a trilingual dataset covering English, Urdu, and Spanish, combined with attention mechanisms and generative AI models (such as GPT-3.5 Turbo and Qwen 2.5 72B) for feature extraction and classification, improving multilingual hate speech detection performance.
Link: https://arxiv.org/abs/2506.08147
Authors: Muhammad Usman,Muhammad Ahmad,M. Shahiki Tash,Irina Gelbukh,Rolando Quintero Tellez,Grigori Sidorov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Social media platforms are critical spaces for public discourse, shaping opinions and community dynamics, yet their widespread use has amplified harmful content, particularly hate speech, threatening online safety and inclusivity. While hate speech detection has been extensively studied in languages like English and Spanish, Urdu remains underexplored, especially using translation-based approaches. To address this gap, we introduce a trilingual dataset of 10,193 tweets in English (3,834 samples), Urdu (3,197 samples), and Spanish (3,162 samples), collected via keyword filtering, with a balanced distribution of 4,849 Hateful and 5,344 Not-Hateful labels. Our methodology leverages attention layers as a precursor to transformer-based models and large language models (LLMs), enhancing feature extraction for multilingual hate speech detection. For non-transformer models, we use TF-IDF for feature extraction. The dataset is benchmarked using state-of-the-art models, including GPT-3.5 Turbo and Qwen 2.5 72B, alongside traditional machine learning models like SVM and other transformers (e.g., BERT, RoBERTa). Three annotators, following rigorous guidelines, ensured high dataset quality, achieving a Fleiss’ Kappa of 0.821. Our approach, integrating attention layers with GPT-3.5 Turbo and Qwen 2.5 72B, achieves strong performance, with macro F1 scores of 0.87 for English (GPT-3.5 Turbo), 0.85 for Spanish (GPT-3.5 Turbo), 0.81 for Urdu (Qwen 2.5 72B), and 0.88 for the joint multilingual model (Qwen 2.5 72B). These results reflect improvements of 8.75% in English (over SVM baseline 0.80), 8.97% in Spanish (over SVM baseline 0.78), 5.19% in Urdu (over SVM baseline 0.77), and 7.32% in the joint multilingual model (over SVM baseline 0.82). Our framework offers a robust solution for multilingual hate speech detection, fostering safer digital communities worldwide.
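The classical baseline in the comparison, TF-IDF features plus a linear SVM scored with macro F1, takes a few lines of scikit-learn. The n-gram range and the data loading below are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

def train_baseline(train_texts, train_labels, test_texts, test_labels):
    # TF-IDF features feeding a linear SVM, as in the non-transformer baseline.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                        LinearSVC())
    clf.fit(train_texts, train_labels)
    preds = clf.predict(test_texts)
    return f1_score(test_labels, preds, average="macro")
```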
zh
[NLP-97] AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
Quick read: This paper tackles the scarcity of high-quality training and evaluation data for building AI co-scientists for data-driven scientific discovery. The key to the solution is AutoSDT, an automatic pipeline that leverages the coding capabilities and parametric knowledge of large language models (LLMs) to collect high-quality coding tasks from real-world discovery workflows and to synthesize accurate task instructions and code solutions, yielding AutoSDT-5K, currently the only automatically collected and the largest open dataset for data-driven scientific discovery.
Link: https://arxiv.org/abs/2506.08140
Authors: Yifei Li,Hanane Nour Moussa,Ziru Chen,Shijie Chen,Botao Yu,Mingyi Xue,Benjamin Burns,Tzu-Yao Chiu,Vishal Dey,Zitong Lu,Chen Wei,Qianheng Zhang,Tianyu Zhang,Song Gao,Xuhui Huang,Xia Ning,Nesreen K. Ahmed,Ali Payani,Huan Sun
Affiliations: Department of Computer Science and Engineering; College of Pharmacy; Department of Psychology; Department of Biomedical Informatics; Department of Geography; Department of Chemistry; Cisco Research; University of Wisconsin–Madison; The Ohio State University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, show substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.
zh
[NLP-98] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Quick read: This paper addresses the problem of evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments, where agents must navigate live websites, parse structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. The key to the solution is EconWebArena, a benchmark of 360 curated tasks from 82 authoritative websites covering macroeconomics, labor, finance, trade, and public policy; candidate tasks are first generated by large language models and then rigorously curated by humans to ensure clarity, feasibility, and source reliability.
Link: https://arxiv.org/abs/2506.08136
Authors: Zefang Liu,Yinzhu Quan
Affiliations: Georgia Institute of Technology (佐治亚理工学院)
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.
zh
[NLP-99] Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning
Quick read: This paper addresses the inefficiency of large language models during reasoning, whose outputs are often verbose or redundant and waste computation. The key to the solution is Bingo, a reinforcement learning (RL) framework that advances length-based reward design through two mechanisms: a significance-aware length reward that gradually guides the model to prune only insignificant tokens, and a dynamic length reward that initially encourages elaborate reasoning on hard questions and then decays over time to improve overall efficiency.
Link: https://arxiv.org/abs/2506.08125
Authors: Hanbing Liu,Lang Cao,Yuanyi Ren,Mengyu Zhou,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang
Affiliations: Tsinghua University (清华大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Peking University (北京大学); Microsoft (微软)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-based rewards to encourage brevity, but this often leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL framework that advances length-based reward design to boost efficient reasoning. Bingo incorporates two key mechanisms: a significance-aware length reward, which gradually guides the model to reduce only insignificant tokens, and a dynamic length reward, which initially encourages elaborate reasoning for hard questions but decays over time to improve overall efficiency. Experiments across multiple reasoning benchmarks show that Bingo improves both accuracy and efficiency. It outperforms the vanilla reward and several other length-based reward baselines in RL, achieving a favorable trade-off between accuracy and efficiency. These results underscore the potential of training LLMs explicitly for efficient reasoning.
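A sketch of what the two reward mechanisms could look like combined into one scalar. The per-token significance scores, the constants, and the linear decay schedule are all assumptions made for illustration; the paper's exact reward is not reproduced here.

```python
def bingo_style_reward(correct: bool,
                       token_significance: list[float],  # in [0, 1] per token
                       is_hard: bool,
                       step: int,
                       total_steps: int,
                       alpha: float = 0.01,
                       beta: float = 0.005) -> float:
    base = 1.0 if correct else 0.0
    # Significance-aware: penalize only tokens judged insignificant.
    penalty = alpha * sum(1.0 - s for s in token_significance)
    # Dynamic: a length bonus for hard questions that decays over training.
    decay = max(0.0, 1.0 - step / total_steps)
    bonus = beta * decay * len(token_significance) if is_hard else 0.0
    return base - penalty + bonus
```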
zh
[NLP-100] QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
Quick read: This paper addresses the loss of interpretability that arises when aligning large language models with explicit principles (such as helpfulness, honesty, and harmlessness), since standard reward-based methods collapse diverse feedback into a single scalar reward. The key to the solution is QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism: it formulates principle-specific evaluation questions and derives a separate reward component for each principle, enabling a more transparent and adjustable alignment process.
Link: https://arxiv.org/abs/2506.08123
Authors: Jacob Dineen(1),Aswin RRV(1),Qin Liu(2),Zhikun Xu(1),Xiao Ye(1),Ming Shen(1),Zhaonan Li(1),Shijie Lu(1),Chitta Baral(1),Muhao Chen(2),Ben Zhou(1) ((1) Arizona State University, (2) University of California Davis)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.
zh
[NLP-101] Conservative Bias in Large Language Models: Measuring Relation Predictions
Quick read: This paper studies the conservative bias that large language models (LLMs) exhibit in relation extraction: when no suitable relation option is available, models default to the No_Relation label, which prevents wrong assignments but causes information loss. The key to the solution is a systematic evaluation of this trade-off across prompts, datasets, and relation types, introducing the notion of Hobson's choice to capture scenarios where models prefer safe but uninformative labels over hallucinated ones; SBERT and LLM prompts are used to quantify the semantic similarity between conservative-bias behaviors under constrained prompts and the labels generated under semi-constrained and open-ended prompts.
Link: https://arxiv.org/abs/2506.08120
Authors: Toyin Aguda,Erik Wilson,Allan Anzagira,Simerjot Kaur,Charese Smiley
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 10 pages
Abstract:Large language models (LLMs) exhibit pronounced conservative bias in relation extraction tasks, frequently defaulting to No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson’s choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.
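The SBERT-based similarity measurement reduces to a cosine similarity between sentence embeddings, as in the minimal sketch below; the checkpoint name is a common default for sentence-transformers, not necessarily the one used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def label_similarity(constrained_label: str, open_label: str) -> float:
    """Cosine similarity between labels from constrained vs. open prompts."""
    emb = model.encode([constrained_label, open_label], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(label_similarity("No_Relation", "founder of the company"))
```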
zh
[NLP-102] Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval KDD’25
Quick read: This paper addresses the challenge of piecing together information across semantically distant documents to produce accurate answers, where traditional retrieval-augmented generation (RAG) struggles on complex queries requiring multi-hop reasoning. The key to the solution is the Hierarchical Lexical Graph (HLG), a three-tier index that traces every atomic proposition to its source, clusters propositions into latent topics, and links entities and relations to expose cross-document paths. On top of HLG, two complementary plug-and-play retrievers are built: StatementGraphRAG for high-precision factoid questions and TopicGraphRAG for gathering broad yet relevant context for exploratory queries.
Link: https://arxiv.org/abs/2506.08074
Authors: Abdellah Ghassel,Ian Robinson,Gabriel Tanase,Hal Cooper,Bryan Thompson,Zhen Han,Vassilis N. Ioannidis,Soji Adeshina,Huzefa Rangwala
Affiliations: Queen’s University (皇后大学); Amazon (亚马逊)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: KDD '25
Abstract:Retrieval-Augmented Generation (RAG) grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG achieving an average relative improvement of 23.1% in retrieval recall and correctness. Open-source Python library is available at this https URL.
zh
[NLP-103] Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
Quick read: This paper addresses modality imbalance in the reasoning of large multimodal models (LMMs), where language priors outweigh visual inputs, limiting generalization to downstream tasks and causing hallucinations. The key to the solution is Modality-Balancing Preference Optimization (MBPO), a preference-learning framework that builds a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases, through adversarial perturbation of input images; it additionally exploits the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards, and trains the model on the hybrid offline-online data for more balanced multimodal reasoning.
Link: https://arxiv.org/abs/2506.08022
Authors: Chenxi Liu,Tianyi Xiong,Ruibo Chen,Yihan Wu,Junfeng Guo,Tianyi Zhou,Heng Huang
Affiliations: University of Maryland, College Park (马里兰大学学院市分校)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., outweighing language prior biases over visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
zh
[NLP-104] Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework
Quick read: This paper addresses the significant distribution shifts of the student model during knowledge distillation (KD) for compressing large language models (LLMs), which cause catastrophic forgetting, mode collapse, and training-inference mismatch. The key to the solution is a plug-in curriculum learning framework inspired by the strength-training principle of "progressive overload" (POCL): a difficulty measurer ranks and partitions training samples from easy to hard, and a training scheduler incrementally introduces these subsets at fixed intervals while applying loss functions with progressively rising temperatures, improving both the stability and the efficiency of learning.
Link: https://arxiv.org/abs/2506.05695
Authors: Lingyuan Liu,Mengxiang Zhang
Affiliations: City University of Hong Kong (香港城市大学); The University of Hong Kong (香港大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model’s capabilities to a smaller student model, reducing inference cost and memory usage while maintaining performance. However, existing KD methods for LLMs often fail to prevent significant shifts in the student model’s distribution during training, leading to issues such as catastrophic forgetting, mode collapse, and training-inference mismatch. To address these challenges, we propose a novel, plug-in curriculum learning framework inspired by the strength training principle of “progressive overload” (POCL), which can be seamlessly integrated into existing white-box KD approaches with minimal computational overhead. The framework comprises two core components: (1) a difficulty measurer that ranks and partitions training samples from easy to hard, and (2) a training scheduler that incrementally introduces these subsets into the distillation process at fixed intervals while applying loss functions with progressively rising temperatures. By starting with the easiest samples and progressively increasing the difficulty, the approach enhances both the stability and efficiency of learning. Extensive experiments in instruction-following settings demonstrate that POCL consistently improves the performance of distilled student models across various white-box KD methods and model families. Our findings highlight the effectiveness of sorted training samples in KD for LLMs. More generally, our work demonstrates how to structure training data within the KD process to enhance the stability and performance of distilled LLMs.
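A sketch of the curriculum mechanics follows, under our own assumptions about the difficulty function, bucket count, and temperature range: samples are sorted easy-to-hard, introduced cumulatively at each stage, and paired with a rising distillation temperature.

```python
def pocl_schedule(samples, difficulty, n_buckets=4,
                  t_start=1.0, t_end=4.0):
    """Yield (active_samples, temperature) for each curriculum stage."""
    ordered = sorted(samples, key=difficulty)          # easy -> hard
    size = (len(ordered) + n_buckets - 1) // n_buckets
    for stage in range(n_buckets):
        active = ordered[: size * (stage + 1)]         # cumulative curriculum
        temp = t_start + (t_end - t_start) * stage / max(1, n_buckets - 1)
        yield active, temp

# for batch, temperature in pocl_schedule(data, my_difficulty_fn):
#     loss = kd_loss(student_logits / temperature, teacher_logits / temperature)
```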
zh
[NLP-105] Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion
Quick read: This paper addresses the limited effectiveness of query expansion for sparse retrieval, where existing methods depend on high-quality generated documents, complex prompt strategies, and advanced dense retrieval techniques that are costly and computationally heavy. The key to the solution is zero-shot LLM-based query expansion combined with a novel fusion ranking framework, Exp4Fuse, which considers two retrieval routes simultaneously, one based on the original query and one on the LLM-augmented query, produces two ranked lists with a sparse retriever, and fuses them with a modified reciprocal rank fusion method, effectively improving sparse retrieval performance.
Link: https://arxiv.org/abs/2506.04760
Authors: Lingyuan Liu,Mengxiang Zhang
Affiliations: City University of Hong Kong (香港城市大学); The University of Hong Kong (香港大学)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes-one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
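Reciprocal rank fusion over the two routes is simple to sketch. The constant k=60 and the per-route weights below are conventional defaults rather than the paper's values, and the "modified" variant used by Exp4Fuse may differ in detail.

```python
def rrf_fuse(run_original, run_expanded, k=60, w_orig=1.0, w_exp=1.0):
    """Fuse two ranked lists of doc ids with weighted reciprocal rank fusion."""
    scores = {}
    for w, run in ((w_orig, run_original), (w_exp, run_expanded)):
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(fused)  # documents ranked highly by both routes rise to the top
```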
zh
[NLP-106] Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement
Quick read: This paper asks how to improve human players' engagement and interaction in LLM-based social deduction games such as Werewolf. Existing work relies on fine-tuning, advanced prompt engineering, or extra experience pools to deliver an engaging text-format game experience, whereas this paper proposes a novel and simple LLM-based Werewolf game system whose key is tuned text-to-speech (TTS) models that improve compatibility with various LLMs and enhance user interaction; the authors argue that as LLM reasoning keeps improving, such additional components will become unnecessary for games like Werewolf.
Link: https://arxiv.org/abs/2506.00160
Authors: Qihui Fan,Enfu Nan,Wenbo Li,Lei Lu,Pu Zhao,Yanzhi Wang
Affiliations: Northeastern University (东北大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The growing popularity of social deduction game systems for both business applications and AI research has greatly benefited from the rapid advancements in Large Language Models (LLMs), which now demonstrate stronger reasoning and persuasion capabilities. Especially with the rise of the DeepSeek R1 and V3 models, LLMs should enable a more engaging experience for human players in LLM-agent-based social deduction games like Werewolf. Previous works rely on fine-tuning, advanced prompt engineering, or an additional experience pool to achieve an engaging text-format Werewolf game experience. We propose a novel yet straightforward LLM-based Werewolf game system with tuned Text-to-Speech (TTS) models designed for enhanced compatibility with various LLM models, and improved user engagement. We argue that, with ever-enhancing LLM reasoning, such extra components will be unnecessary in the case of Werewolf.
zh
[NLP-107] EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements
Quick read: This paper addresses the scarcity of high-quality, challenging financial datasets, especially Japanese financial data, which limits large language model (LLM) research and evaluation in finance. The key to the solution is EDINET-Bench, an open-source Japanese financial benchmark built by downloading the past 10 years of annual reports from Japan's Electronic Disclosure for Investors' NETwork (EDINET) and automatically assigning labels for tasks such as accounting fraud detection, earnings forecasting, and industry prediction, supporting the evaluation of LLMs on complex financial tasks.
Link: https://arxiv.org/abs/2506.08762
Authors: Issa Sugiura,Takashi Ishida,Taro Makino,Chieko Tazuke,Takanori Nakagawa,Kosuke Nakago,David Ha
Affiliations: Sakana AI(サカナAI); Kyoto University(京都大学)
Subjects: Statistical Finance (q-fin.ST); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Financial analysis presents complex challenges that could leverage large language model (LLM) capabilities. However, the scarcity of challenging financial datasets, particularly for Japanese financial data, impedes academic innovation in financial analytics. As LLMs advance, this lack of accessible research resources increasingly hinders their development and evaluation in this specialized domain. To address this gap, we introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate the performance of LLMs on challenging financial tasks including accounting fraud detection, earnings forecasting, and industry prediction. EDINET-Bench is constructed by downloading annual reports from the past 10 years from Japan’s Electronic Disclosure for Investors’ NETwork (EDINET) and automatically assigning labels corresponding to each evaluation task. Our experiments reveal that even state-of-the-art LLMs struggle, performing only slightly better than logistic regression in binary classification for fraud detection and earnings forecasting. These results highlight significant challenges in applying LLMs to real-world financial applications and underscore the need for domain-specific adaptation. Our dataset, benchmark construction code, and evaluation code are publicly available to facilitate future research in finance with LLMs.
zh
[NLP-108] Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs INTERSPEECH2025
Quick read: This paper tackles spoken Dialogue State Tracking (DST) with fully open-source and open-data components, in particular improving performance on named entities in dialogue slot values. The key to the solution is aligning the representation spaces of a speech encoder (WavLM-large) and a large language model (OLMo) through a small connector module, combined with ablations over full versus LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, and fuzzy matching-based output post-processing, which together substantially improve system performance.
Link: https://arxiv.org/abs/2506.08633
Authors: Šimon Sedláček,Bolaji Yusuf,Ján Švec,Pradyoth Hegde,Santosh Kesiraju,Oldřich Plchot,Jan Černocký
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2025
Abstract:In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
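Fuzzy-matching post-processing for slot values can be as simple as snapping each predicted value onto the closest entry in a slot's catalog. difflib is one possible matcher and the cutoff is an assumption; the paper's exact matching procedure is not specified here.

```python
import difflib

def snap_slot_value(predicted: str, catalog: list[str],
                    cutoff: float = 0.6) -> str:
    """Replace a noisy predicted value with its closest catalog entry."""
    match = difflib.get_close_matches(predicted, catalog, n=1, cutoff=cutoff)
    return match[0] if match else predicted

print(snap_slot_value("hotel du vinn",
                      ["hotel du vin and bistro", "acorn guest house"]))
```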
zh
[NLP-109] Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain
Quick read: This paper addresses the insufficient alignment between multimodal large language models (MLLMs) and brain activity under multimodal stimulus settings, especially the limitations of non-instruction-tuned multimodal models and unimodal models. The key to the solution is to use instruction-tuned multimodal models to generate task-specific representations and evaluate how well they predict neural activity evoked by naturalistic movies (video with audio), confirming the important role of instruction tuning in improving the consistency between models and functional processing in the brain.
Link: https://arxiv.org/abs/2506.08277
Authors: Subba Reddy Oota,Khushbu Pahwa,Prachi Jindal,Satya Sai Srinath Namburi,Maneesh Singh,Tanmoy Chakraborty,Bapi S. Raju,Manish Gupta
Affiliations: Technische Universität Berlin, Germany; Rice University, USA; AWS AI Labs, Amazon; IIT Delhi, India; University of Wisconsin - Madison, USA; Spector Inc, USA; IIIT-Hyderabad, India; Microsoft, India
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 39 pages, 22 figures
Abstract:Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, measuring the degree of predictivity of neural activity recorded while participants were watching naturalistic movies (video along with audio) with representations derived from MLLMs. We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both the systems. We make the code publicly available [this https URL].
zh
Computer Vision
[CV-0] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Quick read: This paper targets the core challenge of coordinating multiple embodied agents in dynamic environments, which requires perception-driven reasoning and scalable cooperation strategies. Existing work mainly uses large language models (LLMs) for multi-agent planning, with limited exploration of vision-language models (VLMs) for visual reasoning and limited support for diverse embodiment types. The key to the solution is VIKI-Bench, a hierarchical benchmark for embodied multi-agent cooperation with three structured levels (agent activation, task planning, and trajectory perception), combining diverse robot embodiments, multi-view visual observations, and structured supervision signals. The paper further proposes VIKI-R, which fine-tunes a pretrained VLM on Chain-of-Thought annotated demonstrations and then applies reinforcement learning under multi-level reward signals to improve vision-driven cooperation.
Link: https://arxiv.org/abs/2506.09049
Authors: Li Kang,Xiufeng Song,Heng Zhou,Yiran Qin,Jie Yang,Xiaohong Liu,Philip Torr,Lei Bai,Zhenfei Yin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL
Abstract:Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
zh
[CV-1] MagCache: Fast Video Generation with Magnitude-Aware Cache
Quick read: This paper addresses two problems in accelerating video diffusion models: reliance on uniform heuristics or time-embedding variants that require many calibration samples, and inconsistent outputs caused by prompt-specific overfitting. The key to the solution is the discovery of a unified magnitude law: the magnitude ratio of successive residual outputs decreases monotonically and steadily over most timesteps and drops rapidly in the last few. Building on this, the Magnitude-aware Cache (MagCache) adaptively skips unimportant timesteps using an error-modeling mechanism and an adaptive caching strategy, needs only a single sample for calibration, and delivers significant speedups while preserving high visual fidelity.
Link: https://arxiv.org/abs/2506.09045
Authors: Zehong Ma,Longhui Wei,Feng Wang,Shiliang Zhang,Qi Tian
Affiliations: Huawei Inc. (华为公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically and steadily in most timesteps while rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under comparable computational budgets.
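A sketch of a magnitude-aware skipping plan in the spirit of the abstract: timesteps are skipped (the cached residual reused) while the modeled error, accumulated from deviations of the magnitude ratio, stays under a budget. The error model and the threshold are our assumptions, not MagCache's exact formulation.

```python
def magcache_plan(residual_norms, error_budget=0.06):
    """residual_norms[t] = ||r_t||; return the timesteps to skip."""
    skip, acc_err = [], 0.0
    for t in range(1, len(residual_norms)):
        ratio = residual_norms[t] / max(residual_norms[t - 1], 1e-8)
        est_err = abs(1.0 - ratio)      # deviation from "reuse is exact"
        if acc_err + est_err < error_budget:
            skip.append(t)              # reuse cached residual at step t
            acc_err += est_err
        else:
            acc_err = 0.0               # recompute this step, reset the model
    return skip

print(magcache_plan([1.0, 0.98, 0.96, 0.95, 0.60, 0.20]))
```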
zh
[CV-2] Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models
Quick read: This paper addresses the time and cost of collecting and annotating real-world data for safety-critical physical AI systems such as autonomous vehicles, where rare edge cases are especially hard to capture. The key to the solution is Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline built on Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain, capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation; it scales the quantity and diversity of driving datasets, mitigates long-tail distribution problems, and improves generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning.
Link: https://arxiv.org/abs/2506.09042
Authors: Xuanchi Ren,Yifan Lu,Tianshi Cao,Ruiyuan Gao,Shengyu Huang,Amirmojtaba Sabour,Tianchang Shen,Tobias Pfaff,Jay Zhangjie Wu,Runjian Chen,Seung Wook Kim,Jun Gao,Laura Leal-Taixe,Mike Chen,Sanja Fidler,Huan Ling
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao: Equal contribution. Only the core contributors are listed. The full list of contributors can be found in Appendix A of this paper
Abstract:Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicle (AV), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce the Cosmos-Drive-Dreams - a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain that are capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection and driving policy learning. We open source our pipeline toolkit, dataset and model weights through the NVIDIA’s Cosmos platform. Project page: this https URL
zh
[CV-3] Princeton365: A Diverse Dataset with Accurate Camera Pose
Quick read: This paper addresses the gap between accuracy and data diversity in current SLAM (Simultaneous Localization and Mapping) benchmarks. The key to the solution is a novel ground-truth collection framework based on calibration boards and a 360-degree camera, yielding Princeton365, a large-scale diverse dataset of indoor, outdoor, and object-scanning videos with synchronized monocular and stereo RGB outputs plus IMU data. The paper also proposes a new scene scale-aware SLAM evaluation metric based on the optical flow induced by camera pose estimation error; unlike existing metrics, it enables comparison of SLAM methods across scenes and helps analyze their failure modes.
Link: https://arxiv.org/abs/2506.09035
Authors: Karhan Kayan,Stamatis Alexandropoulos,Rishabh Jain,Yiming Zuo,Erich Liang,Jia Deng
Affiliations: Princeton University (普林斯顿大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to the current metrics, our new metric allows for comparison between the performance of SLAM methods across scenes as opposed to existing metrics such as Average Trajectory Error (ATE), allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit this https URL for the dataset, code, videos, and submission.
zh
[CV-4] Diffuse and Disperse: Image Generation with Representation Regularization
Quick read: This paper addresses the lack of progress in representation learning within diffusion models, which traditionally rely on regression objectives and lack explicit regularization. The key to the solution is Dispersive Loss, a simple plug-and-play regularizer that encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning but requiring no positive pairs, and therefore not interfering with the regression-based sampling process. Compared with existing methods, it is self-contained and minimalist: no pre-training, no extra parameters, and no external data.
Link: https://arxiv.org/abs/2506.09027
Authors: Runqian Wang,Kaiming He
Affiliations: MIT(麻省理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used and strong baselines. We hope our work will help bridge the gap between generative modeling and representation learning.
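A dispersive regularizer is a few lines of PyTorch. The InfoNCE-like log-mean-exp form and the temperature below are our assumptions about one reasonable instantiation (contrastive-style repulsion with no positive pairs), not necessarily the paper's exact objective.

```python
import torch

def dispersive_loss(h: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """h: [batch, dim] intermediate representations; lower = more dispersed."""
    h = torch.nn.functional.normalize(h, dim=-1)
    d2 = torch.cdist(h, h).pow(2)                    # pairwise sq. distances
    off_diag = ~torch.eye(len(h), dtype=torch.bool, device=h.device)
    # Minimizing log-mean-exp(-d^2 / tau) pushes representations apart.
    return torch.log(torch.exp(-d2[off_diag] / tau).mean())

# total_loss = diffusion_regression_loss + lambda_disp * dispersive_loss(hidden)
```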
zh
[CV-5] DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging
Quick read: This paper addresses out-of-distribution (OOD) detection for machine learning models deployed in safety-critical domains such as medical imaging, where unreliable predictions must be avoided. Existing OOD methods either discard training data after deployment or assume test samples and training data are stored centrally together, which rarely holds in practice. The key to the solution is the Isolation Network, which quantifies how difficult it is to separate a target test sample from the training data by solving a binary classification task, and its extension, Decentralized Isolation Networks (DIsoN), which compares training and test data when data sharing is impossible by exchanging only model parameters between remote nodes, further refined by class-conditioning that compares a sample only with training data of its predicted class.
Link: https://arxiv.org/abs/2506.09024
Authors: Felix Wagner,Pramit Saha,Harry Anthony,J. Alison Noble,Konstantinos Kamnitsas
Affiliations: University of Oxford (牛津大学); Imperial College London (帝国理工学院); University of Birmingham (伯明翰大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code will be available upon acceptance at: *****
zh
[CV-6] Fine-Grained Spatially Varying Material Selection in Images
Quick read: This paper addresses material selection in images, i.e., robust pixel-level selection under lighting and reflectance variations to support downstream image editing. The key to the solution is to build on features from vision transformer (ViT) models with a multi-resolution processing strategy, yielding finer and more stable selections than prior methods. The work also enables selection at two levels, texture and subtexture, supported by the new two-level material selection (DuMaS) dataset with dense annotations for over 800,000 synthetic images.
Link: https://arxiv.org/abs/2506.09023
Authors: Julia Guerrero-Viu,Michael Fischer,Iliyan Georgiev,Elena Garces,Diego Gutierrez,Belen Masia,Valentin Deschaintre
Affiliations: Universidad de Zaragoza - I3A (Spain); Adobe Research (United Kingdom); Adobe Research (France)
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Selection is the first step in many image editing processes, enabling faster and simpler modifications of all pixels sharing a common modality. In this work, we present a method for material selection in images, robust to lighting and reflectance variations, which can be used for downstream editing tasks. We rely on vision transformer (ViT) models and leverage their features for selection, proposing a multi-resolution processing strategy that yields finer and more stable selection results than prior methods. Furthermore, we enable selection at two levels: texture and subtexture, leveraging a new two-level material selection (DuMaS) dataset which includes dense annotations for over 800,000 synthetic images, both on the texture and subtexture levels.
zh
[CV-7] Do MIL Models Transfer? ICML2025
Quick read: This paper addresses the poor performance of Multiple Instance Learning (MIL) models on small, weakly supervised clinical datasets in computational pathology (CPath). While fields such as NLP and conventional computer vision widely use transfer learning to mitigate data scarcity, the transferability of MIL models has remained poorly understood. The key contribution is a systematic evaluation of pretrained MIL models showing that, even when pretrained on organs different from the target task, they consistently outperform models trained from scratch, and that pancancer pretraining generalizes across organs and tasks, outperforming slide foundation models with substantially less pretraining data, demonstrating the strong adaptability of MIL models and the benefit of transfer learning in CPath.
Link: https://arxiv.org/abs/2506.09022
Authors: Daniel Shao,Richard J. Chen,Andrew H. Song,Joel Runevic,Ming Y. Lu,Tong Ding,Faisal Mahmood
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 (Spotlight). 20 pages, 8 figures
Abstract:Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at this https URL
zh
[CV-8] SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction
Quick read: This paper addresses the scalability barrier posed by the high maintenance cost of high-definition (HD) maps for autonomous vehicles by constructing local HD maps online from live sensor data, reducing reliance on expensive offline maps. The key to the solution is to fully exploit widely available standard-definition (SD) map information such as OpenStreetMap to improve far-range detection accuracy, with two innovations: combining polyline SD map data with textual semantic annotations whose features are extracted via NLP, removing dependence on predefined specifications or exhaustive class taxonomies; and a point-level SD map encoder with orthogonal element identifiers that uniformly integrates all types of map elements.
Link: https://arxiv.org/abs/2506.08997
Authors: Fabian Immel,Jan-Hendrik Pauls,Richard Fehler,Frank Bieder,Jonas Merkert,Christoph Stiller
Affiliations: FZI Research Center for Information Technology (FZI信息技术研究中心); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at this https URL
zh
[CV-9] Do Concept Replacement Techniques Really Erase Unacceptable Concepts?
【速读】:该论文试图解决生成式 AI (Generative AI) 在内容对齐方面的问题,即如何有效避免生成不可接受的概念(如冒犯性内容、受版权保护的内容或名人肖像)。现有概念替换技术(Concept Replacement Techniques, CRTs)在文本到图像(T2I)模型中表现出一定的有效性,但在图像到图像(I2I)模型中却未能实现预期的“擦除”效果,这表明当前CRTs在新兴的I2I场景中可能无效。论文指出,一个优秀的CRT应在替换不可接受概念的同时保持输入中其他概念的保真度(fidelity),而现有研究忽视了这一关键因素。解决方案的关键在于利用定向图像编辑技术,以同时实现效果与保真度,文中提出的AntiMirror方法验证了这一思路的可行性。
链接: https://arxiv.org/abs/2506.08991
作者: Anudeep Das,Gurjot Singh,Prach Chantasantitam,N. Asokan
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Generative models, particularly diffusion-based text-to-image (T2I) models, have demonstrated astounding success. However, aligning them to avoid generating content with unacceptable concepts (e.g., offensive or copyrighted content, or celebrity likenesses) remains a significant challenge. Concept replacement techniques (CRTs) aim to address this challenge, often by trying to “erase” unacceptable concepts from models. Recently, model providers have started offering image editing services which accept an image and a text prompt as input, to produce an image altered as specified by the prompt. These are known as image-to-image (I2I) models. In this paper, we first use an I2I model to empirically demonstrate that today’s state-of-the-art CRTs do not in fact erase unacceptable concepts. Existing CRTs are thus likely to be ineffective in emerging I2I scenarios, despite their proven ability to remove unwanted concepts in T2I pipelines, highlighting the need to understand this discrepancy between T2I and I2I settings. Next, we argue that a good CRT, while replacing unacceptable concepts, should preserve other concepts specified in the inputs to generative models. We call this fidelity. Prior work on CRTs have neglected fidelity in the case of unacceptable concepts. Finally, we propose the use of targeted image-editing techniques to achieve both effectiveness and fidelity. We present such a technique, AntiMirror, and demonstrate its viability.
zh
[CV-10] Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models
【速读】:该论文试图解决医学视觉-语言对齐中传统跨模态对比学习方法(如CLIP-based方法)在视觉表征能力上的不足,以及多模态掩码建模预训练模型在直接跨模态匹配上的困难之间的矛盾。解决方案的关键在于提出ALTA(ALign Through Adapting),该方法通过仅使用约8%的可训练参数和不到1/5的计算消耗,对基于掩码记录建模的预训练视觉模型进行适应性调整,从而在视觉-语言匹配任务中实现更优性能。此外,ALTA还引入了时间多视角X光图像输入,以增强影像与报告描述之间的一致性,进一步提升对齐效果。
链接: https://arxiv.org/abs/2506.08990
作者: Chenyu Lian,Hong-Yu Zhou,Dongyun Liang,Jing Qin,Liansheng Wang
机构: Xiamen University (厦门大学); Hong Kong Polytechnic University (香港理工大学); Harvard Medical School (哈佛医学院); Fudan University (复旦大学); Xiamen Municipal Clinical Research Center for Medical Imaging (厦门市医学影像临床研究中心); Fujian Province Key Clinical Specialty for Medical Imaging (福建省医学影像重点临床专科); Xiamen Key Laboratory of Clinical Transformation of Imaging Big Data and Artificial Intelligence (厦门市医学影像大数据与人工智能临床转化重点实验室); National Institute for Data Science in Health and Medicine (健康与医学数据科学国家研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: TMI 2025
点击查看摘要
Abstract:Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at this https URL.
zh
[CV-11] Rethinking Range-View LiDAR Segmentation in Adverse Weather
【速读】:该论文旨在解决在恶劣天气条件下,基于范围视图(range-view)的LiDAR分割方法泛化性能不足的问题,从而提升其在真实环境中的可靠性。解决方案的关键在于提出一种模块化且轻量级的框架,通过重构标准范围视图网络的初始茎块为两个分支,分别处理几何属性和反射强度,其中几何异常抑制(GAS)模块用于减少天气引起的空间噪声影响,反射畸变校准(RDC)模块则通过记忆引导的自适应实例归一化校正反射畸变,最终将处理后的特征融合并传递至原始分割流程,从而在保持模型核心架构不变的前提下显著提升模型在恶劣天气条件下的泛化能力。
链接: https://arxiv.org/abs/2506.08979
作者: Longyu Yang,Ping Hu,Lu Zhang,Jun Liu,Yap-Peng Tan,Heng Tao Shen,Xiaofeng Zhu
机构: University of Electronic Science and Technology of China (电子科技大学); Dalian University of Technology (大连理工大学); Lancaster University (兰卡斯特大学); Nanyang Technological University (南洋理工大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:LiDAR segmentation has emerged as an important task to enrich multimedia experiences and analysis. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalized performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, we identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. To address these challenges, we propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our method reformulates the initial stem block of standard range-view networks into two branches to process geometric attributes and reflectance intensity separately. Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization. The processed features are then fused and passed to the original segmentation pipeline. Extensive experiments on different benchmarks and baseline models demonstrate that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation.
zh
[CV-12] ADAM: Autonomous Discovery and Annotation Model using LLM s for Context-Aware Annotations
【速读】:该论文试图解决传统目标检测模型在开放世界场景中无法识别新物体的问题,因为这些模型依赖于预定义类别。解决方案的关键在于提出ADAM(Autonomous Discovery and Annotation Model),这是一个无需训练的自优化框架,通过利用大语言模型(LLM)生成基于场景中已知实体上下文信息的候选标签,并结合CLIP的视觉嵌入构建嵌入-标签库(ELR),从而实现无需类别监督的推理。
链接: https://arxiv.org/abs/2506.08968
作者: Amirreza Rouhi,Solmaz Arezoomandan,Knut Peterson,Joseph T. Woods,David K. Han
机构: Cranberry-Lemon University (Cranberry-Lemon 大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Object detection models typically rely on predefined categories, limiting their ability to identify novel objects in open-world scenarios. To overcome this constraint, we introduce ADAM: Autonomous Discovery and Annotation Model, a training-free, self-refining framework for open-world object labeling. ADAM leverages large language models (LLMs) to generate candidate labels for unknown objects based on contextual information from known entities within a scene. These labels are paired with visual embeddings from CLIP to construct an Embedding-Label Repository (ELR) that enables inference without category supervision. For a newly encountered unknown object, ADAM retrieves visually similar instances from the ELR and applies frequency-based voting and cross-modal re-ranking to assign a robust label. To further enhance consistency, we introduce a self-refinement loop that re-evaluates repository labels using visual cohesion analysis and k-nearest-neighbor-based majority re-labeling. Experimental results on the COCO and PASCAL datasets demonstrate that ADAM effectively annotates novel categories using only visual and contextual signals, without requiring any fine-tuning or retraining.
zh
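ADAM 的推理核心是"嵌入-标签库(ELR)+ 基于频次的 kNN 投票"。下面用 numpy 给出一个简化示意:其中嵌入以随机向量代替真实的 CLIP 视觉嵌入,类名与维度均为假设:

```python
import numpy as np
from collections import Counter

class EmbeddingLabelRepository:
    """嵌入-标签库(ELR)的极简示意:存储 (嵌入, 标签) 对并做 kNN 频次投票。"""
    def __init__(self):
        self.embs, self.labels = [], []

    def add(self, emb, label):
        self.embs.append(emb / np.linalg.norm(emb))
        self.labels.append(label)

    def vote(self, query, k=5):
        E = np.stack(self.embs)
        sims = E @ (query / np.linalg.norm(query))    # 余弦相似度
        topk = np.argsort(-sims)[:k]
        return Counter(self.labels[i] for i in topk).most_common(1)[0][0]

elr = EmbeddingLabelRepository()
rng = np.random.default_rng(0)
for label in ["cup", "cup", "lamp", "cup", "lamp"]:
    elr.add(rng.normal(size=512), label)              # 实际应为 CLIP 视觉嵌入
print(elr.vote(rng.normal(size=512)))
```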
[CV-13] ORIDa: Object-centric Real-world Image Composition Dataset CVPR2025
【速读】:该论文旨在解决现实世界中图像物体合成(object compositing)任务的数据集不足问题,即现有数据集缺乏足够的多样性和规模以全面探索真实场景。其解决方案的关键在于引入ORIDa(Object-centric Real-world Image Composition Dataset),这是一个大规模、真实采集的图像数据集,包含超过30,000张图像,涵盖200个独特物体,并在不同位置和场景中呈现。ORIDa包含两种类型的数据:事实-反事实集和仅事实场景,其中事实-反事实集通过对比带有物体和不带物体的图像来增强对物体合成的理解,而仅事实场景则扩展了环境多样性。ORIDa是首个具有如此规模和复杂度的公开可用的真实图像合成数据集。
链接: https://arxiv.org/abs/2506.08964
作者: Jinwoo Kim,Sangmin Han,Jinho Jeong,Jiwoo Choi,Dongyoung Kim,Seon Joo Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing.
zh
[CV-14] Data Augmentation For Small Object using Fast AutoAugment
【速读】:该论文旨在解决小目标检测性能显著低于大目标的问题(small object detection performance),这是计算机视觉中的一个关键且具有挑战性的问题。论文提出的解决方案的关键在于采用一种基于Fast AutoAugment的最优数据增强方法,通过该方法能够快速找到克服小目标检测退化问题的最优增强策略,并在DOTA数据集上实现了20%的性能提升。
链接: https://arxiv.org/abs/2506.08956
作者: DaeEun Yoon,Semin Kim,SangWook Yoo,Jongha Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted and published in the USB Proceedings of the 20th International Conference on Modeling Decisions for Artificial Intelligence (MDAI 2023), Umeå, Sweden, June 19–22, 2023, ISBN 978-91-527-7293-5, pp. 12–21
点击查看摘要
Abstract:In recent years, there has been tremendous progress in object detection performance. However, despite these advances, the detection performance for small objects is significantly inferior to that of large objects. Detecting small objects is one of the most challenging and important problems in computer vision. To improve the detection performance for small objects, we propose an optimal data augmentation method using Fast AutoAugment. Through our proposed method, we can quickly find optimal augmentation policies that can overcome degradation when detecting small objects, and we achieve a 20% performance improvement on the DOTA dataset.
zh
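Fast AutoAugment 搜索得到的增强策略通常可表示为一组"(操作, 概率, 幅度)"三元组。下面基于 PIL 给出应用此类策略的假设性示意,操作集与参数仅为说明,并非论文在 DOTA 上实际搜索到的策略:

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# 一条假设的增强策略:每个子操作为 (操作名, 应用概率, 幅度)
POLICY = [("Brightness", 0.8, 1.3), ("Rotate", 0.5, 10), ("Equalize", 0.4, None)]

OPS = {
    "Brightness": lambda img, m: ImageEnhance.Brightness(img).enhance(m),
    "Rotate":     lambda img, m: img.rotate(m),
    "Equalize":   lambda img, m: ImageOps.equalize(img),
}

def apply_policy(img, policy):
    # 按概率依次应用各子操作;Fast AutoAugment 通过密度匹配搜索此类策略
    for name, prob, mag in policy:
        if random.random() < prob:
            img = OPS[name](img, mag)
    return img

img = Image.new("RGB", (224, 224), "gray")
aug = apply_policy(img, POLICY)
```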
[CV-15] Segment Concealed Objects with Incomplete Supervision
【速读】:该论文旨在解决不完全监督下的隐蔽目标分割(Incompletely-Supervised Concealed Object Segmentation, ISCOS)问题,该任务面临两大挑战:一是不完全标注数据提供的监督信息有限,二是隐蔽目标与背景之间的内在相似性导致区分困难。解决方案的关键在于提出一种统一的均值教师框架SEE,该框架利用视觉基础模型“Segment Anything Model (SAM)”生成伪标签,并通过一系列策略优化伪标签的生成、存储与监督,以提升模型训练的鲁棒性;同时设计了一种混合粒度特征分组模块,通过聚类相似特征促进分割一致性,从而实现更完整的单目标和多目标分割效果。
链接: https://arxiv.org/abs/2506.08955
作者: Chunming He,Kai Li,Yachao Zhang,Ziyun Yang,Youwei Pang,Longxiang Tang,Chengyu Fang,Yulun Zhang,Linghe Kong,Xiu Li,Sina Farsiu
机构: Duke University (杜克大学); Meta (元); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Dalian University of Technology (大连理工大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: IEEE TPAMI
点击查看摘要
Abstract:Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi-annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean-teacher framework, SEE, that leverages the vision foundation model, "Segment Anything Model (SAM)", to generate pseudo-labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low-quality segmentation masks, we introduce a series of strategies for pseudo-label generation, storage, and supervision. These strategies aim to produce informative pseudo-labels, store the best pseudo-labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid-granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single-object and multiple-object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state-of-the-art performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing the performance of existing models.
zh
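均值教师框架中,教师权重是学生权重的指数滑动平均(EMA),教师输出的粗掩码再作为 SAM 的提示生成伪标签。下面给出 EMA 更新这一通用组件的示意(网络以极简卷积层代替,为假设性实现):

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.99):
    # 教师参数 <- momentum * 教师参数 + (1 - momentum) * 学生参数
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

student = nn.Conv2d(3, 1, 3, padding=1)           # 以极简分割头代替真实网络
teacher = copy.deepcopy(student).requires_grad_(False)

for step in range(100):
    # ...此处省略学生模型的监督训练与伪标签训练...
    ema_update(teacher, student, momentum=0.99)   # 每个训练步之后更新教师
```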
[CV-16] Cross-Spectral Body Recognition with Side Information Embedding: Benchmarks on LLCM and Analyzing Range-Induced Occlusions on IJB-MDF
【速读】:该论文旨在解决跨光谱(可见光与红外)人体识别中的匹配问题,即在可见光(VIS)和红外(IR)图像之间进行有效的人体特征对齐与识别。其解决方案的关键在于引入了侧信息嵌入(Side Information Embedding, SIE),并通过编码相机信息而非显式地包含光谱域信息,实现了在LLCM数据集上的最先进性能。此外,研究还针对可见光-红外重识别中被忽视的遮挡问题,利用IARPA Janus Benchmark Multi-Domain Face(IJB-MDF)数据集分析了距离引起的遮挡影响,以推动该领域的进一步研究。
链接: https://arxiv.org/abs/2506.08953
作者: Anirudh Nanduri,Siyuan Huang,Rama Chellappa
机构: University of Maryland (马里兰大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of biometric tasks, including face and body recognition. In this work, we adapt a ViT model pretrained on visible (VIS) imagery to the challenging problem of cross-spectral body recognition, which involves matching images captured in the visible and infrared (IR) domains. Recent ViT architectures have explored incorporating additional embeddings beyond traditional positional embeddings. Building on this idea, we integrate Side Information Embedding (SIE) and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only camera information - without explicitly incorporating domain information - achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in visible-infrared (VI) Re-ID remain largely underexplored - primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.
zh
[CV-17] SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation
【速读】:该论文旨在解决在医疗影像领域中,如何高效利用大规模未标记数据并减少对高质量像素级标注的依赖这一关键问题。其解决方案的关键在于提出一种名为SSS(Semi-Supervised SAM-2)的新方法,该方法借助SAM-2的强大学习能力,从未标记医学图像中挖掘潜在知识,从而增强全监督模型的特征支持。核心创新包括引入判别特征增强(DFE)机制以探索多视图数据增强策略带来的特征差异,并结合物理约束与滑动窗口(PCSW)机制生成输入提示,以满足SAM-2对额外提示的需求。
链接: https://arxiv.org/abs/2506.08949
作者: Hongjie Zhu,Xiwei Liu,Rundong Xue,Zeyu Zhang,Yong Xu,Daji Ergu,Ying Cai,Yang Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In the era of information explosion, efficiently leveraging large-scale unlabeled data while minimizing the reliance on high-quality pixel-level annotations remains a critical challenge in the field of medical imaging. Semi-supervised learning (SSL) enhances the utilization of unlabeled data by facilitating knowledge transfer, significantly improving the performance of fully supervised models and emerging as a highly promising research direction in medical image analysis. Inspired by the ability of Vision Foundation Models (e.g., SAM-2) to provide rich prior knowledge, we propose SSS (Semi-Supervised SAM-2), a novel approach that leverages SAM-2’s robust feature extraction capabilities to uncover latent knowledge in unlabeled medical images, thus effectively enhancing feature support for fully supervised medical image segmentation. Specifically, building upon the single-stream “weak-to-strong” consistency regularization framework, this paper introduces a Discriminative Feature Enhancement (DFE) mechanism to further explore the feature discrepancies introduced by various data augmentation strategies across multiple views. By leveraging feature similarity and dissimilarity across multi-scale augmentation techniques, the method reconstructs and models the features, thereby effectively optimizing the salient regions. Furthermore, a prompt generator is developed that integrates Physical Constraints with a Sliding Window (PCSW) mechanism to generate input prompts for unlabeled data, fulfilling SAM-2’s requirement for additional prompts. Extensive experiments demonstrate the superiority of the proposed method for semi-supervised medical image segmentation on two multi-label datasets, i.e., ACDC and BHSD. Notably, SSS achieves an average Dice score of 53.15 on BHSD, surpassing the previous state-of-the-art method by +3.65 Dice. Code will be available at this https URL.
zh
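"弱到强"一致性正则化要求弱增强视图的高置信度预测作为伪标签,监督同一图像强增强视图的预测。下面是该损失的假设性示意实现,置信度阈值 0.95 仅为常见设定:

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(logits_weak, logits_strong, conf_thresh=0.95):
    """弱增强视图的高置信度伪标签逐像素监督强增强视图。"""
    probs = logits_weak.softmax(dim=1).detach()       # 对伪标签停止梯度
    conf, pseudo = probs.max(dim=1)                   # 置信度与伪标签
    mask = (conf >= conf_thresh).float()              # 仅保留高置信度像素
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

lw = torch.randn(2, 4, 64, 64)    # 弱增强视图的 logits (B, C, H, W)
ls = torch.randn(2, 4, 64, 64)    # 强增强视图的 logits
print(weak_to_strong_loss(lw, ls))
```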
[CV-18] What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities ICML2025
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)虚拟代理评估中存在的问题,包括任务复杂度不可控、人工标注耗时且场景有限以及缺乏多维评估体系。其解决方案的关键在于提出OmniBench,一个自生成、跨平台、基于图结构的基准测试框架,并通过子任务组合实现可控复杂度的任务合成;同时引入OmniEval,一个多维评估框架,涵盖子任务级评估、基于图的指标及跨10种能力的全面测试,从而更全面、高效地评估虚拟代理的性能。
链接: https://arxiv.org/abs/2506.08933
作者: Wendong Bu,Yang Wu,Qifan Yu,Minghe Gao,Bingchen Miao,Zhenkui Zhang,Kaihang Pan,Yunfei Li,Mengze Li,Wei Ji,Juncheng Li,Siliang Tang,Yueting Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025 (Oral)
点击查看摘要
Abstract:As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it can more efficiently guide agents compared to manually annotated data. We conduct multidimensional evaluations for various open-source and closed-source models, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at this https URL.
zh
[CV-19] Inherently Faithful Attention Maps for Vision Transformers
【速读】:该论文旨在解决在存在干扰性上下文(spurious correlations)和分布外背景(out-of-distribution backgrounds)的情况下,对象感知易受偏差影响的问题。其解决方案的关键在于提出一种基于注意力机制的两阶段框架:第一阶段处理完整图像以发现对象部分并识别任务相关区域,第二阶段通过输入注意力掩码限制感受野至这些区域,从而实现聚焦分析并过滤潜在噪声信息。两个阶段联合训练,使第二阶段能够优化第一阶段的结果。
链接: https://arxiv.org/abs/2506.08915
作者: Ananthu Aniraj,Cassio F. Dantas,Dino Ienco,Diego Marcos
机构: Inria(法国国家信息与自动化研究所); Inrae(法国农业科学研究院); University of Montpellier(蒙彼利埃大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
zh
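第二阶段的输入注意力掩码在实现上通常等价于把被屏蔽 token 的注意力分数置为负无穷,使其对输出不再有贡献。下面是单头自注意力上的假设性示意(非论文官方实现):

```python
import torch

def masked_self_attention(x, keep_mask):
    """对 (B, N, D) 的 token 做自注意力,keep_mask (B, N) 中 0 表示屏蔽。"""
    q = k = v = x
    scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5     # (B, N, N)
    scores = scores.masked_fill(keep_mask[:, None, :] == 0, float("-inf"))
    attn = scores.softmax(dim=-1)
    return attn @ v        # 被屏蔽的 token 不再影响任何输出

x = torch.randn(1, 8, 32)
mask = torch.tensor([[1, 1, 1, 0, 0, 1, 0, 1]])
out = masked_self_attention(x, mask)
```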
[CV-20] SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping
【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)模型在生成过程中因高频成分或后期步骤导致的推理延迟问题,以及这些步骤中潜在的计算冗余问题。其解决方案的关键在于识别并缓解两种主要的效率瓶颈:步骤冗余和无条件分支冗余。针对步骤冗余,提出了一种自动跳过策略以选择性地省略不必要的生成步骤;针对无条件分支冗余,引入了无条件分支替换技术以减少计算成本。此外,为适应不同样本的特性,提出了SkipVAR框架,利用频率信息动态选择最优加速策略,从而实现高效的自回归图像生成。
链接: https://arxiv.org/abs/2506.08908
作者: Jiajun Li(1 and 5),Yue Ma(2),Xinyu Zhang(1),Qingyan Wei(3),Songhua Liu(4 and 5),Linfeng Zhang(5) ((1) University of Electronic Science and Technology of China, (2) The Hong Kong University of Science and Technology, (3) Central South University, (4) National University of Singapore, (5) Shanghai Jiaotong University)
机构: University of Electronic Science and Technology of China (电子科技大学); The Hong Kong University of Science and Technology (香港科技大学); Central South University (中南大学); National University of Singapore (新加坡国立大学); Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent studies on Visual Autoregressive (VAR) models have highlighted that high-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. However, the underlying computational redundancy involved in these steps has yet to be thoroughly investigated. In this paper, we conduct an in-depth analysis of the VAR inference process and identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. To address step redundancy, we propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency. For unconditional branch redundancy, we observe that the information gap between the conditional and unconditional branches is minimal. Leveraging this insight, we introduce unconditional branch replacement, a technique that bypasses the unconditional branch to reduce computational cost. Notably, we observe that the effectiveness of acceleration strategies varies significantly across different samples. Motivated by this, we propose SkipVAR, a sample-adaptive framework that leverages frequency information to dynamically select the most suitable acceleration strategy for each instance. To evaluate the role of high-frequency information, we introduce high-variation benchmark datasets that test model sensitivity to fine details. Extensive experiments show SkipVAR achieves over 0.88 average SSIM with up to 1.81x overall acceleration and 2.62x speedup on the GenEval benchmark, maintaining model quality. These results confirm the effectiveness of frequency-aware, training-free adaptive acceleration for scalable autoregressive image generation. Our code is available at this https URL and has been publicly released.
zh
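自动跳步策略可以抽象为:在逐尺度生成循环中按样本打分,跳过判定为冗余的步骤。下面的示意中 DummyVAR 与 skip_score 均为占位(真实方法依据频率信息打分),仅演示控制流:

```python
import torch

class DummyVAR:
    """占位模型:用随机张量模拟一步尺度级自回归生成(仅作接口示意)。"""
    def step(self, tokens, s):
        new = torch.randn(1, 2 ** s)                  # 第 s 个尺度新增的 token
        return new if tokens is None else torch.cat([tokens, new], dim=1)

def skip_score(tokens, s):
    # 占位打分:真实方法依据样本的频率信息估计该步贡献,此处以随机数代替
    return torch.rand(()).item()

def generate_with_skipping(model, steps=8, thresh=0.2):
    tokens = None
    for s in range(steps):
        if tokens is not None and skip_score(tokens, s) < thresh:
            continue                                  # 判定为冗余步骤:跳过
        tokens = model.step(tokens, s)
    return tokens

out = generate_with_skipping(DummyVAR())
```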
[CV-21] Hyperbolic Dual Feature Augmentation for Open-Environment
【速读】:该论文旨在解决开放环境中特征增强方法的局限性,传统方法通常仅针对已知类别(seen classes)进行特征生成,而无法处理未知类别(unseen classes)。其解决方案的关键在于提出一种双特征增强方法,在双曲空间中同时对已知和未知类别进行特征增强。该方法通过引入神经微分方程模块结合元学习来估计特征分布,引入正则化项以保持数据的潜在层次结构,并推导出双曲增强损失的上界,从而实现对已知和未知类别的无限增强训练。
链接: https://arxiv.org/abs/2506.08906
作者: Peilin Yu,Yuwei Wu,Zhi Gao,Xiaomeng Fan,Shuo Yang,Yunde Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2207.03824 , arXiv:2304.11855 by other authors
点击查看摘要
Abstract:Feature augmentation generates novel samples in the feature space, providing an effective way to enhance the generalization ability of learning algorithms with hyperbolic geometry. Most hyperbolic feature augmentation is confined to closed-environment, assuming the number of classes is fixed (i.e., seen classes) and generating features only for these classes. In this paper, we propose a hyperbolic dual feature augmentation method for open-environment, which augments features for both seen and unseen classes in the hyperbolic space. To obtain a more precise approximation of the real data distribution for efficient training, (1) we adopt a neural ordinary differential equation module, enhanced by meta-learning, estimating the feature distributions of both seen and unseen classes; (2) we then introduce a regularizer to preserve the latent hierarchical structures of data in the hyperbolic space; (3) we also derive an upper bound for the hyperbolic dual augmentation loss, allowing us to train a hyperbolic model using infinite augmentations for seen and unseen classes. Extensive experiments on five open-environment tasks: class-incremental learning, few-shot open-set recognition, few-shot learning, zero-shot learning, and general image classification, demonstrate that our method effectively enhances the performance of hyperbolic algorithms in open-environment.
zh
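在 Poincaré 球模型中做特征增强时,常见做法是先在切空间加扰动,再用原点处的指数映射把结果映回球内。下面给出这一通用操作的示意实现(与论文的具体增强模块无关,曲率 c=1 为假设):

```python
import torch

def expmap0(v, c=1.0, eps=1e-8):
    """Poincaré 球原点处的指数映射:tanh(sqrt(c)||v||) * v / (sqrt(c)||v||)。"""
    norm = v.norm(dim=-1, keepdim=True).clamp(min=eps)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def hyperbolic_augment(feat_tangent, sigma=0.1):
    # 在切空间加高斯扰动,再映射回 Poincaré 球内(示意性增强)
    noise = sigma * torch.randn_like(feat_tangent)
    return expmap0(feat_tangent + noise)

x = torch.randn(16, 64)                # 切空间中的特征
aug = hyperbolic_augment(x)
assert (aug.norm(dim=-1) < 1).all()    # 增强后的点仍位于单位球内
```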
[CV-22] MIRAG E: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)模型在眼科图像分析中依赖大量标注数据且在独立未见数据上表现不佳的问题。其关键解决方案是提出MIRAGE,一种新型的多模态基础模型(Foundation Model, FM),用于光学相干断层扫描(OCT)和扫描激光眼底镜(SLO)图像的分析,并构建了一个包含OCT/SLO分类与分割任务的新评估基准。实验结果表明,MIRAGE在两类任务中均表现出优越性,证明其作为视网膜OCT图像分析稳健AI系统基础的潜力。
链接: https://arxiv.org/abs/2506.08900
作者: José Morano,Botond Fazekas,Emese Sükei,Ronald Fecso,Taha Emre,Markus Gumpinger,Georg Faustmann,Marzieh Oghbaie,Ursula Schmidt-Erfurth,Hrvoje Bogunović
机构: Medical University of Vienna (维也纳医科大学); Institute of Artificial Intelligence (人工智能研究所); Center for Medical Data Science (医学数据科学中心); Christian Doppler Laboratory for Artificial Intelligence in Retina (视网膜人工智能克里斯蒂安·多普勒实验室); Comprehensive Center for AI in Medicine (人工智能医学综合中心); OPTIMA Lab (OPTIMA实验室); Department of Ophthalmology (眼科部门)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT). However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data. Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges. Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality. In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images. Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis. Both MIRAGE and the evaluation benchmark are publicly available: this https URL.
zh
[CV-23] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos
【速读】:该论文旨在解决传统湿实验(wetlab)手术训练中依赖人工评估导致的效率低、耗时且主观性较强的问题。其解决方案的关键在于引入WetCat数据集,这是首个专为自动化技能评估而精心构建的湿实验白内障手术视频数据集,包含高分辨率手术录像、全面的阶段标注及关键解剖结构的语义分割,旨在支持标准化手术技能评估框架下的可解释性AI评估工具开发。
链接: https://arxiv.org/abs/2506.08896
作者: Negin Ghamsarian,Raphael Sznitman,Klaus Schoeffmann,Jens Kowal
机构: ARTORG Center, University of Bern (ARTORG中心,伯尔尼大学); University of Klagenfurt (克恩滕州立大学); University of Bern (伯尔尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures
点击查看摘要
Abstract:To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse this https URL.
zh
[CV-24] Product of Experts for Visual Generation
【速读】:该论文试图解决如何在图像和视频生成任务中有效整合来自不同来源的多样化知识的问题,这些来源包括视觉生成模型、视觉语言模型以及包含人工设计知识的图形引擎和物理模拟器。解决方案的关键在于提出了一种无需训练的Product of Experts (PoE)框架,该框架通过Annealed Importance Sampling (AIS)从异构模型的联合分布中进行推理时的知识组合,从而实现更优的可控性和灵活的用户交互界面。
链接: https://arxiv.org/abs/2506.08894
作者: Yunzhi Zhang,Carson Murtuza-Lanier,Zizhang Li,Yilun Du,Jiajun Wu
机构: Stanford University (斯坦福大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
点击查看摘要
Abstract:Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources – including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators – remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
zh
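Product of Experts 从多个专家分布的乘积中采样,等价于对各专家的对数密度(得分)求和后采样;退火重要性采样则沿温度路径逐渐"打开"该乘积分布。下面用两个一维高斯专家和朗之万动力学给出一个与具体生成模型无关的最小示意:

```python
import numpy as np

def grad_log_gauss(x, mu, sigma):
    return -(x - mu) / sigma ** 2          # 高斯专家的得分函数(对数密度梯度)

def poe_langevin(experts, n_steps=2000, step=1e-2, seed=0):
    """沿 beta: 0 -> 1 的退火路径,用朗之万动力学从专家乘积分布采样。"""
    rng = np.random.default_rng(seed)
    x = rng.normal()
    for beta in np.linspace(0.0, 1.0, n_steps):
        g = beta * sum(grad_log_gauss(x, mu, s) for mu, s in experts)
        x = x + step * g + np.sqrt(2 * step) * rng.normal()
    return x

# 两个专家 N(-1, 1) 与 N(2, 1):其乘积仍为高斯,均值为 0.5
samples = [poe_langevin([(-1.0, 1.0), (2.0, 1.0)], seed=i) for i in range(200)]
print(np.mean(samples))    # 应接近 0.5
```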
[CV-25] DiscoVLA: Discrepancy Reduction in Vision Language and Alignment for Parameter-Efficient Video-Text Retrieval CVPR2025
【速读】:该论文旨在解决将图像-文本预训练模型CLIP(Contrastive Language-Image Pretraining)高效适配到视频-文本检索任务中的问题,其核心挑战在于从图像级到视频级的三个关键差异:视觉、语言和对齐。现有方法主要关注视觉层面的差异,而忽视了语言和对齐方面的不足。论文提出的解决方案是DiscoVLA,其关键在于同时缓解这三个差异:通过图像-视频特征融合整合图像级与视频级特征以解决视觉和语言差异,生成伪图像描述以学习细粒度的图像级对齐,以及通过图像到视频对齐蒸馏利用图像级对齐知识增强视频级对齐。
链接: https://arxiv.org/abs/2506.08887
作者: Leqi Shen,Guoqiang Gong,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding
机构: JD.com(京东); Tsinghua University (清华大学); GRG Banking Equipment Co., Ltd.(广电运通); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
点击查看摘要
Abstract:The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at this https URL.
zh
[CV-26] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
【速读】:该论文旨在解决从非校准视频流中实时重建动态三维场景的问题,这一问题在多个实际应用中具有重要意义。现有方法难以同时应对三个关键挑战:1)实时处理非校准输入,2)准确建模动态场景演化,3)保持长期稳定性和计算效率。其解决方案的关键在于提出StreamSplat,这是一个完全前馈的框架,能够以在线方式将任意长度的非校准视频流转换为动态三维高斯点云(3DGS)表示,并能够从时间局部观测中恢复场景动态。该方法的核心创新包括静态编码器中的概率采样机制用于3DGS位置预测,以及动态解码器中的双向形变场,从而实现鲁棒且高效的动态建模。
链接: https://arxiv.org/abs/2506.08862
作者: Zike Wu,Qi Yan,Xuanyu Yi,Lele Wang,Renjie Liao
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute for AI (向量人工智能研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at this https URL.
zh
[CV-27] Spatial Transcriptomics Expression Prediction from Histopathology Based on Cross-Modal Mask Reconstruction and Contrastive Learning
【速读】:该论文旨在解决大规模空间转录组数据获取成本高、难以获得的问题,通过开发一种基于对比学习的深度学习方法,从全切片图像中预测空间分辨的基因表达。该解决方案的关键在于利用对比学习框架,有效提升对高表达基因、高变基因和标志物基因预测的Pearson相关系数,并保持基因间相关性,从而在样本量有限的数据集中仍具有良好的适用性。
链接: https://arxiv.org/abs/2506.08854
作者: Junzhuo Liu,Markus Eckstein,Zhixiang Wang,Friedrich Feuerhake,Dorit Merhof
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures
点击查看摘要
Abstract:Spatial transcriptomics is a technology that captures gene expression levels at different spatial locations, widely used in tumor microenvironment analysis and molecular profiling of histopathology, providing valuable insights into resolving gene expression and clinical diagnosis of cancer. Due to the high cost of data acquisition, large-scale spatial transcriptomics data remain challenging to obtain. In this study, we develop a contrastive learning-based deep learning method to predict spatially resolved gene expression from whole-slide images. Evaluation across six different disease datasets demonstrates that, compared to existing studies, our method improves Pearson Correlation Coefficient (PCC) in the prediction of highly expressed genes, highly variable genes, and marker genes by 6.27%, 6.11%, and 11.26% respectively. Further analysis indicates that our method preserves gene-gene correlations and applies to datasets with limited samples. Additionally, our method exhibits potential in cancer tissue localization based on biomarker expression.
zh
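此类跨模态对比学习通常让病理 patch 的视觉嵌入与对应 spot 的基因表达嵌入在共享空间中对齐,可用对称 InfoNCE(CLIP 式)损失实现。下面是该损失的假设性示意:

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb, gene_emb, tau=0.07):
    """图像 patch 嵌入与基因表达嵌入的对称 InfoNCE 对齐损失。"""
    img_emb = F.normalize(img_emb, dim=-1)
    gene_emb = F.normalize(gene_emb, dim=-1)
    logits = img_emb @ gene_emb.t() / tau            # (B, B) 相似度矩阵
    targets = torch.arange(img_emb.size(0))          # 对角线为正样本对
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

img = torch.randn(32, 256)      # 病理 patch 的视觉嵌入
gene = torch.randn(32, 256)     # 对应 spot 基因表达的嵌入
print(infonce_loss(img, gene))
```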
[CV-28] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
【速读】:该论文旨在解决医学超声图像分析中由于手动勾画感兴趣区域耗时且易受个体差异影响而带来的挑战,同时提升视觉-语言基础模型在医学影像领域的性能。其解决方案的关键在于通过引入大语言模型作为文本优化器,并结合专门设计的领域适应策略和任务驱动的头部结构,对视觉-语言基础模型进行微调,从而实现跨领域适应性增强。
链接: https://arxiv.org/abs/2506.08849
作者: Jingguo Qu,Xinyang Han,Tonghuan Xiao,Jia Ai,Juan Wu,Tong Zhao,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore the fine-tuning pipeline for vision-language foundation models by utilizing a large language model as a text refiner with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method can effectively improve the performance of vision-language foundation models for ultrasound image analysis, and outperform the existing state-of-the-art vision-language and pure foundation models. The source code of this study is available on GitHub at this https URL.
zh
[CV-29] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought
【速读】:该论文旨在解决视频内容理解中对细粒度时空细节捕捉不足的问题,尤其是在大规模视觉-语言模型(VLMs)中表现出来的局限性。其解决方案的关键在于引入Video-CoT数据集,该数据集采用思维链(Chain-of-Thought, CoT)方法,包含192,000个精细的时空问答对和23,000个高质量的CoT标注样本,从而为评估视频理解中的时空推理能力提供了坚实的基础。
链接: https://arxiv.org/abs/2506.08817
作者: Shuyi Zhang,Xiaoshuai Hao,Yingbo Tang,Lingfeng Zhang,Pengwei Wang,Zhongyuan Wang,Hongxuan Ma,Shanghang Zhang
机构: Institute of Automation, CAS(中国科学院自动化研究所); School of Artificial Intelligence, UCAS(中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (BAAI)(北京人工智能研究院); Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: this https URL.
zh
[CV-30] HiSin: Efficient High-Resolution Sinogram Inpainting via Resolution-Guided Progressive Inference
【速读】:该论文旨在解决高分辨率sinogram补全在计算机断层扫描(computed tomography, CT)重建中的问题,因为缺失的高频投影会导致可见伪影和诊断错误。其解决方案的关键在于提出HiSin框架,该框架基于扩散模型(diffusion models),通过分辨率引导的渐进推理实现高效的sinogram补全。该方法在低分辨率下逐步提取全局结构,并将高分辨率推理限制在小块区域,从而实现内存高效的补全,同时结合频域感知的块跳过和结构自适应步长分配以减少冗余计算。
链接: https://arxiv.org/abs/2506.08809
作者: Jiaze E,Srutarshi Banerjee,Tekin Bicer,Guannan Wang,Yanfu Zhang,Bin Ren
机构: William & Mary(威廉与玛丽学院); Argonne National Laboratory(阿贡国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:High-resolution sinogram inpainting is essential for computed tomography reconstruction, as missing high-frequency projections can lead to visible artifacts and diagnostic errors. Diffusion models are well-suited for this task due to their robustness and detail-preserving capabilities, but their application to high-resolution inputs is limited by excessive memory and computational demands. To address this limitation, we propose HiSin, a novel diffusion based framework for efficient sinogram inpainting via resolution-guided progressive inference. It progressively extracts global structure at low resolution and defers high-resolution inference to small patches, enabling memory-efficient inpainting. It further incorporates frequency-aware patch skipping and structure-adaptive step allocation to reduce redundant computation. Experimental results show that HiSin reduces peak memory usage by up to 31.25% and inference time by up to 18.15%, and maintains inpainting accuracy across datasets, resolutions, and mask conditions.
zh
[CV-31] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation
【速读】:该论文旨在解决人类-物体交互(human-object interaction, HOI)视频生成中的关键问题,包括对精心筛选运动数据的依赖、对新物体/场景泛化能力有限以及可访问性受限。其解决方案的关键在于提出一种弱条件多模态驱动框架——HunyuanVideo-HOMA,通过稀疏解耦的运动引导增强可控性并降低对精确输入的依赖,同时将外观和运动信号编码到多模态扩散变换器(MMDiT)的双输入空间中,在共享上下文空间内融合以生成时间一致且物理合理的交互。
链接: https://arxiv.org/abs/2506.08797
作者: Ziyao Huang,Zixiang Zhou,Juan Cao,Yifeng Ma,Yi Chen,Zejing Rao,Zhiyong Xu,Hongmei Wang,Qin Lin,Yuan Zhou,Qinglin Lu,Fan Tang
机构: Tencent Hunyuan(腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To address key limitations in human-object interaction (HOI) video generation – specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility – we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization. Extensive experiments confirm state-of-the-art performance in interaction naturalness and generalization under weak supervision. Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface. The project page is at this https URL.
zh
[CV-32] Flow Diverse and Efficient: Learning Momentum Flow Matching via Stochastic Velocity Field Sampling
【速读】:该论文试图解决生成式 AI (Generative AI) 中基于流的扩散模型在直路径采样过程中存在的多样性不足和多尺度噪声建模能力有限的问题。其解决方案的关键在于提出一种名为 Discretized-RF 的新家族的修正流(也称为动量流模型),该方法将直线路径离散化为一系列可变速度场子路径(即“动量场”),并通过在子路径的速度上引入噪声来改变其方向,从而扩展搜索空间,提升多样性和多尺度噪声建模能力。
链接: https://arxiv.org/abs/2506.08796
作者: Zhiyuan Ma,Ruixun Liu,Sixian Liu,Jianjun Li,Bowen Zhou
机构: Tsinghua University (清华大学); Xi’an Jiaotong University (西安交通大学); University of California, Berkeley (加州大学伯克利分校); Huazhong University of Science and Technology (华中科技大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, the rectified flow (RF) has emerged as the new state-of-the-art among flow-based diffusion models due to its high efficiency advantage in straight path sampling, especially with the amazing images generated by a series of RF models such as Flux 1.0 and SD 3.0. Although a straight-line connection between the noisy and natural data distributions is intuitive, fast, and easy to optimize, it still inevitably leads to: 1) Diversity concerns, which arise since straight-line paths only cover a fairly restricted sampling space. 2) Multi-scale noise modeling concerns, since the straight line flow only needs to optimize the constant velocity field $\bm{v}$ between the two distributions $\bm{\pi}_0$ and $\bm{\pi}_1$. In this work, we present Discretized-RF, a new family of rectified flow (also called momentum flow models since they refer to the previous velocity component and the random velocity component in each diffusion step), which discretizes the straight path into a series of variable velocity field sub-paths (namely "momentum fields") to expand the search space, especially when close to the distribution $p_\text{noise}$. Different from the previous case where noise is directly superimposed on $\bm{x}$, we introduce noise on the velocity $\bm{v}$ of the sub-path to change its direction in order to improve the diversity and multi-scale noise modeling abilities. Experimental results on several representative datasets demonstrate that learning momentum flow matching by sampling random velocity fields will produce trajectories that are both diverse and efficient, and can consistently generate high-quality and diverse results. Code is available at this https URL.
zh
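动量流的采样循环可以抽象为:在每个子路径上对模型预测的速度叠加一个随机分量后再做 Euler 积分。下面是接口层面的示意(model 为占位、噪声随 t 衰减的调度均为假设,并非官方实现):

```python
import torch

def sample_momentum_flow(model, x0, n_steps=50, sigma=0.2):
    """沿带随机速度分量的子路径做 Euler 积分(示意)。"""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        v = model(x, t)                                     # 确定性速度分量
        v = v + sigma * (1 - i * dt) * torch.randn_like(v)  # 随机速度分量(假设调度)
        x = x + dt * v
    return x

# 用恒定速度场的占位模型演示接口
model = lambda x, t: torch.ones_like(x)
x1 = sample_momentum_flow(model, torch.zeros(4, 2))
```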
[CV-33] A PDE-Based Image Dehazing Method via Atmospheric Scattering Theory
【速读】:该论文旨在解决单图像去雾问题,通过提出一种新的偏微分方程(PDE)框架来实现。其解决方案的关键在于将大气散射模型与非局部正则化和暗通道先验相结合,构建改进的PDE模型,并引入基于透射图的自适应正则化参数,以增强去雾效果和算法稳定性。
链接: https://arxiv.org/abs/2506.08793
作者: Zhuoran Zheng
机构: Shenzhen Campus, Sun Yat-sen University (深圳校区,中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: report
点击查看摘要
Abstract:This paper presents a novel partial differential equation (PDE) framework for single-image dehazing. By integrating the atmospheric scattering model with nonlocal regularization and dark channel prior, we propose the improved PDE
$$-\operatorname{div}\left(D(\nabla u)\,\nabla u\right) + \lambda(t)\,G(u) = \Phi(I,t,A),$$
where $D(\nabla u) = (|\nabla u| + \epsilon)^{-1}$ is the edge-preserving diffusion coefficient, $G(u)$ is the Gaussian convolution operator, and $\lambda(t)$ is the adaptive regularization parameter based on the transmission map $t$. We prove the existence and uniqueness of weak solutions in $H_0^1(\Omega)$ using the Lax-Milgram theorem, and implement an efficient fixed-point iteration scheme accelerated by PyTorch GPU computation. The experimental results demonstrate that this method is a promising dehazing solution that can be generalized to the deep model paradigm.
zh
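摘要中给出的 PDE 可用有限差分加显式固定点迭代数值求解。下面是单通道 numpy 的简化示意:右端项 Φ 取大气散射模型的朴素反解、λ(t) 的调度与步长均为假设,仅演示迭代框架:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dehaze_pde(I, t, A, n_iter=200, lam0=0.1, eps=1e-2, step=0.05):
    """对 -div(D(∇u)∇u) + λ(t)G(u) = Φ(I,t,A) 做显式固定点迭代(示意)。"""
    u = I.copy()
    phi = (I - A) / np.clip(t, 0.1, 1.0) + A        # 右端项:散射模型的朴素反解(假设)
    lam = lam0 * (1.0 - t)                           # 基于透射图的自适应正则参数(假设)
    for _ in range(n_iter):
        uy, ux = np.gradient(u)
        D = 1.0 / (np.sqrt(ux ** 2 + uy ** 2) + eps)     # 保边扩散系数
        div = np.gradient(D * uy, axis=0) + np.gradient(D * ux, axis=1)
        residual = -div + lam * gaussian_filter(u, sigma=1.0) - phi  # LHS - RHS
        u = u - step * residual                      # 沿残差负方向更新
    return np.clip(u, 0.0, 1.0)

rng = np.random.default_rng(0)
I = rng.random((64, 64))             # 含雾图像(灰度,取值 [0,1])
t = np.full_like(I, 0.6)             # 透射图
u = dehaze_pde(I, t, A=0.9)
```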
[CV-34] HomographyAD: Deep Anomaly Detection Using Self Homography Learning
【速读】:该论文旨在解决现有异常检测(Anomaly Detection, AD)方法在完全对齐数据集上表现良好,但在实际工业环境中由于数据不对齐而性能下降的问题。其解决方案的关键在于提出HomographyAD,一种基于ImageNet预训练网络的深度异常检测方法,通过使用深度单应性估计进行输入前景对齐,并利用自单应性学习微调模型以从正常样本中学习额外的形状信息,最终通过测试样本特征与提取的正常特征分布的距离进行异常检测。
链接: https://arxiv.org/abs/2506.08784
作者: Jongyub Seok,Chanjin Kang
机构: AIRS Company, Hyundai Motor Group, Seoul, Korea; LG Energy Solution; Dfinite Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Anomaly detection (AD) is a task that distinguishes normal and abnormal data, which is important for applying automation technologies in manufacturing facilities. For the MVTec dataset, a representative AD dataset for industrial environments, many recent works have shown remarkable performance. However, existing anomaly detection works have a limitation of showing good performance for fully-aligned datasets only, unlike real-world industrial environments. To solve this limitation, we propose HomographyAD, a novel deep anomaly detection methodology based on the ImageNet-pretrained network, which is specially designed for actual industrial datasets. Specifically, we first suggest input foreground alignment using the deep homography estimation method. In addition, we fine-tune the model by self homography learning to learn additional shape information from normal samples. Finally, we conduct anomaly detection based on how far the feature of a test sample lies from the distribution of the extracted normal features. By applying our proposed method to various existing AD approaches, we show performance enhancement through extensive experiments.
zh
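输入前景对齐的一种常见做法是估计测试图与正常模板之间的单应性并据此扭正。论文采用深度单应性估计,这里用 OpenCV 特征匹配给出一个功能等价的替代示意(假设输入为灰度 uint8 图像):

```python
import cv2
import numpy as np

def align_by_homography(test_img, template_img):
    """ORB 特征匹配 + RANSAC 估计单应性,将测试图对齐到正常模板(示意)。"""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(test_img, None)
    k2, d2 = orb.detectAndCompute(template_img, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:100]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # 需至少 4 对匹配点
    h, w = template_img.shape[:2]
    return cv2.warpPerspective(test_img, H, (w, h))        # 对齐后再做异常检测
```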
[CV-35] Landsat-Bench: Datasets and Benchmarks for Landsat Foundation Models
【速读】:该论文旨在解决Landsat数据缺乏基准测试的问题,从而限制了基于Landsat的地理空间基础模型(Geospatial Foundation Models, GFM)的发展。其解决方案的关键在于引入Landsat-Bench,这是一个由三个基准组成的套件,基于现有的遥感数据集(EuroSAT-L、BigEarthNet-L和LC100-L)进行适应性调整,并建立了跨常见架构和Landsat基础模型的基线与标准化评估方法。此外,研究还证明了在SSL4EO-L数据集上预训练的GFM在下游任务中能够提取比ImageNet更好的表示,表现出更高的分类性能。
链接: https://arxiv.org/abs/2506.08780
作者: Isaac Corley,Lakshay Sharma,Ruth Crasto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The Landsat program offers over 50 years of globally consistent Earth imagery. However, the lack of benchmarks for this data constrains progress towards Landsat-based Geospatial Foundation Models (GFM). In this paper, we introduce Landsat-Bench, a suite of three benchmarks with Landsat imagery that adapt from existing remote sensing datasets – EuroSAT-L, BigEarthNet-L, and LC100-L. We establish baseline and standardized evaluation methods across both common architectures and Landsat foundation models pretrained on the SSL4EO-L dataset. Notably, we provide evidence that SSL4EO-L pretrained GFMs extract better representations for downstream tasks in comparison to ImageNet, including performance gains of +4% OA and +5.1% mAP on EuroSAT-L and BigEarthNet-L.
zh
[CV-36] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting
【速读】:该论文旨在解决现有自监督学习(SSL)方法在点云预训练中依赖隐式场景表示、内存需求高以及无法有效捕捉三维几何结构的问题。其解决方案的关键在于提出Gaussian2Scene框架,该框架利用三维高斯泼溅(3DGS)的高效性和显式性进行预训练,从而减轻体积渲染的计算负担,并支持直接的三维场景重建,提升主干网络的几何理解能力。
链接: https://arxiv.org/abs/2506.08777
作者: Keyi Liu,Weidong Yang,Ben Fei,Ying He
机构: Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Self-supervised learning (SSL) for point cloud pre-training has become a cornerstone for many 3D vision tasks, enabling effective learning from large-scale unannotated data. At the scene level, existing SSL methods often incorporate volume rendering into the pre-training framework, using RGB-D images as reconstruction signals to facilitate cross-modal learning. This strategy promotes alignment between 2D and 3D modalities and enables the model to benefit from rich visual cues in the RGB-D inputs. However, these approaches are limited by their reliance on implicit scene representations and high memory demands. Furthermore, since their reconstruction objectives are applied only in 2D space, they often fail to capture underlying 3D geometric structures. To address these challenges, we propose Gaussian2Scene, a novel scene-level SSL framework that leverages the efficiency and explicit nature of 3D Gaussian Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the computational burden associated with volume rendering but also supports direct 3D scene reconstruction, thereby enhancing the geometric understanding of the backbone network. Our approach follows a progressive two-stage training strategy. In the first stage, a dual-branch masked autoencoder learns both 2D and 3D scene representations. In the second stage, we initialize training with reconstructed point clouds and further supervise learning using the geometric locations of Gaussian primitives and rendered RGB images. This process reinforces both geometric and cross-modal learning. We demonstrate the effectiveness of Gaussian2Scene across several downstream 3D object detection tasks, showing consistent improvements over existing pre-training methods.
[CV-37] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation
[Quick Read]: This paper addresses the heavy dependence of semantic segmentation for remote sensing imagery on large-scale, high-quality pixel-level annotations, which makes data acquisition costly and time-consuming. The key to its solution is to exploit the strong generalization of Vision Foundation Models (VFMs) pretrained on large, diverse datasets: the Multi-Teacher Distillation and Fusion (RS-MTDF) framework channels the semantic knowledge in VFMs to guide semi-supervised learning, effectively alleviating the distribution mismatch between labeled and unlabeled data.
Link: https://arxiv.org/abs/2506.08772
Authors: Jiayi Song, Kaiyu Li, Xiangyong Cao, Deyu Meng
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Semantic segmentation in remote sensing images is crucial for various applications, yet its performance is heavily reliant on large-scale, high-quality pixel-wise annotations, which are notoriously expensive and time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a promising alternative to mitigate this data dependency. However, existing SSS methods often struggle with the inherent distribution mismatch between limited labeled data and abundant unlabeled data, leading to suboptimal generalization. We propose that Vision Foundation Models (VFMs), pre-trained on vast and diverse datasets, possess robust generalization capabilities that can effectively bridge this distribution gap and provide strong semantic priors for SSS. Inspired by this, we introduce RS-MTDF (Multi-Teacher Distillation and Fusion), a novel framework that leverages the powerful semantic knowledge embedded in VFMs to guide semi-supervised learning in remote sensing. Specifically, RS-MTDF employs multiple frozen VFMs (e.g., DINOv2 and CLIP) as expert teachers, utilizing feature-level distillation to align student features with their robust representations. To further enhance discriminative power, the distilled knowledge is seamlessly fused into the student decoder. Extensive experiments on three challenging remote sensing datasets (ISPRS Potsdam, LoveDA, and DeepGlobe) demonstrate that RS-MTDF consistently achieves state-of-the-art performance. Notably, our method outperforms existing approaches across various label ratios on LoveDA and secures the highest IoU in the majority of semantic categories. These results underscore the efficacy of multi-teacher VFM guidance in significantly enhancing both generalization and semantic understanding for remote sensing segmentation. Ablation studies further validate the contribution of each proposed module.
[CV-38] Normalized Radon Cumulative Distribution Transforms for Invariance and Robustness in Optimal Transport Based Image Classification
[Quick Read]: This paper targets the robustness of feature extractors in image classification under non-affine image deformations, especially when the data is subject to general affine transformations, where conventional methods may not guarantee linear separability between classes. The key to its solution is the max-normalized Radon cumulative distribution transform (max-normalized R-CDT), which uses only elementary operations yet guarantees separability under arbitrary affine transformations; a mean-normalized version is further introduced to improve robustness against non-affine deformations and impulsive noise, validated both theoretically and experimentally.
Link: https://arxiv.org/abs/2506.08761
Authors: Matthias Beckmann, Robert Beinert, Jonas Bresch
Institutions: Center for Industrial Mathematics, University of Bremen, Germany; Department of Electrical and Electronic Engineering, Imperial College London, UK; Institut für Mathematik, Technische Universität Berlin, Germany
Subjects: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments:
Abstract:The Radon cumulative distribution transform (R-CDT) is an easy-to-compute feature extractor that facilitates image classification tasks, especially in the small data regime. It is closely related to the sliced Wasserstein distance and provably guarantees the linear separability of image classes that emerge from translations or scalings. In many real-world applications, like the recognition of watermarks in filigranology, however, the data is subject to general affine transformations originating from the measurement process. To overcome this issue, we recently introduced the so-called max-normalized R-CDT that only requires elementary operations and guarantees the separability under arbitrary affine transformations. The aim of this paper is to continue our study of the max-normalized R-CDT, especially with respect to its robustness against non-affine image deformations. Our sensitivity analysis shows that its separability properties are stable provided the Wasserstein-infinity distance between the samples can be controlled. Since the Wasserstein-infinity distance only allows small local image deformations, we moreover introduce a mean-normalized version of the R-CDT. In this case, robustness relates to the Wasserstein-2 distance and also covers image deformations caused by impulsive noise, for instance. Our theoretical results are supported by numerical experiments showing the effectiveness of our novel feature extractors as well as their robustness against local non-affine deformations and impulsive noise.
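As a concrete illustration of the transform underlying this line of work, here is a minimal numpy sketch of the 1-D cumulative distribution transform (CDT); for the R-CDT one would apply it to each angular projection of the Radon transform (e.g., via skimage.transform.radon). The function name and grid are illustrative, and the paper's max-/mean-normalization steps are not reproduced here.

```python
import numpy as np

def cdt(signal, reference, x):
    # Treat both 1-D arrays as densities on the grid x. The CDT of `signal`
    # w.r.t. `reference` is the map f = F_signal^{-1} o F_reference,
    # computed here from the discrete CDFs by inverse interpolation.
    Fs = np.cumsum(signal) / np.sum(signal)
    Fr = np.cumsum(reference) / np.sum(reference)
    return np.interp(Fr, Fs, x)

x = np.linspace(-5, 5, 501)
ref = np.exp(-x**2)                 # reference density
shifted = np.exp(-(x - 1.0) ** 2)   # a translated copy of the reference
# For a pure translation, the CDT differs from the identity map by the shift:
print(round(float(np.median(cdt(shifted, ref, x) - x)), 3))  # ~1.0
```

The property exercised in the example is exactly what makes the transform useful for classification: simple spatial deformations of a signal become simple (often linear) perturbations in transform space.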
[CV-39] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba
[Quick Read]: This paper addresses two issues of InceptionNeXt in image classification and downstream tasks: its reliance on parallel one-dimensional strip convolutions limits the capture of spatial dependencies and under-models local neighborhoods, while the inherent locality of convolutions hampers global context modeling. The key to its solution is to replace the one-dimensional strip convolutions with orthogonal band convolutions for more cohesive spatial modeling, and to introduce a bottleneck Mamba module for global context modeling, strengthening cross-channel information fusion and enlarging the receptive field.
Link: https://arxiv.org/abs/2506.08735
Authors: Yuhang Wang, Jun Li, Zhijian Wu, Jianhua Xu
Institutions: Nanjing Normal University; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it suffers from a limited ability to capture spatial dependencies along different dimensions and fails to fully explore spatial modeling in local neighborhoods. Besides, the inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and an enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at this https URL.
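The abstract does not spell out the exact kernel shapes, so the following PyTorch sketch only illustrates the general idea of replacing 1-D strips with a pair of orthogonal band convolutions; the kernel sizes, depthwise grouping, and additive fusion are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class OrthogonalBandConv(nn.Module):
    """Hypothetical sketch: a horizontal b x k band and a vertical k x b band
    applied in parallel, instead of 1 x k / k x 1 strips."""

    def __init__(self, channels, k=11, b=3):
        super().__init__()
        g = channels  # depthwise, as is common in InceptionNeXt-style blocks
        self.h_band = nn.Conv2d(channels, channels, (b, k),
                                padding=(b // 2, k // 2), groups=g)
        self.v_band = nn.Conv2d(channels, channels, (k, b),
                                padding=(k // 2, b // 2), groups=g)

    def forward(self, x):
        return self.h_band(x) + self.v_band(x)  # fuse the two orientations

x = torch.randn(2, 64, 56, 56)
print(OrthogonalBandConv(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```

Compared with a 1 x k strip, a b x k band sees a small vertical neighborhood as well, which is one plausible reading of the "more cohesive spatial modeling" claim.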
[CV-40] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces
[Quick Read]: This paper addresses the shortcoming of traditional abdominal aortic aneurysm (AAA) surveillance, which relies on the maximum diameter alone and cannot accurately reflect the relation between 3D AAA shape and growth, so that standardized follow-up intervals may be unsuitable. The key to its solution is an SE(3)-symmetric transformer model that predicts AAA growth directly on the vascular surface enriched with local multi-physical features; instead of parameterizing the AAA shape, this representation preserves the anatomical structure and geometric fidelity of the vascular surface, enabling more accurate personalized growth prediction and assessment of whether a patient will become eligible for elective repair within two years.
Link: https://arxiv.org/abs/2506.08729
Authors: Dieuwertje Alblas, Patryk Rygiel, Julian Suk, Kaj O. Kappe, Marieke Hofman, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the abdominal aorta. AAAs may rupture, with a survival rate of only 20%. Current clinical guidelines recommend elective surgical repair when the maximum AAA diameter exceeds 55 mm in men or 50 mm in women. Patients that do not meet these criteria are periodically monitored, with surveillance intervals based on the maximum AAA diameter. However, this diameter does not take into account the complex relation between the 3D AAA shape and its growth, making standardized intervals potentially unfit. Personalized AAA growth predictions could improve monitoring strategies. We propose to use an SE(3)-symmetric transformer model to predict AAA growth directly on the vascular model surface enriched with local, multi-physical features. In contrast to other works which have parameterized the AAA shape, this representation preserves the vascular surface’s anatomical structure and geometric fidelity. We train our model using a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24 AAA patients at irregularly sampled intervals. After training, our model predicts AAA growth to the next scan moment with a median diameter error of 1.18 mm. We further demonstrate our model’s utility to identify whether a patient will become eligible for elective repair within two years (acc = 0.93). Finally, we evaluate our model’s generalization on an external validation set consisting of 25 CTAs from 7 AAA patients from a different hospital. Our results show that local directional AAA growth prediction from the vascular surface is feasible and may contribute to personalized surveillance strategies.
[CV-41] SceneSplat: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting
[Quick Read]: This paper addresses the insufficient generalization of current 3D Gaussian Splatting (3DGS) methods for 3D scene understanding, as most existing approaches are evaluated on a limited set of scenes and viewpoints that does not fully reflect performance in complex 3D spaces. The key to its solution is the first large-scale benchmark that systematically evaluates three groups of methods (per-scene optimization-based, per-scene optimization-free, and generalizable) on 1060 scenes, together with the GaussianWorld-49K dataset, which confirms the advantages of the generalizable paradigm in relaxing scene-specific constraints, enabling fast feed-forward inference on novel scenes, and improving segmentation performance.
Link: https://arxiv.org/abs/2506.08710
Authors: Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, Theo Gevers, Martin R. Oswald, Danda Pani Paudel
Institutions: INSAIT, Sofia University "St. Kliment Ohridski"; Nanjing University of Aeronautics and Astronautics; ETH Zürich; University of Amsterdam; Johns Hopkins University; University of Pisa; University of Trento
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, codes, data and benchmark will be released
Abstract:3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. The current Language Gaussian Splatting line of work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of them are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting both capability and insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our codes, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.
[CV-42] PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
[Quick Read]: This paper addresses the limited ability of Vision-Language Models (VLMs) to understand physical phenomena, especially in structured 3D environments. The key to its solution is PhyBlock, a progressive benchmark that evaluates VLMs on physical understanding and planning through robotic 3D block assembly tasks, combining a novel four-level cognitive-hierarchy assembly task with targeted Visual Question Answering (VQA) samples to comprehensively assess spatial reasoning and basic physical comprehension.
Link: https://arxiv.org/abs/2506.08708
Authors: Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang
Institutions: Mohamed bin Zayed University of Artificial Intelligence; Sun Yat-Sen University; Shanghai Jiao Tong University; Tsinghua University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that the performance of VLMs exhibits pronounced limitations in high-level planning and reasoning capabilities, leading to a notable decline in performance for the growing complexity of the tasks. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting spatial tasks heavily rely on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
[CV-43] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering
[Quick Read]: This paper addresses the difficulty of high-quality novel view synthesis for large-scale scenes. Existing methods split a large scene into regions, reconstruct each in 3D, and merge them afterwards; they render specific scenes accurately but generalize poorly, because rigid spatial partitioning struggles with arbitrary camera trajectories and merged regions suffer Gaussian overlap that distorts texture details. The key to its solution is TraGraph-GS, which uses a trajectory graph for high-precision rendering of arbitrarily large scenes, including a graph-based spatial partitioning method with a regularization constraint that improves the rendering of textures and distant objects, and a progressive rendering strategy that mitigates artifacts caused by Gaussian overlap.
Link: https://arxiv.org/abs/2506.08704
Authors: Xiaohan Zhang, Sitong Wang, Yushen Yan, Yi Yang, Mingda Xu, Qi Liu
Institutions: South China University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:High-quality novel view synthesis for large-scale scenes presents a challenging dilemma in 3D computer vision. Existing methods typically partition large scenes into multiple regions, reconstruct a 3D representation using Gaussian splatting for each region, and eventually merge them for novel view rendering. They can accurately render specific scenes, yet they do not generalize effectively for two reasons: (1) rigid spatial partition techniques struggle with arbitrary camera trajectories, and (2) the merging of regions results in Gaussian overlap to distort texture details. To address these challenges, we propose TraGraph-GS, leveraging a trajectory graph to enable high-precision rendering for arbitrarily large-scale scenes. We present a spatial partitioning method for large-scale scenes based on graphs, which incorporates a regularization constraint to enhance the rendering of textures and distant objects, as well as a progressive rendering strategy to mitigate artifacts caused by Gaussian overlap. Experimental results demonstrate its superior performance both on four aerial and four ground datasets and highlight its remarkable efficiency: our method achieves an average improvement of 1.86 dB in PSNR on aerial datasets and 1.62 dB on ground datasets compared to state-of-the-art approaches.
[CV-44] ArrowPose: Segmentation Detection and 5 DoF Pose Estimation Network for Colorless Point Clouds
[Quick Read]: This paper addresses fast detection and 5 DoF pose estimation for colorless point clouds. The key to its solution is a neural network that predicts an object's center and top points, from which the pose is computed; trained on synthetic data, the network performs strongly on a benchmark dataset, outperforms all colorless methods, and runs inference in only 250 milliseconds, making it usable in many scenarios.
Link: https://arxiv.org/abs/2506.08699
Authors: Frederik Hagelskjaer
Institutions: SDU Robotics, Mærsk Mc-Kinney Møller Institute, University of Southern Denmark
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, 4 tables
Abstract:This paper presents a fast detection and 5 DoF (Degrees of Freedom) pose estimation network for colorless point clouds. The pose estimation is calculated from center and top points of the object, predicted by the neural network. The network is trained on synthetic data, and tested on a benchmark dataset, where it demonstrates state-of-the-art performance and outperforms all colorless methods. The network is able to run inference in only 250 milliseconds making it usable in many scenarios. Project page with code at this http URL
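Given predicted center and top points, the 5-DoF pose computation reduces to simple vector geometry. The sketch below shows one plausible parameterization (3-D position plus azimuth/elevation of the object axis); the exact convention used by ArrowPose is not specified in the abstract.

```python
import numpy as np

def pose_from_center_top(center, top):
    """Recover a 5-DoF pose (3-D position + axis direction) from the
    predicted center and top points; the angle parameterization is an
    illustrative choice, not necessarily the paper's."""
    center, top = np.asarray(center, float), np.asarray(top, float)
    axis = top - center
    axis /= np.linalg.norm(axis)                  # unit up-axis of the object
    azimuth = np.arctan2(axis[1], axis[0])        # rotation about z
    elevation = np.arcsin(np.clip(axis[2], -1, 1))
    return center, azimuth, elevation             # 3 + 2 = 5 DoF

pos, az, el = pose_from_center_top([0.1, 0.2, 0.3], [0.1, 0.2, 0.8])
print(pos, az, el)  # a straight-up axis gives elevation = pi/2
```

The missing sixth degree of freedom, rotation about the object's own axis, is exactly what a center/top parameterization cannot express, which is why such methods suit rotationally symmetric or grasp-tolerant objects.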
[CV-45] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
[Quick Read]: This paper addresses the challenge of dense self-supervised learning on video, where complex motion dynamics make pixel- and patch-level representation learning inconsistent; existing methods rely on static augmentations and fail under object deformation, occlusion, and camera movement. The key to its solution is a motion-guided self-supervised framework that clusters dense point tracks to learn spatiotemporally consistent representations: long-range motion trajectories are extracted with a pretrained point tracker, and feature clustering is optimized via a momentum-encoder-based optimal transport mechanism, enforcing feature consistency across views.
Link: https://arxiv.org/abs/2506.08694
Authors: Mohammadreza Salehi, Shashanka Venkataramanan, Ioana Simion, Efstratios Gavves, Cees G. M. Snoek, Yuki M Asano
Institutions: VIS Lab, UvA; Valeo.ai; Fundamental AI Lab, UTN
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint
Abstract:Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: this https URL
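The optimal-transport clustering step can be illustrated with a standard Sinkhorn-Knopp routine in the style of SwAV; this is a generic sketch of balanced soft assignment, not MoSiC's released code, and the epsilon/iteration values are illustrative.

```python
import torch

def sinkhorn_assign(scores, eps=0.05, iters=3):
    """Turn track-to-prototype similarities `scores` (N tracks x K clusters)
    into a balanced soft assignment via Sinkhorn-Knopp normalization."""
    Q = torch.exp(scores / eps).T          # K x N
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: equal cluster mass
        Q /= Q.sum(dim=0, keepdim=True); Q /= N   # cols: one unit per track
    return (Q * N).T                        # N x K, rows sum to ~1

assign = sinkhorn_assign(torch.randn(128, 16))
print(assign.shape, assign.sum(dim=1)[:3])  # torch.Size([128, 16]), ~1.0 each
```

The balancing constraint is what prevents the degenerate solution where all tracks collapse into one cluster, a standard failure mode of naive clustering objectives in self-supervised learning.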
[CV-46] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism ACL2025
[Quick Read]: This paper addresses the limited performance of Large Vision-Language Models (LVLMs) on complex visual reasoning, even with Chain-of-Thought prompting. The key to its solution is VReST, a training-free method that strengthens reasoning through Monte Carlo Tree Search and a Self-Reward mechanism: VReST explores the reasoning space by building a search tree whose nodes are reasoning steps and whose paths are complete reasoning sequences, while a multimodal self-reward mechanism scores the quality of each step, improving performance on multimodal mathematical reasoning benchmarks.
Link: https://arxiv.org/abs/2506.08691
Authors: Congzhi Zhang, Jiawei Peng, Zhenglin Wang, Yilong Lai, Haowen Sun, Heng Chang, Fei Ma, Weijiang Yu
Institutions: Sun Yat-Sen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy; Huawei Technologies Co., Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2025 main
Abstract:Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
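To make the search procedure concrete, here is a generic Monte Carlo Tree Search skeleton with UCB selection; the `expand` and `reward` callables stand in for the LVLM's step proposal and the multimodal self-reward scoring, which are the paper's actual components and are only mocked here.

```python
import math, random

class Node:
    def __init__(self, step, parent=None):
        self.step, self.parent = step, parent   # `step` = one reasoning step
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, reward, rollouts=100):
    for _ in range(rollouts):
        node, path = root, [root.step]
        while node.children:                        # 1. selection
            node = max(node.children, key=ucb)
            path.append(node.step)
        for s in expand(path):                      # 2. expansion
            node.children.append(Node(s, parent=node))
        leaf = random.choice(node.children) if node.children else node
        r = reward(path + ([leaf.step] if leaf is not node else []))  # 3. score
        while leaf:                                 # 4. backpropagation
            leaf.visits += 1; leaf.value += r; leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)

root = Node("question")
best = mcts(root, expand=lambda p: [f"step{len(p)}a", f"step{len(p)}b"],
            reward=lambda p: random.random())
print(best.step, best.visits)
```

In VReST's setting, the reward would combine sub-question utility, answer correctness, and vision-language clue relevance rather than the random placeholder used above.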
[CV-47] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities
[Quick Read]: This paper addresses the difficulty of high-resolution wildfire forecasting, especially over Canadian boreal ecosystems, where reliance on coarse-resolution environmental drivers and satellite products limits prediction resolution to roughly 0.1°. The key to its solution is the CanadaFireSat benchmark dataset and deep learning baselines built on multi-modal data, combining high-resolution multispectral satellite imagery (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis). With multi-modal temporal inputs, the model reaches an F1 score of 60.3% on the extreme 2023 wildfire season, demonstrating the potential of high-resolution, continental-scale wildfire forecasting.
Link: https://arxiv.org/abs/2506.08690
Authors: Hugo Porta, Emanuele Dalsasso, Jessica L. McCarty, Devis Tuia
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 11 figures
Abstract:Canada experienced in 2023 one of the most severe wildfire seasons in recent history, causing damage across ecosystems, destroying communities, and emitting large quantities of CO2. This extreme wildfire season is symptomatic of a climate-change-induced increase in the length and severity of the fire season that affects the boreal ecosystem. Therefore, it is critical to empower wildfire management in boreal communities with better mitigation solutions. Wildfire probability maps represent an important tool for understanding the likelihood of wildfire occurrence and the potential severity of future wildfires. The massive increase in the availability of Earth observation data has enabled the development of deep learning-based wildfire forecasting models, aiming at providing precise wildfire probability maps at different spatial and temporal scales. A main limitation of such methods is their reliance on coarse-resolution environmental drivers and satellite products, leading to wildfire occurrence prediction of reduced resolution, typically around ~0.1°. This paper presents a benchmark dataset, CanadaFireSat, and baseline methods for high-resolution (100 m) wildfire forecasting across Canada, leveraging multi-modal data from high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Our experiments consider two major deep learning architectures. We observe that using multi-modal temporal inputs outperforms single-modal temporal inputs across all metrics, achieving a peak performance of 60.3% in F1 score for the 2023 wildfire season, a season never seen during model training. This demonstrates the potential of multi-modal deep learning models for wildfire forecasting at high resolution and continental scale.
[CV-48] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
[Quick Read]: This paper addresses the weakness of vision-language models such as CLIP in fine-grained, region-level understanding, which limits their effectiveness on dense prediction tasks. The key to its solution is Any-to-Any Self-Distillation (ATAS), which simultaneously strengthens semantic coherence and fine-grained vision-language alignment by exploiting the model's own knowledge across all representation levels; it needs no extra modules or supervised fine-tuning, refining the CLIP vision encoder's representations using only unlabeled images and an internal self-distillation process.
Link: https://arxiv.org/abs/2506.08678
Authors: Juan Yeo, Soonwoo Cha, Jiwoo Song, Hyunbin Jin, Taesup Kim
Institutions: Seoul National University; RIST; Samsung Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.
[CV-49] LLaVA-c: Continual Improved Visual Instruction Tuning
[Quick Read]: This paper addresses the challenges multitask learning faces under continual learning: task balancing and base-model degradation as new tasks are introduced. The key to its solution is two modifications to LLaVA-1.5: spectral-aware consolidation for better task balance, and unsupervised inquiry regularization to prevent base-model degradation, so the model preserves general capabilities while improving task-specific performance throughout continual pretraining and fine-tuning.
Link: https://arxiv.org/abs/2506.08666
Authors: Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, Cheng-Lin Liu
Institutions: School of Artificial Intelligence, UCAS; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; Centre for Artificial Intelligence and Robotics, HKISI-CAS; Huawei Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal models like LLaVA-1.5 achieve state-of-the-art visual understanding through visual instruction tuning on multitask datasets, enabling strong instruction-following and multimodal performance. However, multitask learning faces challenges such as task balancing, requiring careful adjustment of data proportions, and expansion costs, where new tasks risk catastrophic forgetting and need costly retraining. Continual learning provides a promising alternative to acquiring new knowledge incrementally while preserving existing capabilities. However, current methods prioritize task-specific performance, neglecting base model degradation from overfitting to specific instructions, which undermines general capabilities. In this work, we propose a simple but effective method with two modifications on LLaVA-1.5: spectral-aware consolidation for improved task balance and unsupervised inquiry regularization to prevent base model degradation. We evaluate both general and task-specific performance across continual pretraining and fine-tuning. Experiments demonstrate that LLaVA-c consistently enhances standard benchmark performance and preserves general capabilities. For the first time, we show that task-by-task continual learning can achieve results that match or surpass multitask joint learning. The code will be publicly released.
[CV-50] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping
[Quick Read]: This paper addresses consistent color reproduction across multiple cameras, which is essential for image fusion and Image Processing Pipeline (ISP) compatibility but hard to achieve because of differences in sensors and optics. The key to its solution is the Neural Physical Model (NPM), a lightweight, physically informed approach that simulates raw images under specified illumination to estimate device-to-device transformations; it adapts to varying illumination, can be initialized from physical measurements, and supports training with or without paired data. Experiments on public datasets such as NUS and BeyondRGB show that NPM outperforms recent state-of-the-art methods, providing robust chromatic consistency across sensors and optical systems.
Link: https://arxiv.org/abs/2506.08650
Authors: Peter Grönquist, Stepan Tulyakov, Dengxin Dai
Institutions: Huawei Research Zürich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Achieving consistent color reproduction across multiple cameras is essential for seamless image fusion and Image Processing Pipeline (ISP) compatibility in modern devices, but it is a challenging task due to variations in sensors and optics. Existing raw-to-raw conversion methods face limitations such as poor adaptability to changing illumination, high computational costs, or impractical requirements such as simultaneous camera operation and overlapping fields-of-view. We introduce the Neural Physical Model (NPM), a lightweight, physically-informed approach that simulates raw images under specified illumination to estimate transformations between devices. The NPM effectively adapts to varying illumination conditions, can be initialized with physical measurements, and supports training with or without paired data. Experiments on public datasets like NUS and BeyondRGB demonstrate that NPM outperforms recent state-of-the-art methods, providing robust chromatic consistency across different sensors and optical systems.
[CV-51] Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization
[Quick Read]: This paper addresses the insufficient representation of motion features in video memorability prediction, as well as the strong subjectivity of video summarization labels. The key to its solution is the Text-Motion Cross-modal Contrastive Loss (TMCCL), which uses text-description similarities across videos to construct positive and negative motion sample sets, strengthening motion feature representations and improving memorability prediction accuracy. To explore applications of memorability prediction, the paper further proposes Memorability Weighted Correction for Video Summarization (MWCVS), which uses predicted memorability to reduce subjectivity in video summarization labels.
Link: https://arxiv.org/abs/2506.08649
Authors: Zhiyi Zhu, Xiaoyu Wu, Youwei Lu
Institutions: Communication University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video memorability refers to the ability of videos to be recalled after viewing, playing a crucial role in creating content that remains memorable. Existing models typically focus on extracting multimodal features to predict video memorability scores but often fail to fully utilize motion cues. The representation of motion features is compromised during the fine-tuning phase of the motion feature extractor due to a lack of labeled data. In this paper, we introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal video memorability prediction model designed to enhance the representation of motion features. We tackle the challenge of improving motion feature representation by leveraging text description similarities across videos to establish positive and negative motion sample sets for a given target. This enhancement allows the model to learn similar feature representations for semantically related motion content, resulting in more accurate memorability predictions. Our model achieves state-of-the-art performance on two video memorability prediction datasets. Moreover, the potential applications of video memorability prediction have been underexplored. To address this gap, we present Memorability Weighted Correction for Video Summarization (MWCVS), using video memorability prediction to reduce subjectivity in video summarization labels. Experimental results on two video summarization datasets demonstrate the effectiveness of MWCVS, showcasing the promising applications of video memorability prediction.
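One way to realize a contrastive loss whose positives come from text-description similarity is sketched below; the threshold-based positive mask and the temperature are assumptions for illustration, not TMCCL's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_guided_contrastive(motion_emb, text_sim, tau=0.07, thresh=0.5):
    """Videos whose text descriptions are similar (text_sim > thresh) are
    treated as positives for the motion embeddings (supervised-contrastive
    style). `thresh` and `tau` are illustrative hyperparameters."""
    z = F.normalize(motion_emb, dim=1)                  # N x D
    logits = z @ z.T / tau
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (text_sim > thresh) & ~eye                    # positive-pair mask
    logits = logits.masked_fill(eye, float("-inf"))     # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    masked = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    per_anchor = masked.sum(1) / pos.sum(1).clamp(min=1)
    return -(per_anchor[pos.any(1)]).mean()             # anchors with >=1 positive

loss = text_guided_contrastive(torch.randn(8, 128), torch.rand(8, 8))
print(loss.item())
```

The design point is that the positive set is defined in the text modality but the loss is applied to motion features, which is what pulls semantically related motion content together without motion labels.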
[CV-52] Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
[Quick Read]: This paper addresses the limited progress of time series foundation models (TSFMs) caused by the scarcity of publicly available time series datasets. The key to its solution is the Time Vision Transformer (TiViT), a framework that converts time series into images in order to exploit the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets.
Link: https://arxiv.org/abs/2506.08641
Authors: Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata
Institutions: Helmholtz Munich; Technical University of Munich; Munich Center for Machine Learning; MDSI; Paris Noah's Ark Lab
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal yet another direction for reusing vision representations in a non-visual domain.
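A minimal version of the series-to-image step might look as follows; the reshaping scheme, normalization, and output size are illustrative assumptions, with the resulting image meant to be fed to a frozen, image-pretrained ViT encoder (the paper uses large OpenCLIP models).

```python
import torch
import torch.nn.functional as F

def series_to_image(ts, height=32, size=224):
    """Turn a univariate time series (length T) into a 3-channel image:
    reshape into a 2-D grid, normalize, resize, replicate channels.
    The exact imaging scheme in the paper may differ."""
    T = ts.numel()
    width = -(-T // height)                       # ceil division
    pad = width * height - T
    grid = F.pad(ts, (0, pad)).reshape(1, 1, height, width)
    grid = (grid - grid.mean()) / (grid.std() + 1e-6)
    img = F.interpolate(grid, size=(size, size), mode="bilinear",
                        align_corners=False)
    return img.repeat(1, 3, 1, 1)                 # 1 x 3 x 224 x 224

img = series_to_image(torch.sin(torch.linspace(0, 20, 500)))
print(img.shape)  # feed to a frozen, pretrained ViT encoder
```

Stacking consecutive windows into rows like this is what lets the ViT's 2-D patches cover several periods of the signal at once, which matches the paper's argument about 2-D patching increasing the number of label-relevant tokens.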
[CV-53] Orientation Matters: Making 3D Generative Models Orientation-Aligned
[Quick Read]: This paper addresses the misaligned object orientations produced by existing 3D generative models in single-image generation, a consequence of inconsistent training data that limits their use in downstream tasks. The key to its solution is the task of orientation-aligned 3D object generation, supported by Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories; fine-tuning two representative 3D generative models on this dataset enables them to generate objects with consistent orientations across categories and to generalize to unseen objects.
Link: https://arxiv.org/abs/2506.08640
Authors: Yichong Lu, Yuzhuo Tian, Zijin Jiang, Yikun Zhao, Yuanbo Yang, Hao Ouyang, Haoji Hu, Huimin Yu, Yujun Shen, Yiyi Liao
Institutions: Zhejiang University; Ant Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.
[CV-54] SurfR: Surface Reconstruction with Multi-scale Attention
[Quick Read]: This paper addresses fast and accurate surface reconstruction for unorganized point clouds, where existing learning methods trade performance for speed: single-object representations deliver high surface detail but require per-object training, while generalized representations extend to new shapes but lack detail and infer slowly. The key to its solution lies in three contributions: a "lazy query" strategy that avoids using query points in early feature extraction to speed up reconstruction, a parallel multi-scale grid representation that yields features robust to different noise levels and input resolutions, and cross-scale attention that improves reconstruction, achieving the best accuracy-speed trade-off.
Link: https://arxiv.org/abs/2506.08635
Authors: Siddhant Ranade, Gonçalo Dias Pais, Ross Tyler Whitaker, Jacinto C. Nascimento, Pedro Miraldo, Srikumar Ramalingam
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in 3DV 2025
Abstract:We propose a fast and accurate surface reconstruction algorithm for unorganized point clouds using an implicit representation. Recent learning methods are either single-object representations with small neural models that allow for high surface details but require per-object training or generalized representations that require larger models and generalize to newer shapes but lack details, and inference is slow. We propose a new implicit representation for general 3D shapes that is faster than all the baselines at their optimum resolution, with only a marginal loss in performance compared to the state-of-the-art. We achieve the best accuracy-speed trade-off using three key contributions. Many implicit methods extract features from the point cloud to classify whether a query point is inside or outside the object. First, to speed up the reconstruction, we show that this feature extraction does not need to use the query point at an early stage (lazy query). Second, we use a parallel multi-scale grid representation to develop robust features for different noise levels and input resolutions. Finally, we show that attention across scales can provide improved reconstruction results.
[CV-55] MOSAIC-F: A Framework for Enhancing Students' Oral Presentation Skills through Personalized Feedback
[Quick Read]: This paper addresses the lack of personalization and multi-dimensional analysis in traditional feedback on student learning activities, particularly for improving oral presentation skills. The key to its solution is MOSAIC-F, a multimodal feedback framework integrating Multimodal Learning Analytics (MMLA), observations, sensors, artificial intelligence (AI), and collaborative assessment to generate personalized feedback. Its core steps are rubric-based assessment by peers and professors, multimodal data collection, AI-driven generation of personalized feedback, and student self-assessment with feedback visualization, combining human- and data-driven evaluation for more accurate, personalized, and actionable feedback.
Link: https://arxiv.org/abs/2506.08634
Authors: Alvaro Becerra, Daniel Andres, Pablo Villegas, Roberto Daza, Ruth Cobos
Institutions: GHIA Group, School of Engineering, Universidad Autónoma de Madrid, Spain; BiDA-Lab Group, School of Engineering, Universidad Autónoma de Madrid, Spain
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in LASI Spain 25: Learning Analytics Summer Institute Spain 2025
Abstract:In this article, we present a novel multimodal feedback framework called MOSAIC-F, an acronym for a data-driven Framework that integrates Multimodal Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI), and Collaborative assessments for generating personalized feedback on student learning activities. This framework consists of four key steps. First, peers and professors’ assessments are conducted through standardized rubrics (that include both quantitative and qualitative evaluations). Second, multimodal data are collected during learning activities, including video recordings, audio capture, gaze tracking, physiological signals (heart rate, motion data), and behavioral interactions. Third, personalized feedback is generated using AI, synthesizing human-based evaluations and data-based multimodal insights such as posture, speech patterns, stress levels, and cognitive load, among others. Finally, students review their own performance through video recordings and engage in self-assessment and feedback visualization, comparing their own evaluations with peers and professors’ assessments, class averages, and AI-generated recommendations. By combining human-based and data-based evaluation techniques, this framework enables more accurate, personalized and actionable feedback. We tested MOSAIC-F in the context of improving oral presentation skills.
[CV-56] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
[Quick Read]: This paper addresses the limited cross-platform generalization in video-conditioned robot learning caused by the lack of diverse, high-quality datasets, focusing on the key step of swapping the robotic arm in one video with another. The key to its solution is RoboSwap, a framework that operates on unpaired data from diverse environments and combines the complementary strengths of GANs and diffusion models: robotic arms are segmented from their backgrounds and translated by an unpaired GAN, then blended with the original video background and refined by a diffusion model to improve structural coherence and motion realism.
Link: https://arxiv.org/abs/2506.08632
Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.
[CV-57] ECMNet: Lightweight Semantic Segmentation with Efficient CNN-Mamba Network
[Quick Read]: This paper addresses the inadequate global context modeling of CNN and Transformer models in semantic segmentation, as well as the problem of effectively fusing features from different levels to improve accuracy. The key to its solution is the lightweight Efficient CNN-Mamba Network (ECMNet), which couples CNN and Mamba within a capsule-based framework to offset their respective weaknesses; it further introduces an Enhanced Dual-Attention Block (EDAB), a Multi-Scale Attention Unit (MSAU), and a Mamba-enhanced Feature Fusion Module (FFM) to strengthen feature representation and multi-scale fusion, markedly improving segmentation accuracy while keeping the model efficient.
Link: https://arxiv.org/abs/2506.08629
Authors: Feixiang Du, Shengkun Wu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures, 4 tables
Abstract:In the past decade, Convolutional Neural Networks (CNNs) and Transformers have found wide application in semantic segmentation tasks. Although combining CNNs with Transformer models greatly improves performance, global context modeling remains inadequate. Recently, Mamba has shown great potential in vision tasks, with advantages in modeling long-range dependencies. In this paper, we propose a lightweight Efficient CNN-Mamba Network for semantic segmentation, dubbed ECMNet. ECMNet skillfully combines CNN with Mamba in a capsule-based framework to address their complementary weaknesses. Specifically, we design an Enhanced Dual-Attention Block (EDAB) for the lightweight bottleneck. To improve the feature representation ability, we devise a Multi-Scale Attention Unit (MSAU) to integrate multi-scale feature aggregation, spatial aggregation, and channel aggregation. Moreover, a Mamba-enhanced Feature Fusion Module (FFM) merges features from diverse levels, significantly enhancing segmentation accuracy. Extensive experiments on two representative datasets demonstrate that the proposed model strikes an excellent balance between accuracy and efficiency, achieving 70.6% mIoU on the Cityscapes and 73.6% mIoU on the CamVid test datasets, with 0.87M parameters and 8.27G FLOPs on a single RTX 3090 GPU platform.
[CV-58] A Probability-guided Sampler for Neural Implicit Surface Rendering ECCV2024
[Quick Read]: This paper addresses a scalability issue of Neural Radiance Fields (NeRFs): training cannot use all possible inputs (every pixel and every potential 3D point along the projection rays), which affects synthesized image quality and surface reconstruction accuracy. The key to its solution is to exploit the implicit surface representation of the foreground scene and model a probability density function in a 3D image projection space, enabling sampling of rays more targeted toward regions of interest and thus better rendering. A new surface reconstruction loss is also proposed that fully exploits this model and incorporates near-surface and empty-space components, further improving overall performance.
Link: https://arxiv.org/abs/2506.08619
Authors: Gonçalo Dias Pais, Valter Piedade, Moitreya Chatterjee, Marcus Greiff, Pedro Miraldo
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ECCV 2024
Abstract:Several variants of Neural Radiance Fields (NeRFs) have significantly improved the accuracy of synthesized images and surface reconstruction of 3D scenes/objects. In all of these methods, a key characteristic is that none can train the neural network with every possible input data, specifically, every pixel and potential 3D point along the projection rays due to scalability issues. While vanilla NeRFs uniformly sample both the image pixels and 3D points along the projection rays, some variants focus only on guiding the sampling of the 3D points along the projection rays. In this paper, we leverage the implicit surface representation of the foreground scene and model a probability density function in a 3D image projection space to achieve a more targeted sampling of the rays toward regions of interest, resulting in improved rendering. Additionally, a new surface reconstruction loss is proposed for improved performance. This new loss fully explores the proposed 3D image projection space model and incorporates near-to-surface and empty space components. By integrating our novel sampling strategy and novel loss into current state-of-the-art neural implicit surface renderers, we achieve more accurate and detailed 3D reconstructions and improved image rendering, especially for the regions of interest in any given scene.
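The guided-sampling idea boils down to drawing ray depths by inverse-CDF sampling from a discrete probability density over depth bins, as in the classic hierarchical sampling step of NeRF; the sketch below implements that generic step only and does not model the paper's 3D image projection space.

```python
import torch

def sample_along_ray(bin_edges, weights, n_samples):
    """Inverse-CDF sampling of depths along rays from a discrete PDF.
    bin_edges: (R, B+1) depth bin boundaries; weights: (R, B) bin masses."""
    pdf = weights / weights.sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # prepend 0
    u = torch.rand(*cdf.shape[:-1], n_samples)                      # uniforms
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    lo, hi = cdf.gather(-1, idx - 1), cdf.gather(-1, idx)
    t = (u - lo) / (hi - lo + 1e-8)                 # position within the bin
    e_lo = bin_edges.gather(-1, idx - 1)
    e_hi = bin_edges.gather(-1, idx)
    return e_lo + t * (e_hi - e_lo)                 # sampled depths

edges = torch.linspace(2.0, 6.0, 65).repeat(4, 1)    # 4 rays, 64 bins
w = torch.rand(4, 64)
print(sample_along_ray(edges, w, 16).shape)          # torch.Size([4, 16])
```

The paper's contribution is where the weights come from: a surface-aware density modeled in projection space rather than the uniform or coarse-pass weights used by vanilla NeRF.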
[CV-59] HSG-12M: A Large-Scale Spatial Multigraph Dataset
[Quick Read]: This paper addresses a limitation of traditional graph benchmarks for modeling physical systems: they assume non-spatial simple edges, collapsing physically distinct paths into a single connection. The key to its solution is HSG-12M, the first large-scale spatial multigraph dataset, in which multiple geometrically distinct trajectories between nodes are kept as separate edges embedded in a metric space. It comprises 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs derived from 177 TB of spectral potential data, encoding the full geometry of 1-D crystal energy spectra and yielding diverse, physics-grounded topologies. The paper also releases the Poly2Graph pipeline for mapping arbitrary 1-D crystal Hamiltonians to spectral graphs, advancing geometry-aware graph learning.
Link: https://arxiv.org/abs/2506.08618
Authors: Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Other Condensed Matter (cond-mat.other); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 13 figures, 3 tables. Code pipeline: [this https URL] Dataset: [this https URL] Dataset released under CC BY 4.0
Abstract:Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of spatial multigraphs: graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal's energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release Poly2Graph: a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities of data-driven scientific discovery in condensed matter physics and beyond.
[CV-60] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything
[Quick Read]: This paper addresses the difficulty of visualizing floating marine debris in multispectral imagery, where the compositional heterogeneity of debris makes visual identification in medium-resolution imagery hard; traditionally, domain experts pick bands and spectral indices case by case based on experience and heuristics, without a systematic or automated procedure. The key to its solution is the SAMSelect algorithm, which uses the Segment Anything Model on a small annotated dataset to select the band or index combination achieving the best classification accuracy, producing the three-channel visualization with the best segmentation and thus more accurate visual information for interpreting remote sensing imagery.
Link: https://arxiv.org/abs/2506.08613
Authors: Joost van Dalen, Yuki M. Asano, Marc Russwurm
Institutions: Wageningen University; University of Technology Nuremberg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work proposes SAMSelect, an algorithm to obtain a salient three-channel visualization for multispectral images. We develop SAMSelect and show its use for marine scientists visually interpreting floating marine debris in Sentinel-2 imagery. Such debris is notoriously difficult to visualize due to its compositional heterogeneity in medium-resolution imagery. Despite these difficulties, visual interpretation of imagery showing marine debris remains a common practice by domain experts, who select bands and spectral indices on a case-by-case basis informed by common practices and heuristics. SAMSelect selects the band or index combination that achieves the best classification accuracy on a small annotated dataset through the Segment Anything Model. Its central assumption is that the three-channel visualization that achieves the most accurate segmentation results also provides good visual information for photo-interpretation. We evaluate SAMSelect in three Sentinel-2 scenes containing generic marine debris in Accra, Ghana, and Durban, South Africa, and deployed plastic targets from the Plastic Litter Project. This reveals the potential of new previously unused band combinations (e.g., a normalized difference index of B8, B2), which demonstrate improved performance compared to literature-based indices. We describe the algorithm in this paper and provide an open-source code repository that will be helpful for domain scientists doing visual photo interpretation, especially in the marine field.
[CV-61] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective
[Quick Read]: This paper addresses the degraded performance of deep learning models in data-deficient settings, aiming to improve their data efficiency. The key to its solution is incorporating prior knowledge: successful challenge entries combine large model ensembles that mix Transformers and convolutional neural networks (CNNs) with heavy data augmentation to improve training with limited data.
Link: https://arxiv.org/abs/2506.08612
Authors: Robert-Jan Bruintjes, Attila Lengyel, Osman Semih Kayhan, Davide Zambrano, Nergis Tömen, Hadi Jamali-Rad, Jan van Gemert
Institutions: Delft University of Technology; Haga Teaching Hospital, The Hague; Sportradar Group AG; Shell
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep Learning requires large amounts of data to train models that work well. In data-deficient settings, performance can be degraded. We investigate which Deep Learning methods benefit training models in a data-deficient setting, by organizing the “VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning” workshop series, featuring four editions of data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate prior knowledge to improve the data efficiency of deep learning models. Successful challenge entries make use of large model ensembles that mix Transformers and CNNs, as well as heavy data augmentation. Novel prior knowledge-based methods contribute to success in some entries.
[CV-62] Towards Class-wise Fair Adversarial Training via Anti-Bias Soft Label Distillation
[Quick Read]: This paper addresses the robust fairness problem in Adversarial Training (AT) and Adversarial Robustness Distillation (ARD): models show uneven adversarial robustness across classes, strong on some (easy) classes and weak on others (hard) classes. The key to its solution is Anti-Bias Soft Label Distillation (ABSLD) within the knowledge distillation framework, which adaptively reduces the student's error-risk gap between classes by adjusting the class-wise smoothness of the teacher's soft labels, realized by assigning different temperatures to different classes, thereby improving adversarial robustness fairness.
Link: https://arxiv.org/abs/2506.08611
Authors: Shiji Zhao, Chi Chen, Ranjie Duan, Xizhe Wang, Xingxing Wei
Institutions: Beihang University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2312.05508
Abstract:Adversarial Training (AT) is widely recognized as an effective approach to enhance the adversarial robustness of Deep Neural Networks. As a variant of AT, Adversarial Robustness Distillation (ARD) has shown outstanding performance in enhancing the robustness of small models. However, both AT and ARD face a robust fairness issue: these models tend to display strong adversarial robustness against some classes (easy classes) while demonstrating weak adversarial robustness against others (hard classes). This paper explores the underlying factors of this problem and points out that the smoothness degree of soft labels for different classes significantly impacts robust fairness, from both empirical observation and theoretical analysis. Based on the above exploration, we propose Anti-Bias Soft Label Distillation (ABSLD) within the Knowledge Distillation framework to enhance adversarial robust fairness. Specifically, ABSLD adaptively reduces the student's error risk gap between different classes, which is accomplished by adjusting the class-wise smoothness degree of the teacher's soft labels during the training process, and the adjustment is managed by assigning varying temperatures to different classes. Additionally, as a label-based approach, ABSLD is highly adaptable and can be integrated with sample-based methods. Extensive experiments demonstrate that ABSLD outperforms state-of-the-art methods on the comprehensive performance of robustness and fairness.
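The core mechanism, distilling with a per-class temperature on the soft labels, can be sketched as below; how the temperatures are scheduled during training is the paper's actual contribution and is omitted here, so the fixed values are placeholders.

```python
import torch
import torch.nn.functional as F

def classwise_temp_kd(student_logits, teacher_logits, labels, class_temp):
    """KD loss with a per-class temperature on the teacher's soft labels,
    in the spirit of ABSLD. `class_temp` is a (num_classes,) tensor of
    temperatures; here they are fixed placeholders."""
    T = class_temp[labels].unsqueeze(1)                # per-sample temperature
    p_teacher = F.softmax(teacher_logits / T, dim=1)   # smoother for some classes
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # standard KD scaling by T^2 so gradient magnitudes stay comparable
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(1)
    return (kl * T.squeeze(1) ** 2).mean()

s, t = torch.randn(16, 10), torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
temps = torch.full((10,), 4.0)   # ABSLD would raise/lower these per class
print(classwise_temp_kd(s, t, y, temps).item())
```

Raising the temperature for a class smooths its soft labels, which is the lever the method uses to rebalance the student's error risk between easy and hard classes.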
[CV-63] Transformers Meet Hyperspectral Imaging: A Comprehensive Study of Models Challenges and Open Problems
[Quick Read]: This paper addresses the challenges of applying Transformer architectures to hyperspectral imaging (HSI) classification, particularly modeling long-range dependencies in HSI data. Its key lies in systematically analyzing and comparing each stage of the typical pipeline, including pre-processing, spatial-spectral feature extraction, multi-head self-attention variants, and loss design, contrasted against the distinctive spatial-spectral properties of HSI, to overcome obstacles such as scarce labeled data, extreme spectral dimensionality, heavy computational overhead, and limited model explainability.
Link: https://arxiv.org/abs/2506.08596
Authors: Guyang Zhang, Waleed Abdulla
Institutions: University of Auckland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformers have become the architecture of choice for learning long-range dependencies, yet their adoption in hyperspectral imaging (HSI) is still emerging. We reviewed more than 300 papers published up to 2025 and present the first end-to-end survey dedicated to Transformer-based HSI classification. The study categorizes every stage of a typical pipeline-pre-processing, patch or pixel tokenization, positional encoding, spatial-spectral feature extraction, multi-head self-attention variants, skip connections, and loss design-and contrasts alternative design choices with the unique spatial-spectral properties of HSI. We map the field’s progress against persistent obstacles: scarce labeled data, extreme spectral dimensionality, computational overhead, and limited model explainability. Finally, we outline a research agenda prioritizing valuable public data sets, lightweight on-edge models, illumination and sensor shifts robustness, and intrinsically interpretable attention mechanisms. Our goal is to guide researchers in selecting, combining, or extending Transformer components that are truly fit for purpose for next-generation HSI applications.
[CV-64] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers
[Quick Read]: This paper addresses the excessive computational and memory costs caused by the huge parameter counts of large vision transformer models. Its core solution is the Diversity-Guided MLP Reduction (DGMR) method, which applies a Gram-Schmidt weight pruning strategy to remove redundant neurons from MLP hidden layers while preserving weight diversity for better performance recovery during knowledge distillation. The method dramatically reduces parameters and computation with almost no performance loss, using only a tiny amount of unlabeled data.
Link: https://arxiv.org/abs/2506.08591
Authors: Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang
Institutions: Central South University; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:Transformer models exhibit an excellent scaling property, with performance improving as model capacity grows. However, large-scale model parameters incur unaffordable computing and memory costs. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules account for the majority of model parameters. To this end, we focus on the recoverability of compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons in the MLP hidden layer, while preserving weight diversity for better performance recovery during distillation. Compared to a model trained from scratch, our pruned model requires only 0.06% of the LAION-2B data (used for training large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0% parameter and FLOPs reduction in a near-lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at this https URL.
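A plausible reading of the Gram-Schmidt pruning step is a greedy selection of hidden neurons with maximal residual norm after orthogonal projection onto the span of already-kept neurons; the numpy sketch below implements that interpretation and is not the released implementation.

```python
import numpy as np

def gram_schmidt_prune(W, keep):
    """Greedy Gram-Schmidt neuron selection: repeatedly keep the neuron
    whose weight vector has the largest residual after projecting out
    the span of already-kept neurons. Rows of W = hidden neurons."""
    R = W.astype(np.float64).copy()
    kept = []
    for _ in range(keep):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))            # most "diverse" remaining neuron
        kept.append(i)
        q = R[i] / (norms[i] + 1e-12)        # orthonormal direction
        R = R - np.outer(R @ q, q)           # remove component along q
    return sorted(kept)                      # indices of neurons to keep

W = np.random.randn(64, 128)                 # 64 hidden neurons, 128 inputs
print(gram_schmidt_prune(W, keep=16))
```

Because each selected neuron's residual collapses to zero, the procedure naturally avoids reselecting it, and the kept set spans as much of the weight space as possible, which matches the "preserving weight diversity" motivation.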
[CV-65] Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
[Quick Read]: This paper addresses the scarcity of fine-grained cross-modal alignment annotations in Vision-Language Navigation (VLN): existing datasets focus on global instruction-trajectory matching while neglecting sub-instruction-level and entity-level alignment, both critical for accurate navigation action decisions. The key to its solution is FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations through sub-trajectory segmentation, GLIP-based landmark detection, instruction construction, OFA-Speaker-based R2R-like instruction generation, and CLIP-based entity selection, yielding sub-instruction-trajectory pairs with entity-landmark annotations that are finally aggregated into complete instruction-trajectory pairs.
Link: https://arxiv.org/abs/2506.08566
Authors: Yibo Cui, Liang Xie, Yu Zhao, Jiawei Sun, Erwei Yin
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection, generating sub-instruction-trajectory pairs with entity-landmark annotations. Finally, these sub-pairs are aggregated to form a complete instruction-trajectory pair. The framework generates the FCA-R2R dataset, the first large-scale augmentation dataset featuring precise sub-instruction-sub-trajectory and entity-landmark alignments. Extensive experiments demonstrate that training with FCA-R2R significantly improves the performance of multiple state-of-the-art VLN agents, including SF, EnvDrop, RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances agents’ state awareness and decision accuracy, while entity-landmark alignment further boosts navigation performance and generalization. These results highlight the effectiveness of FCA-NIG in generating high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.
[CV-66] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection
Quick Read: This paper addresses catastrophic forgetting in Incremental Object Detection (IOD), where models must keep learning new objects, as well as the limited performance and inference efficiency of existing approaches. The key to the solution is a new framework, Hier-DETR: Hierarchical Neural Collapse Detection Transformer, which exploits Neural Collapse to handle class-imbalanced data and leverages the hierarchical relations among class labels, achieving competitive detection performance while remaining efficient.
Link: https://arxiv.org/abs/2506.08562
Authors: Duc Thanh Pham, Hong Dang Nguyen, Nhat Minh Nguyen Quoc, Linh Ngo Van, Sang Dinh Viet, Duc Anh Nguyen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, object detection models have witnessed notable performance improvements, particularly with transformer-based models. However, new objects frequently appear in the real world, requiring detection models to continually learn without suffering from catastrophic forgetting. Although Incremental Object Detection (IOD) has emerged to address this challenge, existing models are still impractical due to their limited performance and prolonged inference time. In this paper, we introduce a novel framework for IOD, called Hier-DETR: Hierarchical Neural Collapse Detection Transformer, which ensures both efficiency and competitive performance by leveraging Neural Collapse for imbalanced datasets and the hierarchical relation of class labels.
[CV-67] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement
Quick Read: This paper targets cross-subject electromyography (EMG) pattern recognition, which is hampered by inter-subject differences in muscle anatomy, electrode placement, and signal characteristics. Conventional methods adapt to new users with subject-specific calibration data, which is time-consuming and impractical at scale. The key to the solution is calibration-free cross-subject generalization through feature disentanglement: an end-to-end dual-branch adversarial neural network disentangles EMG features into pattern-specific and subject-specific components while jointly performing pattern recognition and individual identification, enabling robust recognition for new users without model calibration as well as downstream applications such as task-invariant biometric identification.
Link: https://arxiv.org/abs/2506.08555
Authors: Xinyue Niu, Akira Furui
Institutions: Hiroshima University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 3 figures. This work has been accepted for presentation at the IEEE Engineering in Medicine and Biology Conference (EMBC) 2025
Abstract:Cross-subject electromyography (EMG) pattern recognition faces significant challenges due to inter-subject variability in muscle anatomy, electrode placement, and signal characteristics. Traditional methods rely on subject-specific calibration data to adapt models to new users, an approach that is both time-consuming and impractical for large-scale, real-world deployment. This paper presents an approach to eliminate calibration requirements through feature disentanglement, enabling effective cross-subject generalization. We propose an end-to-end dual-branch adversarial neural network that simultaneously performs pattern recognition and individual identification by disentangling EMG features into pattern-specific and subject-specific components. The pattern-specific components facilitate robust pattern recognition for new users without model calibration, while the subject-specific components enable downstream applications such as task-invariant biometric identification. Experimental results demonstrate that the proposed model achieves robust performance on data from unseen users, outperforming various baseline methods in cross-subject scenarios. Overall, this study offers a new perspective for cross-subject EMG pattern recognition without model calibration and highlights the proposed model’s potential for broader applications, such as task-independent biometric systems.
[CV-68] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge
Quick Read: This paper addresses multimodal understanding and reasoning for complex egocentric video question answering, specifically on HD-EPIC, a high-resolution dataset of dynamic everyday activities. The key to the solution is combining two methods: SceneNet uses scene graphs generated by a multimodal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events, while KnowledgeNet injects external commonsense knowledge from ConceptNet to establish high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. The two methods show complementary strengths across the seven categories of the HD-EPIC benchmark, and their combination within the overall framework reaches 44.21% accuracy, validating its effectiveness on complex egocentric VQA tasks.
Link: https://arxiv.org/abs/2506.08553
Authors: Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo, Simone Mentasti, Matteo Matteucci
Institutions: Politecnico di Milano; EssilorLuxottica
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report for the HD-EPIC VQA Challenge 2025 (1st place)
Abstract:This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet’s external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework results in an overall accuracy of 44.21% on the challenge, highlighting its effectiveness for complex egocentric VQA tasks.
[CV-69] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs
Quick Read: This paper investigates how high-level representations form in deep neural networks, aiming to improve the interpretability and controllability of AI. The key to the solution is the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned directions originate in the input space and are selectively amplified with increasing depth, together with the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along a small set of dominant spectral directions.
Link: https://arxiv.org/abs/2506.08543
Authors: Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li
Institutions: University of Maryland; North Carolina State University; Rutgers University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2503.22720
Abstract:High-level representations have become a central focus in enhancing AI transparency and control, shifting attention from individual neurons or circuits to structured semantic directions that align with human-interpretable concepts. Motivated by the Linear Representation Hypothesis (LRH), we propose the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned directions originate in the input space and are selectively amplified with increasing depth. We then introduce the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along a small set of dominant spectral directions. Building on this framework, we further demonstrate the multimodal robustness of these representations in Vision-Language Models (VLMs). By bridging theoretical insights with empirical validation, this work advances a structured theory of representation formation in deep networks, paving the way for improving AI robustness, fairness, and transparency.
[CV-70] TrajFlow: Multi-modal Motion Prediction via Flow Matching
Quick Read: This paper targets efficient and accurate motion prediction for autonomous driving, especially the multi-modal forecasts required under dynamic real-world conditions. The key to the solution is TrajFlow, a flow-matching-based motion prediction framework that predicts multiple plausible future trajectories in a single inference pass, sharply reducing computational overhead while keeping predictions coherent. A ranking loss based on the Plackett-Luce distribution and a self-conditioning training technique further improve uncertainty estimation and generalization.
Link: https://arxiv.org/abs/2506.08541
Authors: Qi Yan, Brian Zhang, Yutong Zhang, Daniel Yang, Joshua White, Di Chen, Jiachao Liu, Langechuan Liu, Binnan Zhuang, Shaoshuai Shi, Renjie Liao
Institutions: University of British Columbia; Vector Institute for AI; University of Waterloo; Georgia Institute of Technology; Carnegie Mellon University; Tesla; XPeng Motors; Leapmotor; Nvidia; DiDi; Canada CIFAR AI Chair
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model’s own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website this https URL.
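The abstract mentions a ranking loss based on the Plackett-Luce distribution for scoring predicted trajectories. A minimal sketch of such a loss is below; the paper's exact parameterization is not given in the abstract, so the interface (one score per candidate trajectory, plus a target ranking) is an assumption.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor, ranking: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of an observed ranking under the
    Plackett-Luce model.

    scores:  (K,) real-valued scores for K candidate trajectories.
    ranking: (K,) indices ordered from best to worst (e.g., by
             distance to the ground-truth trajectory).
    """
    s = scores[ranking]  # scores in rank order
    # P(ranking) = prod_i exp(s_i) / sum_{j >= i} exp(s_j)
    rev_logcumsum = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return -(s - rev_logcumsum).sum()
```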
[CV-71] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4×RTX 4090s
Quick Read: This paper addresses the limited long-term temporal consistency and prohibitive computational cost of video super-resolution (VSR): existing methods struggle with temporal coherence on long videos and often require heavy GPU resources (e.g., more than 8 NVIDIA A100-80G GPUs). The key to the solution is LiftVSR, which introduces a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segments, and an Attention Memory Cache (AMC) for long-term temporal modeling across segments, balancing consistency and efficiency. An asymmetric sampling strategy further stabilizes the cache interaction during inference, yielding state-of-the-art results at much lower computational cost.
Link: https://arxiv.org/abs/2506.08529
Authors: Xijun Wang, Xin Li, Bingchen Li, Zhibo Chen
Institutions: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-α, achieving state-of-the-art results using only 4×RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segments (i.e., low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments (i.e., consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks have demonstrated that LiftVSR achieves impressive performance with significantly lower computational costs.
[CV-72] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer
Quick Read: This paper tackles visual localization in dynamic environments, where fluctuating illumination, adverse weather, and moving objects disrupt appearance cues and existing absolute pose regression methods struggle to stay consistent. The key to the solution is a framework that combines multi-scale feature learning with semantic scene understanding: a hierarchical Transformer with cross-scale attention fuses geometric detail and contextual cues, while semantic supervision via a neural scene representation during training guides the network toward view-invariant features that encode persistent structure and suppress complex environmental interference, improving localization robustness.
Link: https://arxiv.org/abs/2506.08526
Authors: Zhongtao Tian, Wenhao Huang, Zhidong Chen, Xiao Wei Sun
Institutions: Southern University of Science and Technology; Peng Cheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual localization remains challenging in dynamic environments where fluctuating lighting, adverse weather, and moving objects disrupt appearance cues. Despite advances in feature representation, current absolute pose regression methods struggle to maintain consistency under varying conditions. To address this challenge, we propose a framework that synergistically combines multi-scale feature learning with semantic scene understanding. Our approach employs a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual cues, preserving spatial precision while adapting to environmental changes. We improve the performance of this architecture with semantic supervision via neural scene representation during training, guiding the network to learn view-invariant features that encode persistent structural information while suppressing complex environmental interference. Experiments on TartanAir demonstrate that our approach outperforms existing pose regression methods in challenging scenarios with dynamic objects, illumination changes, and occlusions. Our findings show that integrating multi-scale processing with semantic guidance offers a promising strategy for robust visual localization in real-world dynamic environments.
[CV-73] FEDTAIL: Federated Long-Tailed Domain Generalization with Sharpness-Guided Gradient Matching (ICML'25)
Quick Read: This paper addresses the challenges domain generalization (DG) faces under long-tailed class distributions and conflicting optimization objectives, which degrade model performance. The key to the solution is FedTAIL, a federated domain generalization framework built on sharpness-guided, gradient-aligned optimization: a gradient coherence regularizer mitigates conflicts between classification and adversarial objectives, class-wise sharpness minimization with a curvature-aware dynamic weighting scheme counters class imbalance, and integrating sharpness-aware perturbations into entropy regularization improves robustness under domain shift.
Link: https://arxiv.org/abs/2506.08518
Authors: Sunny Gupta, Nikita Jangid, Shounak Das, Amit Sethi
Institutions: IIT Bombay
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICML 2025 Workshop on Collaborative and Federated Agentic Workflows CFAgentic @ ICML'25
Abstract:Domain Generalization (DG) seeks to train models that perform reliably on unseen target domains without access to target data during training. While recent progress in smoothing the loss landscape has improved generalization, existing methods often falter under long-tailed class distributions and conflicting optimization objectives. We introduce FedTAIL, a federated domain generalization framework that explicitly addresses these challenges through sharpness-guided, gradient-aligned optimization. Our method incorporates a gradient coherence regularizer to mitigate conflicts between classification and adversarial objectives, leading to more stable convergence. To combat class imbalance, we perform class-wise sharpness minimization and propose a curvature-aware dynamic weighting scheme that adaptively emphasizes underrepresented tail classes. Furthermore, we enhance conditional distribution alignment by integrating sharpness-aware perturbations into entropy regularization, improving robustness under domain shift. FedTAIL unifies optimization harmonization, class-aware regularization, and conditional alignment into a scalable, federated-compatible framework. Extensive evaluations across standard domain generalization benchmarks demonstrate that FedTAIL achieves state-of-the-art performance, particularly in the presence of domain shifts and label imbalance, validating its effectiveness in both centralized and federated settings. Code: this https URL
[CV-74] MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Quick Read: This paper addresses redundant attention and suboptimal multi-modal alignment in Video Temporal Grounding (VTG). The key to the solution is the MLVTG framework with two core modules: MambaAligner, which replaces Transformers with stacked Vision Mamba blocks to model temporal dependencies and extract robust video representations for multi-modal alignment, and LLMRefiner, which leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning.
Link: https://arxiv.org/abs/2506.08512
Authors: Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages the specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.
[CV-75] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization
Quick Read: This paper addresses temporal forgery localization (TFL) of small fake audio-visual clips embedded in real videos. Conventional work treats deepfake detection purely as a classification task and ignores the case where only parts of a video are tampered with, whereas TFL better matches realistic application scenarios. The key to the solution is a universal context-aware contrastive learning framework (UniCaCLF) that discovers and identifies forged instants via supervised contrastive learning: a novel context-aware perception layer combines a heterogeneous activation operation with an adaptive context updater to build a context-aware contrastive objective that sharpens the distinction between forged and genuine instant features by contrasting their distances to the global context, and an efficient context-aware contrastive coding further pushes their separability in a supervised sample-by-sample manner, suppressing cross-sample interference and improving TFL performance.
Link: https://arxiv.org/abs/2506.08493
Authors: Qilin Yin, Wei Lu, Xiangyang Luo, Xiaochun Cao
Institutions: Sun Yat-sen University; School of Computer Science and Engineering; MoE Key Laboratory of Information Technology; Guangdong Province Key Laboratory of Information Security Technology; State Key Laboratory of Mathematical Engineering and Advanced Computing; School of Cyber Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Most research efforts in the multimedia forensics domain have focused on detecting forgery audio-visual content and reached sound achievements. However, these works only consider deepfake detection as a classification task and ignore the case where partial segments of the video are tampered with. Temporal forgery localization (TFL) of small fake audio-visual clips embedded in real videos is still challenging and more in line with realistic application scenarios. To resolve this issue, we propose a universal context-aware contrastive learning framework (UniCaCLF) for TFL. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection, allowing for the precise localization of temporal forged segments. To this end, we propose a novel context-aware perception layer that utilizes a heterogeneous activation operation and an adaptive context updater to construct a context-aware contrastive objective, which enhances the discriminability of forged instant features by contrasting them with genuine instant features in terms of their distances to the global context. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants in a supervised sample-by-sample manner, suppressing the cross-sample influence to improve temporal forgery localization performance. Extensive experimental results over five public datasets demonstrate that our proposed UniCaCLF significantly outperforms the state-of-the-art competing algorithms.
[CV-76] MARMOT: Masked Autoencoder for Modeling Transient Imaging
Quick Read: This paper targets the reconstruction of hidden objects in non-line-of-sight (NLOS) imaging. Prior methods usually optimize volume densities or surfaces and fail to exploit priors learned from datasets. The key to the solution is MARMOT, a masked autoencoder for modeling transient imaging: a self-supervised model pretrained on massive, diverse NLOS transient datasets that uses a Transformer-based encoder-decoder to learn features from partially masked transients via a scanning pattern mask (SPM) and predict full measurements, improving downstream imaging tasks.
Link: https://arxiv.org/abs/2506.08470
Authors: Siyuan Shen, Ziheng Wang, Xingyue Peng, Suan Xia, Ruiqian Li, Shiying Li, Jingyi Yu
Institutions: ShanghaiTech University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pretrained models have demonstrated impressive success in many modalities such as language and vision. Recent works facilitate the pretraining paradigm in imaging research. Transients are a novel modality, which are captured for an object as photon counts versus arrival times using a precisely time-resolved sensor. In particular for non-line-of-sight (NLOS) scenarios, transients of hidden objects are measured beyond the sensor's direct line of sight. Using NLOS transients, the majority of previous works optimize volume density or surfaces to reconstruct the hidden objects and do not transfer priors learned from datasets. In this work, we present a masked autoencoder for modeling transient imaging, or MARMOT, to facilitate NLOS applications. Our MARMOT is a self-supervised model pretrained on massive and diverse NLOS transient datasets. Using a Transformer-based encoder-decoder, MARMOT learns features from partially masked transients via a scanning pattern mask (SPM), where the unmasked subset is functionally equivalent to arbitrary sampling, and predicts full measurements. Pretrained on TransVerse, a synthesized transient dataset of 500K 3D models, MARMOT adapts to downstream imaging tasks using direct feature transfer or decoder finetuning. Comprehensive experiments are carried out in comparison with state-of-the-art methods. Quantitative and qualitative results demonstrate the efficiency of our MARMOT.
[CV-77] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
Quick Read: This paper addresses the suppressed motion dynamics of image-to-video (I2V) generation: fine-tuning a pre-trained text-to-video (T2V) model for I2V often yields videos that are more static than their T2V counterparts. The key insight is that premature exposure to the input image's high-frequency details biases sampling toward a shortcut trajectory that overfits the reference image's static appearance. To fix this, the authors propose adaptive low-pass guidance (ALG), which adaptively modulates the frequency content of the conditioning image by applying low-pass filtering during the early denoising stage, improving temporal dynamics while preserving per-frame image quality and text alignment.
Link: https://arxiv.org/abs/2506.08456
Authors: June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee
Institutions: KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review. Project page available at this http URL
Abstract:Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos, while preserving image fidelity and text alignment. In particular, under the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
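ALG's core operation, low-pass filtering the conditioning image early in denoising and relaxing the filter over steps, can be sketched directly. The FFT-mask filter and the linear cutoff schedule below are illustrative assumptions; the paper's actual filter and schedule may differ.

```python
import torch

def low_pass(image: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Keep only spatial frequencies below `cutoff` (0..1, fraction of
    Nyquist) of a (C, H, W) image via an FFT mask."""
    _, H, W = image.shape
    fy = torch.fft.fftfreq(H).abs()  # (H,) frequency magnitudes
    fx = torch.fft.fftfreq(W).abs()  # (W,)
    mask = (fy[:, None] ** 2 + fx[None, :] ** 2).sqrt() <= cutoff * 0.5
    spec = torch.fft.fft2(image) * mask
    return torch.fft.ifft2(spec).real

def alg_condition(image, step, n_steps, c0=0.05):
    """Hypothetical schedule: heavy low-pass early in denoising, full
    detail later, so motion is decided before texture locks in."""
    t = step / max(n_steps - 1, 1)
    cutoff = c0 + (1.0 - c0) * t  # linearly relax the filter
    return low_pass(image, cutoff)
```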
[CV-78] SakugaFlow: A Stagewise Illustration Framework Emulating the Human Drawing Process and Providing Interactive Tutoring for Novice Drawing Skills
Quick Read: This paper addresses the opacity of current generative AI illustration tools, which produce high-quality images from prompts but rarely reveal the step-by-step procedure human artists follow. The key to the solution is SakugaFlow, a four-stage pipeline that pairs diffusion-based image generation with a large-language-model tutor: novices receive real-time feedback on anatomy, perspective, and composition, can revise any step non-linearly, and can branch alternative versions, turning a black-box generator into a scaffolded learning environment.
Link: https://arxiv.org/abs/2506.08443
Authors: Kazuki Kawamura, Jun Rekimoto
Institutions: Sony Computer Science Laboratories, Inc.; The University of Tokyo
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure; accepted as a paper to the Generative AI and HCI (GenAICHI) workshop at CHI 2025 (Yokohama, 27 Apr 2025)
Abstract:While current AI illustration tools can generate high-quality images from text prompts, they rarely reveal the step-by-step procedure that human artists follow. We present SakugaFlow, a four-stage pipeline that pairs diffusion-based image generation with a large-language-model tutor. At each stage, novices receive real-time feedback on anatomy, perspective, and composition, revise any step non-linearly, and branch alternative versions. By exposing intermediate outputs and embedding pedagogical dialogue, SakugaFlow turns a black-box generator into a scaffolded learning environment that supports both creative exploration and skills acquisition.
[CV-79] Boosting Gradient Leakage Attacks: Data Reconstruction in Realistic FL Settings (USENIX Security 2025)
Quick Read: This paper examines privacy leakage in federated learning (FL), specifically whether gradient leakage attacks (GLAs) remain effective in realistic FL environments. Some prior work argues GLAs only succeed under overly relaxed conditions and therefore pose little practical risk; this paper shows empirically that clients' data can still be effectively reconstructed in realistic settings, exposing a serious vulnerability in FL systems. The key to the solution is FedLeak, which introduces two novel techniques, partial gradient matching and gradient regularization, to overcome the gradient matching bottleneck of earlier GLAs.
Link: https://arxiv.org/abs/2506.08435
Authors: Mingyuan Fan, Fuyi Wang, Cen Chen, Jianying Zhou
Institutions: East China Normal University; RMIT University; Singapore University of Technology and Design
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Usenix Security 2025
Abstract:Federated learning (FL) enables collaborative model training among multiple clients without the need to expose raw data. Its ability to safeguard privacy, at the heart of FL, has recently been a hot-button debate topic. To elaborate, several studies have introduced a type of attacks known as gradient leakage attacks (GLAs), which exploit the gradients shared during training to reconstruct clients' raw data. On the flip side, some literature, however, contends no substantial privacy risk in practical FL environments due to the effectiveness of such GLAs being limited to overly relaxed conditions, such as small batch sizes and knowledge of clients' data distributions. This paper bridges this critical gap by empirically demonstrating that clients' data can still be effectively reconstructed, even within realistic FL environments. Upon revisiting GLAs, we recognize that their performance failures stem from their inability to handle the gradient matching problem. To alleviate the performance bottlenecks identified above, we develop FedLeak, which introduces two novel techniques, partial gradient matching and gradient regularization. Moreover, to evaluate the performance of FedLeak in real-world FL environments, we formulate a practical evaluation protocol grounded in a thorough review of extensive FL literature and industry practices. Under this protocol, FedLeak can still achieve high-fidelity data reconstruction, thereby underscoring the significant vulnerability in FL systems and the urgent need for more effective defense methods.
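To make the gradient-matching problem concrete, here is a generic gradient-leakage reconstruction loop with partial matching over a chosen layer subset. This is a textbook-style sketch under assumed conditions (known labels, a 10-class classifier, cosine matching with total-variation regularization), not FedLeak's actual losses or protocol.

```python
import torch
import torch.nn.functional as F

def reconstruct(model, true_grads, layer_subset, shape, steps=2000, lr=0.1):
    """Toy gradient-leakage loop: optimize a dummy batch so that its
    gradients match the leaked ones on a *subset* of layers (partial
    matching), with total-variation regularization on the image.
    `true_grads` maps parameter names to the observed gradients."""
    x = torch.randn(shape, requires_grad=True)
    y = torch.randint(0, 10, (shape[0],))  # assume labels are known/inferred
    names = [n for n, _ in model.named_parameters()]
    params = [p for _, p in model.named_parameters()]
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        match = sum(
            1 - F.cosine_similarity(g.flatten(), true_grads[n].flatten(), dim=0)
            for n, g in zip(names, grads) if n in layer_subset
        )
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() \
           + (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        (match + 0.01 * tv).backward()  # differentiate through the gradients
        opt.step()
    return x.detach()
```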
[CV-80] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring
Quick Read: This paper addresses two key problems in training vision-language models (VLMs): noisy image-text alignments that cause misinterpretation, and ambiguous or misleading text that obscures visual content. The key to the solution is SCALE (Single modality data quality and Cross modality Alignment Evaluation), a quality-driven data selection pipeline for VLM instruction-tuning datasets: a cross-modality assessment framework first assigns each data entry to its appropriate vision-language task, generates general and task-specific captions (covering scenes, objects, style, etc.), and then scores each entry's alignment, clarity, task rarity, text coherence, and image clarity based on the generated captions, improving data quality and the model's visual understanding.
Link: https://arxiv.org/abs/2506.08429
Authors: Mingjie Xu, Andrew Estornell, Hongzheng Yang, Yuzhi Zhao, Zhaowei Zhu, Qi Xuan, Jiaheng Wei
Institutions: The Hong Kong University of Science and Technology (Guangzhou); ByteDance Seed; The Chinese University of Hong Kong; City University of Hong Kong; BIAI-ZJUT; Zhejiang University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The application of visual instruction tuning and other post-training techniques has significantly enhanced the capabilities of Large Language Models (LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with more comprehensive visual language datasets. However, the effectiveness of VLMs is highly dependent on large-scale, high-quality datasets that ensure precise recognition and accurate reasoning. Two key challenges hinder progress: (1) noisy alignments between images and the corresponding text, which leads to misinterpretation, and (2) ambiguous or misleading text, which obscures visual content. To address these challenges, we propose SCALE (Single modality data quality and Cross modality Alignment Evaluation), a novel quality-driven data selection pipeline for VLM instruction tuning datasets. Specifically, SCALE integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task, generates general and task-specific captions (covering scenes, objects, style, etc.), and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry based on the generated captions. We reveal that: (1) current unimodal quality assessment methods evaluate one modality while overlooking the rest, which can underestimate samples essential for specific tasks and discard the lower-quality instances that help build model robustness; and (2) appropriately generated image captions provide an efficient way to transfer the image-text multimodal task into a unified text modality.
[CV-81] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation
Quick Read: This paper addresses the difficulty of constructing dense radio maps from the limited number of samples measurable in practical scenarios. Existing deep learning methods estimate dense maps from sparse samples but struggle to incorporate the physical characteristics of the radio map. The key to the solution is casting radio map estimation as a sparse signal recovery problem and incorporating a physical propagation model to decompose it into multiple factor-wise optimization sub-problems, reducing recovery complexity. The proposed Radio Deep Unfolding Network (RadioDUN) unfolds the optimization process to achieve learnable adaptive parameter adjustment and prior fitting, adds a dynamic reweighting module (DRM) to adaptively model the importance of each factor, and, inspired by the shadowing factor of the propagation model, introduces a shadowing loss as a supplementary supervised objective, improving performance.
Link: https://arxiv.org/abs/2506.08418
Authors: Taiqin Chen, Zikun Zhou, Zheng Fang, Wenzhen Zou, Kanjun Liu, Ke Chen, Yongbing Zhang, Yaowei Wang
Institutions: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Southern University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:The radio map represents the spatial distribution of spectrum resources within a region, supporting efficient resource allocation and interference mitigation. However, it is difficult to construct a dense radio map as a limited number of samples can be measured in practical scenarios. While existing works have used deep learning to estimate dense radio maps from sparse samples, they are hard to integrate with the physical characteristics of the radio map. To address this challenge, we cast radio map estimation as the sparse signal recovery problem. A physical propagation model is further incorporated to decompose the problem into multiple factor optimization sub-problems, thereby reducing recovery complexity. Inspired by the existing compressive sensing methods, we propose the Radio Deep Unfolding Network (RadioDUN) to unfold the optimization process, achieving adaptive parameter adjusting and prior fitting in a learnable manner. To account for the radio propagation characteristics, we develop a dynamic reweighting module (DRM) to adaptively model the importance of each factor for the radio map. Inspired by the shadowing factor in the physical propagation model, we integrate obstacle-related factors to express the obstacle-induced signal stochastic decay. The shadowing loss is further designed to constrain the factor prediction and act as a supplementary supervised objective, which enhances the performance of RadioDUN. Extensive experiments have been conducted to demonstrate that the proposed method outperforms the state-of-the-art methods. Our code will be made publicly available upon publication.
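RadioDUN builds on the deep-unfolding pattern for sparse recovery. The skeleton below unfolds plain ISTA with learnable per-stage step sizes and thresholds; the propagation-model factors, dynamic reweighting module, and shadowing loss from the paper are all omitted, and the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class UnfoldedISTA(nn.Module):
    """Generic deep-unfolding sketch for sparse recovery: each stage is
    one proximal-gradient step with a learnable step size and threshold.
    A is the (M, N) measurement operator (sparse samples -> radio map)."""
    def __init__(self, A: torch.Tensor, n_stages: int = 8):
        super().__init__()
        self.register_buffer("A", A)
        self.step = nn.Parameter(torch.full((n_stages,), 0.1))
        self.thresh = nn.Parameter(torch.full((n_stages,), 0.01))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        x = self.A.t() @ y  # (N,) initial back-projection
        for s, lam in zip(self.step, self.thresh):
            grad = self.A.t() @ (self.A @ x - y)
            z = x - s * grad  # gradient step on ||Ax - y||^2
            x = torch.sign(z) * torch.clamp(z.abs() - lam, min=0)  # soft-threshold
        return x
```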
[CV-82] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
Quick Read: This paper targets object hallucination in vision-language models (VLMs), which severely undermines the accuracy of visual understanding. The key to the solution is SECOND: Selective and Contrastive Decoding, which lets VLMs exploit multi-scale visual information in an object-centric manner that is closer to human visual perception. SECOND progressively selects and integrates multi-scale visual information and contrasts it iteratively, significantly reducing perceptual hallucinations and improving performance.
Link: https://arxiv.org/abs/2506.08391
Authors: Woohyeon Park, Woojin Kim, Jaeik Kim, Jaeyoung Do
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information with an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By contrasting these visual information iteratively, SECOND significantly reduces perceptual hallucinations and outperforms a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
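The abstract's "contrasting iteratively" step resembles standard contrastive decoding. A single-step sketch under that assumption: amplify logits supported by a fine-scale view over a coarse view, with an adaptive plausibility cutoff. Here alpha and tau are assumed hyper-parameters, and SECOND's actual selection and iteration logic is richer.

```python
import torch

def contrastive_logits(logits_fine, logits_coarse, alpha=1.0, tau=0.1):
    """Scale-contrastive decoding sketch: boost what the fine-scale view
    supports over the coarse view, masking out tokens the fine view
    itself finds implausible (an adaptive plausibility constraint)."""
    contrast = (1 + alpha) * logits_fine - alpha * logits_coarse
    p_fine = logits_fine.softmax(dim=-1)
    cutoff = tau * p_fine.max(dim=-1, keepdim=True).values
    return contrast.masked_fill(p_fine < cutoff, float("-inf"))
```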
[CV-83] Image Demoiréing Using Dual Camera Fusion on Mobile Phones (ICME 2025)
Quick Read: This paper addresses the severe image-quality degradation caused by moiré patterns when photographing electronic screens, and in particular the difficulty of removing large, heavy moiré. The key to the solution is Dual Camera fusion for Image Demoiréing (DCID), which uses the ultra-wide-angle (UW) image to assist moiré removal in the wide-angle (W) image, exploiting the dual lenses common on modern smartphones and the fact that, thanks to the different focal lengths, the UW image usually retains normal colors and textures when moiré appears in the W image.
Link: https://arxiv.org/abs/2506.08361
Authors: Yanting Mei, Zhilu Zhang, Xiaohe Wu, Wangmeng Zuo
Institutions: Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2025
Abstract:When shooting electronic screens, moiré patterns usually appear in captured images, which seriously affects the image quality. Existing image demoiréing methods face great challenges in removing large and heavy moiré. To address the issue, we propose to utilize Dual Camera fusion for Image Demoiréing (DCID), i.e., using the ultra-wide-angle (UW) image to assist the moiré removal of wide-angle (W) image. This is inspired by two motivations: (1) the two lenses are commonly equipped with modern smartphones, (2) the UW image generally can provide normal colors and textures when moiré exists in the W image mainly due to their different focal lengths. In particular, we propose an efficient DCID method, where a lightweight UW image encoder is integrated into an existing demoiréing network and a fast two-stage image alignment manner is present. Moreover, we construct a large-scale real-world dataset with diverse mobile phones and monitors, containing about 9,000 samples. Experiments on the dataset show our method performs better than state-of-the-art methods. Code and dataset are available at this https URL.
[CV-84] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding
Quick Read: This paper addresses the fact that existing medical vision-language frameworks apply a uniform strategy for local feature extraction, ignoring the specific demands of different imaging modalities. The key to the solution is MedMoE, whose Mixture-of-Experts (MoE) module, conditioned on the report type, dynamically adapts visual representations by routing multi-scale image features to expert branches trained to capture modality-specific visual semantics, producing localized visual representations aligned with textual descriptions.
Link: https://arxiv.org/abs/2506.08356
Authors: Shivang Chopra, Lingchao Mao, Gabriela Sanchez-Rodriguez, Andrew J Feola, Jing Li, Zsolt Kira
Institutions: Georgia Institute of Technology; Emory University; Joseph M Cleland Atlanta VAMC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking the modality-specific demands. In this work, we present MedMoE, a modular and extensible vision-language processing framework that dynamically adapts visual representation based on the diagnostic context. MedMoE incorporates a Mixture-of-Experts (MoE) module conditioned on the report type, which routes multi-scale image features through specialized expert branches trained to capture modality-specific visual semantics. These experts operate over feature pyramids derived from a Swin Transformer backbone, enabling spatially adaptive attention to clinically relevant regions. This framework produces localized visual representations aligned with textual descriptions, without requiring modality-specific supervision at inference. Empirical results on diverse medical benchmarks demonstrate that MedMoE improves alignment and retrieval performance across imaging modalities, underscoring the value of modality-specialized visual representations in clinical vision-language systems.
[CV-85] An Adaptive Method Stabilizing Activations for Enhanced Generalization
Quick Read: This paper addresses the insufficient stability of neuron outputs during deep model training and the difficulty conventional activation regularization methods have in balancing convergence speed against generalization. The key to the solution is the AdaAct algorithm, which adjusts learning rates according to activation variance, introducing neuron-wise adaptivity that stabilizes neuron outputs and in turn yields better generalization.
Link: https://arxiv.org/abs/2506.08353
Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Institutions: University of Georgia; Hanyang University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization – a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct’s competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at this https URL.
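A minimal reading of the idea, adjusting a per-neuron step size by the variance of that neuron's output activations, applied to one linear layer. AdaAct's published update rule likely differs in detail, so every constant and name here is an assumption.

```python
import torch

@torch.no_grad()
def adaact_step(weight, grad, act_out, lr=1e-2, eps=1e-8):
    """Sketch of a neuron-wise, activation-variance-adaptive update:
    neurons whose outputs fluctuate strongly across the batch receive
    smaller steps, stabilizing their outputs.

    weight: (out, in) layer weight; grad: same shape;
    act_out: (batch, out) the layer's output activations.
    """
    var = act_out.var(dim=0, unbiased=False)  # per-neuron output variance
    scale = 1.0 / (var.sqrt() + eps)          # damp high-variance neurons
    weight -= lr * grad * scale.unsqueeze(1)  # broadcast over input dims
```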
[CV-86] Complex-Valued Holographic Radiance Fields
Quick Read: This paper addresses modeling the full properties of light, including both amplitude and phase, in 3D representations to advance physically plausible rendering, particularly for holographic displays. The key to the solution is a new representation that reformulates 3D Gaussian splatting with complex-valued Gaussian primitives and directly optimizes complex Gaussians as a holographic scene representation, without intensity-based intermediaries, yielding large gains in computational efficiency.
Link: https://arxiv.org/abs/2506.08350
Authors: Yicheng Zhan, Dong-Ha Shin, Seung-Hwan Baek, Kaan Akşit
Institutions: University College London; Pohang University of Science and Technology (POSTECH)
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: 28 pages, 21 figures
Abstract:Modeling the full properties of light, including both amplitude and phase, in 3D representations is crucial for advancing physically plausible rendering, particularly in holographic displays. To support these features, we propose a novel representation that optimizes 3D scenes without relying on intensity-based intermediaries. We reformulate 3D Gaussian splatting with complex-valued Gaussian primitives, expanding support for rendering with light waves. By leveraging RGBD multi-view images, our method directly optimizes complex-valued Gaussians as a 3D holographic scene representation. This eliminates the need for computationally expensive hologram re-optimization. Compared with state-of-the-art methods, our method achieves 30x-10,000x speed improvements while maintaining on-par image quality, representing a first step towards geometrically aligned, physically plausible holographic scene representations.
[CV-87] Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
Quick Read: This paper addresses reconstructing articulated objects from casually captured RGBD videos; traditional methods need carefully captured data for training or inference, limiting scalability and generalization in practice. The key to the solution is a coarse-to-fine framework that infers joint parameters and segments the movable parts of an object from a dynamic RGBD video, enabling effective reconstruction across object categories.
Link: https://arxiv.org/abs/2506.08334
Authors: Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
Institutions: Simon Fraser University; Shanghai Jiao Tong University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website can be found at this https URL
Abstract:Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20× larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.
[CV-88] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera
Quick Read: This paper addresses how to locate, efficiently and accurately, where the ball strikes the racket at impact in racket sports such as tennis, to support personalized equipment design and performance analysis. Traditional pipelines rely on high-speed cameras and manual digitization, which are memory-hungry, time-consuming, and error-prone. The key to the solution is real-time impact localization with an event camera, which records brightness changes (events) with microsecond precision, low memory consumption, and the ability to monitor continuously under high-speed motion. The method comprises three identification steps, the swing's time range, the timing of impact, and the contours of ball and racket, combining conventional computer vision with an original event-based measure (PATS: the amount of polarity asymmetry in time symmetry) for efficient and precise real-time localization.
Link: https://arxiv.org/abs/2506.08327
Authors: Yuto Kase, Kai Ishibe, Ryoma Yasuda, Yudai Washida, Sakiko Hashimoto
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures, 3 tables
Abstract:In racket sports, such as tennis, locating the ball's position at impact is important in clarifying player and equipment characteristics, thereby aiding in personalized equipment design. High-speed cameras are used to measure the impact location; however, their excessive memory consumption limits prolonged scene capture, and manual digitization for position detection is time-consuming and prone to human error. These limitations make it difficult to effectively capture the entire playing scene, hindering the ability to analyze the player's performance. We propose a method for locating the tennis ball impact on the racket in real time using an event camera. Event cameras efficiently measure brightness changes (called 'events') with microsecond accuracy under high-speed motion while using lower memory consumption. These cameras enable users to continuously monitor their performance over extended periods. Our method consists of three identification steps: time range of swing, timing at impact, and contours of ball and racket. Conventional computer vision techniques are utilized along with an original event-based processing to detect the timing at impact (PATS: the amount of polarity asymmetry in time symmetry). The results of the experiments were within the permissible range for measuring tennis players' performance. Moreover, the computation time was sufficiently short for real-time applications.
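The abstract only names PATS (the amount of polarity asymmetry in time symmetry) without defining it, so the following is one plausible reading, assumed rather than taken from the paper: compare the net event polarity in symmetric windows just before and just after a candidate impact time, since a sharp impact flips the brightness-change pattern.

```python
import numpy as np

def pats(t: np.ndarray, p: np.ndarray, t0: float, w: float) -> float:
    """Assumed 'polarity asymmetry in time symmetry' score around a
    candidate impact time t0: how much the net polarity in the window
    before t0 disagrees with the window after it.

    t: event timestamps (s); p: polarities in {-1, +1}; w: half-window (s).
    """
    before = p[(t >= t0 - w) & (t < t0)].sum()
    after = p[(t > t0) & (t <= t0 + w)].sum()
    return abs(before - after) / (abs(before) + abs(after) + 1e-8)
```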
[CV-89] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating
Quick Read: This paper addresses the challenges of hyperspectral image (HSI) classification, including high-dimensional data, sparsely distributed ground objects, and spectral redundancy, which often cause classification overfitting and limited generalization. The key to the solution is the STNet architecture, whose core advantage comes from the dual innovative design of its Spatial-Spectral Transformer module: first, explicit decoupling of spatial and spectral attention ensures targeted capture of the key information in HSI; second, two functionally distinct gating mechanisms provide intelligent regulation both at the attention-flow fusion level (adaptive attention fusion gating) and inside the feature transformation (GFFN), improving feature extraction and fusion while reducing overfitting risk in small-sample, high-noise scenarios.
Link: https://arxiv.org/abs/2506.08324
Authors: Guandong Li, Mengxia Ye
Institutions: iFLYTEK; Aegon THTF
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2504.15155, arXiv:2504.13045, arXiv:2503.23472
Abstract:Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more effectively extract and fuse spatial context with fine spectral information in hyperspectral image (HSI) classification, this paper proposes a novel network architecture called STNet. The core advantage of STNet stems from the dual innovative design of its Spatial-Spectral Transformer module: first, the fundamental explicit decoupling of spatial and spectral attention ensures targeted capture of key information in HSI; second, two functionally distinct gating mechanisms perform intelligent regulation at both the fusion level of attention flows (adaptive attention fusion gating) and the internal level of feature transformation (GFFN). This characteristic demonstrates superior feature extraction and fusion capabilities compared to traditional convolutional neural networks, while reducing overfitting risks in small-sample and high-noise scenarios. STNet enhances model representation capability without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
[CV-90] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal
Quick Read: This paper addresses the limitation that existing reflection removal techniques are held back by the lack of high-quality real-world datasets. The key to the solution is a new data collection paradigm that is convenient, low-cost, and scalable while ensuring the collected pairs are high-quality, precisely aligned, and representative of natural, diverse scenes. Following this paradigm, the authors build OpenRR-1k, a dataset of 1,000 high-quality transmission-reflection image pairs collected in the wild, for improving the robustness of reflection removal methods in challenging real-world environments.
Link: https://arxiv.org/abs/2506.08299
Authors: Kangning Yang, Ling Ouyang, Huiming Sun, Jie Cai, Lan Fu, Jiaming Ding, Chiu Man Ho, Zibo Meng
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reflection removal technology plays a crucial role in photography and computer vision applications. However, existing techniques are hindered by the lack of high-quality in-the-wild datasets. In this paper, we propose a novel paradigm for collecting reflection datasets from a fresh perspective. Our approach is convenient, cost-effective, and scalable, while ensuring that the collected data pairs are of high quality, perfectly aligned, and represent natural and diverse scenarios. Following this paradigm, we collect a Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which contains 1,000 high-quality transmission-reflection image pairs collected in the wild. Through the analysis of several reflection removal methods and benchmark evaluation experiments on our dataset, we demonstrate its effectiveness in improving robustness in challenging real-world environments. Our dataset is available at this https URL.
[CV-91] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging
Quick Read: This paper addresses the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus. The key to the solution is a mathematical definition of generalized attention that unifies standard softmax attention and linear attention, together with a proof that generalized attention disperses: as the number of keys tends to infinity, a query assigns equal weight to all keys. Motivated by this dispersion property and by Mamba-style attention, the authors design Scalable and Efficient Mamba-like Attention (SEMA), which uses token localization to avoid dispersion and maintain focus, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention.
Link: https://arxiv.org/abs/2506.08297
Authors: Nhat Thanh Tran, Fanghui Xue, Shuai Zhang, Jiancheng Lyu, Yunling Zheng, Yingyong Qi, Jack Xin
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures
Abstract:Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and recent development of Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA) which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. We support our approach on Imagenet-1k where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger scales of images at similar model parameter sizes.
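SEMA's two ingredients, token localization to keep attention focused and arithmetic averaging to retain a global view, can be caricatured as below. The banded window mask and the plain sum of the two branches are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sema_attention(q, k, v, window: int):
    """Sketch of localized attention plus a global arithmetic mean:
    each query attends softmax-style only within a local token window
    (avoiding dispersion), while a mean over all values restores a
    global summary. q, k, v: (N, d)."""
    N, d = q.shape
    scores = (q @ k.t()) / d ** 0.5  # (N, N) similarity
    idx = torch.arange(N)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    scores = scores.masked_fill(~local, float("-inf"))
    out_local = F.softmax(scores, dim=-1) @ v  # focused, local branch
    out_global = v.mean(dim=0, keepdim=True)   # the dispersion limit
    return out_local + out_global              # combine both views
```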
[CV-92] Seeing Voices: Generating A-Roll Video from Audio with Mirage
Quick Read: This paper addresses the joint generation of audio and visual content in video, i.e., how to make what is heard and what is seen cohere. Existing approaches either ignore sound and focus on general-purpose but silent image-sequence generation, or handle both modalities only in restricted domains such as re-dubbing. The key to the solution is Mirage, a self-attention-based audio-to-video foundation model, trained with a unified method that works from scratch or from existing weights; this lets Mirage remain general as an audio-to-video approach while producing outputs of superior subjective quality.
Link: https://arxiv.org/abs/2506.08279
Authors: Aditi Sundararaman, Amogh Adishesha, Andrew Jaegle, Dan Bigioi, Hyoung-Kyu Song, Jon Kyl, Justin Mao, Kevin Lan, Mojtaba Komeili, ShahRukh Athar, Sheila Babayan, Stanislau Beliasau, William Buchwalter
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical report website: this http URL, product website: this http URL
Abstract:From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video’s audio track) with what we see (the video’s image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or given existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of superior subjective quality to methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).
[CV-93] Surgeons Awareness Expectations and Involvement with Artificial Intelligence: a Survey Pre and Post the GPT Era
Quick Read: This paper explores how surgeons' awareness, expectations, and involvement with artificial intelligence (AI) have shifted, particularly given the rise of generative AI and its potential in surgery. Comparing global cross-sectional surveys from 2021 and 2024, it analyzes familiarity with AI courses, expectations of AI's role, and ethical considerations. Although awareness of and participation in AI courses rose, grasp of foundational AI concepts remains limited and infrastructure constraints remain the main obstacle. The key takeaway is that interdisciplinary collaboration and structured training are critical for effective AI integration in surgery, with education, ethical frameworks, and infrastructure development needed to close knowledge gaps and implementation challenges.
Link: https://arxiv.org/abs/2506.08258
Authors: Lorenzo Arboit, Dennis N. Schneider, Toby Collins, Daniel A. Hashimoto, Silvana Perretta, Bernard Dallemagne, Jacques Marescaux, EAES Working Group, Nicolas Padoy, Pietro Mascagni
Institutions: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IRCAD, Strasbourg, France; University of Pennsylvania Perelman School of Medicine, Department of Surgery, Philadelphia, PA, USA; Institute of Image-Guided Surgery, Strasbourg, France; The University Hospitals of Strasbourg, Strasbourg, France; Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
Subjects: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures
Abstract:Artificial Intelligence (AI) is transforming medicine, with generative AI models like ChatGPT reshaping perceptions of its potential. This study examines surgeons’ awareness, expectations, and involvement with AI in surgery through comparative surveys conducted in 2021 and 2024. Two cross-sectional surveys were distributed globally in 2021 and 2024, the first before an IRCAD webinar and the second during the annual EAES meeting. The surveys assessed demographics, AI awareness, expectations, involvement, and ethics (2024 only). The surveys collected a total of 671 responses from 98 countries, 522 in 2021 and 149 in 2024. Awareness of AI courses rose from 14.5% in 2021 to 44.6% in 2024, while course attendance increased from 12.9% to 23%. Despite this, familiarity with foundational AI concepts remained limited. Expectations for AI’s role shifted in 2024, with hospital management gaining relevance. Ethical concerns gained prominence, with 87.2% of 2024 participants emphasizing accountability and transparency. Infrastructure limitations remained the primary obstacle to implementation. Interdisciplinary collaboration and structured training were identified as critical for successful AI adoption. Optimism about AI’s transformative potential remained high, with 79.9% of respondents believing AI would positively impact surgery and 96.6% willing to integrate AI into their clinical practice. Surgeons’ perceptions of AI are evolving, driven by the rise of generative AI and advancements in surgical data science. While enthusiasm for integration is strong, knowledge gaps and infrastructural challenges persist. Addressing these through education, ethical frameworks, and infrastructure development is essential.
[CV-94] Highly Compressed Tokenizer Can Generate Without Training
Quick Read: This paper addresses the limited expressiveness of the 2D token grids produced by conventional image tokenizers for editing and generation, and the usual need to train heavyweight generative models for high-quality synthesis. The key to the solution is a 1D image tokenizer with vector quantization that compresses an image into a one-dimensional sequence of only 32 discrete tokens, enabling image editing and generation through heuristic token manipulation, plus gradient-based test-time optimization of tokens with plug-and-play losses (such as reconstruction or CLIP similarity) to further improve generation.
Link: https://arxiv.org/abs/2506.08257
Authors: L. Lao Beyer, T. Li, X. Chen, S. Karaman, K. He
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Main manuscript: 9 pages, 7 figures. Appendix: 8 pages, 9 figures. To appear in the Proceedings of the 42nd International Conference on Machine Learning
Abstract:Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations – such as copying and replacing tokens between latent representations of images – enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer’s latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
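The "copying and replacing tokens between latent representations" manipulation the abstract describes is almost literal code. A sketch, assuming the tokenizer encodes an image to 32 integer ids and exposes a decode function (both names hypothetical):

```python
import torch

def token_swap_edit(tok_src: torch.Tensor, tok_ref: torch.Tensor,
                    positions: list[int]) -> torch.Tensor:
    """Crude editing in a 1D tokenizer's latent space: copy a few of
    the 32 discrete tokens from a reference image's sequence into the
    source sequence. With a highly compressed code, even this transfers
    appearance and semantic attributes.

    tok_src, tok_ref: (32,) integer token ids from tokenizer.encode(...).
    """
    edited = tok_src.clone()
    for i in positions:          # token positions to overwrite
        edited[i] = tok_ref[i]
    return edited                # then render via tokenizer.decode(edited)
```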
[CV-95] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
Quick Read: This paper investigates the inherent biases in benchmarks used to evaluate the compositional understanding of vision-language models (VLMs). It finds that existing benchmarks (e.g., SugarCREPE, VALSE) carry design flaws in their data sources and curation procedures, producing distribution asymmetries between positive and negative samples, so that blind heuristics (e.g., token length, log-likelihood under a language model) match CLIP-level performance, meaning these benchmarks do not effectively measure compositional understanding. The key contribution is identifying and mitigating these construction-induced asymmetries, with recommendations for building more robust vision-language compositionality benchmarks.
Link: https://arxiv.org/abs/2506.08227
Authors: Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge
Institutions: University of Tübingen and Tübingen AI Center; Juelich Supercomputing Centre (JSC); University of Cambridge
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks, that would be less prone to such simple attacks.
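One of the blind heuristics the paper tests, preferring the longer caption regardless of the image, fits in a few lines; the tie-break direction and whitespace tokenization below are assumptions.

```python
def length_heuristic(pos_caption: str, neg_caption: str) -> bool:
    """Blind baseline: pick the caption with more tokens, never looking
    at the image. If a benchmark's negatives are systematically shorter
    (or longer), this 'solves' it without any vision model."""
    return len(pos_caption.split()) >= len(neg_caption.split())

# Usage over a list of (positive, negative) caption pairs:
# acc = sum(length_heuristic(p, n) for p, n in pairs) / len(pairs)
```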
zh
[CV-96] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence
【速读】:该论文试图解决语义对应(Semantic Correspondence, SC)中监督方法在稀疏标注训练关键点之外泛化能力不足的问题,这些方法本质上更像是关键点检测器。解决方案的关键在于通过单目深度估计将2D关键点提升到规范的3D空间,从而构建一个无需显式3D监督或相机标注的连续规范流形,以捕捉物体几何结构。
链接: https://arxiv.org/abs/2506.08220
作者: Octave Mariotti,Zhipeng Du,Yash Bhalgat,Oisin Mac Aodha,Hakan Bilen
机构: University of Edinburgh (爱丁堡大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments not only demonstrate that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but also that unsupervised baselines outperform supervised counterparts when generalizing across different datasets.
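For intuition on the lifting step, the snippet below back-projects 2D keypoints into camera space with a standard pinhole model. Note that the paper avoids camera annotations, so the nominal intrinsics (`fx`, `fy`, `cx`, `cy`) used here are an assumption for illustration:

```python
import numpy as np

def lift_keypoints(kps_2d, depth_map, fx, fy, cx, cy):
    """Back-project 2D keypoints (N, 2) in pixel coordinates into camera-space
    3D points using a per-pixel monocular depth estimate (pinhole model)."""
    us, vs = kps_2d[:, 0], kps_2d[:, 1]
    z = depth_map[vs.astype(int), us.astype(int)]  # depth at each keypoint
    x = (us - cx) / fx * z
    y = (vs - cy) / fy * z
    return np.stack([x, y, z], axis=-1)            # (N, 3)
```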
zh
[CV-97] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation
【速读】:该论文试图解决在利用高分辨率雷达卫星图像进行湿地表面积远程监测时,传统模型依赖大量手动标注数据导致的耗时和成本高昂的问题。解决方案的关键在于采用深度聚类与负采样相结合的方法,在无需任何人工标注的情况下训练模型,以实现对雷达卫星图像中水体与陆地区域的分割。此外,通过实现模型的集成版本,进一步降低了方差并提升了性能。
链接: https://arxiv.org/abs/2506.08214
作者: Ioannis Iakovidis,Zahra Kalantari,Amir Hossein Payberah,Fernando Jaramillo,Francisco Pena Escobar
机构: KTH; Stockholm University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures
点击查看摘要
Abstract:In recent years the wide availability of high-resolution radar satellite images along with the advancement of computer vision models have enabled the remote monitoring of the surface area of wetlands. However, these models require large amounts of manually annotated satellite images, which are slow and expensive to produce. To overcome this problem, self-supervised training methods have been deployed to train models without using annotated data. In this paper we use a combination of deep clustering and negative sampling to train a model to segment radar satellite images into areas that separate water from land without the use of any manual annotations. Furthermore, we implement an ensemble version of the model to reduce variance and improve performance. Compared to a single fully-supervised model using the same architecture, our ensemble of self-supervised models achieves a 0.02 improvement in the Intersection Over Union metric over our test dataset.
zh
[CV-98] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
【速读】:该论文试图解决当前单目3D重建方法和视觉-语言模型(Vision-Language Models, VLMs)在几何推理能力方面的不足,尤其是在理解复杂几何属性和结构上的局限性。其解决方案的关键在于提出GIQ,一个专门用于评估视觉和视觉-语言基础模型几何推理能力的综合性基准,涵盖多种多样的多面体,包括柏拉图立体、阿基米德立体、约翰逊立体、卡塔兰立体以及星形和复合形状,以系统性地测试模型在单目3D重建、3D对称性检测、心理旋转测试和零样本形状分类任务中的表现,从而揭示现有模型在几何理解上的关键缺陷。
链接: https://arxiv.org/abs/2506.08194
作者: Mateusz Michalkiewicz,Anekha Sokhal,Tadeusz Michalkiewicz,Piotr Pawlikowski,Mahsa Baktashmotlagh,Varun Jampani,Guha Balakrishnan
机构: Rice University (莱斯大学); Independent Researcher (独立研究员); The University of Queensland (昆士兰大学); Stability AI (Stability AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures
点击查看摘要
Abstract:Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ , a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra - including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes - covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ is publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.
zh
[CV-99] Generative Learning of Differentiable Object Models for Compositional Interpretation of Complex Scenes
【速读】:该论文试图解决在单个场景中处理多个物体的图像分解问题,以及由此带来的训练困难,特别是在图像空间重建损失中存在大量平缓区域时。其解决方案的关键在于扩展原始的DVP(Disentangler of Visual Priors)架构,使其能够处理多物体场景,并利用其潜在表示的可解释性,通过解码器生成额外的训练样本,同时引入基于图像空间和潜在空间的损失函数进行训练。这一方法显著提升了模型的重建质量和对重叠物体的分解能力。
链接: https://arxiv.org/abs/2506.08191
作者: Antoni Nowinowski,Krzysztof Krawiec
机构: Poznan University of Technology (波兹南科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This study builds on the architecture of the Disentangler of Visual Priors (DVP), a type of autoencoder that learns to interpret scenes by decomposing the perceived objects into independent visual aspects of shape, size, orientation, and color appearance. These aspects are expressed as latent parameters which control a differentiable renderer that performs image reconstruction, so that the model can be trained end-to-end with gradient descent using a reconstruction loss. In this study, we extend the original DVP so that it can handle multiple objects in a scene. We also exploit the interpretability of its latent representation by using the decoder to sample additional training examples and devising alternative training modes that rely on loss functions defined not only in the image space, but also in the latent space. This significantly facilitates training, which is otherwise challenging due to the presence of extensive plateaus in the image-space reconstruction loss. To examine the performance of this approach, we propose a new benchmark featuring multiple 2D objects, which subsumes the previously proposed Multi-dSprites dataset while being more parameterizable. We compare the DVP extended in these ways with two baselines (MONet and LIVE) and demonstrate its superiority in terms of reconstruction quality and capacity to decompose overlapping objects. We also analyze the gradients induced by the considered loss functions, explain how they impact the efficacy of training, and discuss the limitations of differentiable rendering in autoencoders and the ways in which they can be addressed.
zh
[CV-100] Open World Scene Graph Generation using Vision Language Models CVPR2025
【速读】:该论文试图解决传统场景图生成(Scene-Graph Generation, SGG)方法依赖于特定数据集监督的问题,这种依赖限制了其在开放世界场景中的适用性,尤其是在面对新对象和/或关系时。解决方案的关键在于提出一种无需训练、高效且与模型无关的框架——Open-World SGG,该框架直接利用预训练视觉语言模型(Vision Language Models, VLMs)的知识,通过零样本结构化推理问题的形式,结合多模态提示、嵌入对齐和轻量级对精炼策略,实现对未见过的对象词汇和关系集的推理能力。
链接: https://arxiv.org/abs/2506.08189
作者: Amartya Dutta,Kazi Sajeed Mehrab,Medha Sawhney,Abhilash Neog,Mridul Khurana,Sepideh Fatemi,Aanish Pradhan,M. Maruf,Ismini Lourentzou,Arka Daw,Anuj Karpatne
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2025 Workshop (CVinW)
点击查看摘要
Abstract:Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, restricting their usefulness in open-world settings involving novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World evaluation protocol that measures performance when no SGG-specific data have been observed, in terms of either objects or relations. Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate the capacity of pretrained VLMs to perform relational understanding without task-level training.
zh
[CV-101] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework
【速读】:该论文旨在解决机器人手术中AI系统忽视外科医生个性化行为特征的问题,即当前AI系统未能有效建模外科医生独特的操作风格。解决方案的关键在于提出一种基于离散扩散框架与视觉-语言-动作(VLA)管道相结合的方法,通过多模态输入(包括内窥镜视频、手术意图语言以及隐私感知的外科医生身份与技能嵌入)来建模细粒度的外科医生特定指纹。该方法将手势预测建模为结构化序列去噪任务,并通过第三方语言模型生成自然语言提示来编码个性化外科医生指纹,从而在不暴露显式身份的情况下保留个体行为风格。
链接: https://arxiv.org/abs/2506.08185
作者: Huixin Zhan,Jason H. Moore
机构: Cedars-Sinai Medical Center (塞德斯-西奈医疗中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Surgeons exhibit distinct operating styles due to differences in training, experience, and motor behavior - yet current AI systems often ignore this personalization signal. We propose a novel approach to model fine-grained, surgeon-specific fingerprinting in robotic surgery using a discrete diffusion framework integrated with a vision-language-action (VLA) pipeline. Our method formulates gesture prediction as a structured sequence denoising task, conditioned on multimodal inputs including endoscopic video, surgical intent language, and a privacy-aware embedding of surgeon identity and skill. Personalized surgeon fingerprinting is encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings demonstrate that while personalized embeddings improve performance, they also increase vulnerability to identity leakage, revealing the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: this https URL.
zh
[CV-102] UniVarFL: Uniformity and Variance Regularized Federated Learning for Heterogeneous Data
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在面对非独立同分布(non-IID)数据时性能严重下降的问题,其主要原因是本地分类器偏差。论文提出的解决方案关键在于提出一种名为UniVarFL的新框架,该框架通过在客户端层面直接模拟独立同分布(IID)的训练动态,从而消除对全局模型的依赖。其核心创新在于结合了两种互补的正则化策略:分类器方差正则化(Classifier Variance Regularization)用于对齐类别概率分布以缓解本地分类器偏差;超球面均匀性正则化(Hyperspherical Uniformity Regularization)用于促进特征表示在超球面上的均匀分布,从而提升模型在多样化数据分布下的泛化能力。
链接: https://arxiv.org/abs/2506.08167
作者: Sunny Gupta,Nikita Jangid,Amit Sethi
机构: Indian Institute of Technology, Bombay (印度理工学院,孟买); Koita Centre for Digital Health (库塔数字健康中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:Federated Learning (FL) often suffers from severe performance degradation when faced with non-IID data, largely due to local classifier bias. Traditional remedies such as global model regularization or layer freezing either incur high computational costs or struggle to adapt to feature shifts. In this work, we propose UniVarFL, a novel FL framework that emulates IID-like training dynamics directly at the client level, eliminating the need for global model dependency. UniVarFL leverages two complementary regularization strategies during local training: Classifier Variance Regularization, which aligns class-wise probability distributions with those expected under IID conditions, effectively mitigating local classifier bias; and Hyperspherical Uniformity Regularization, which encourages a uniform distribution of feature representations across the hypersphere, thereby enhancing the model’s ability to generalize under diverse data distributions. Extensive experiments on multiple benchmark datasets demonstrate that UniVarFL outperforms existing methods in accuracy, highlighting its potential as a highly scalable and efficient solution for real-world FL deployments, especially in resource-constrained settings. Code: this https URL
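A plausible sketch of the two regularizers is given below. The hyperspherical term follows the uniformity loss of Wang & Isola (2020); the variance term is one possible reading of the paper's class-wise alignment idea, not its exact formulation:

```python
import torch
import torch.nn.functional as F

def hyperspherical_uniformity(z, t=2.0):
    """Uniformity loss (Wang & Isola, 2020): pushes L2-normalized features
    to spread uniformly on the unit hypersphere."""
    z = F.normalize(z, dim=-1)
    sq_dists = torch.cdist(z, z, p=2).pow(2)
    n = z.size(0)
    off_diag = sq_dists[~torch.eye(n, dtype=torch.bool, device=z.device)]
    return torch.log(torch.exp(-t * off_diag).mean())

def classifier_variance_reg(probs, labels, num_classes):
    """Penalize deviation of per-class mean predicted distributions from the
    global mean, a rough proxy for IID-like classifier behavior."""
    global_mean = probs.mean(0)
    loss = probs.new_zeros(())
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            loss = loss + (probs[mask].mean(0) - global_mean).pow(2).sum()
    return loss / num_classes
```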
zh
[CV-103] Spectral Domain Neural Reconstruction for Passband FMCW Radars
【速读】:该论文试图解决在高频段下基于FMCW雷达的高保真体积重建问题,特别是在高起始频率条件下由于相位混叠和子频段模糊导致的建模困难。解决方案的关键在于提出了一种完全可微的频域正向模型,该模型通过闭式合成捕捉复杂的雷达响应,并结合隐式神经表示(INR)实现连续的体积场景建模,同时直接对复数频谱进行监督,从而保持频谱保真度并大幅降低计算开销。此外,引入稀疏性和平滑性正则化以消除细粒度距离分辨率下的子频段模糊问题。
链接: https://arxiv.org/abs/2506.08163
作者: Harshvardhan Takawale,Nirupam Roy
机构: University of Maryland, College Park (马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2503.23313
点击查看摘要
Abstract:We present SpINRv2, a neural framework for high-fidelity volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar. Extending our prior work (SpINR), this version introduces enhancements that allow accurate learning under high start frequencies, where phase aliasing and sub-bin ambiguity become prominent. Our core contribution is a fully differentiable frequency-domain forward model that captures the complex radar response using closed-form synthesis, paired with an implicit neural representation (INR) for continuous volumetric scene modeling. Unlike time-domain baselines, SpINRv2 directly supervises the complex frequency spectrum, preserving spectral fidelity while drastically reducing computational overhead. Additionally, we introduce sparsity and smoothness regularization to disambiguate sub-bin ambiguities that arise at fine range resolutions. Experimental results show that SpINRv2 significantly outperforms both classical and learning-based baselines, especially under high-frequency regimes, establishing a new benchmark for neural radar-based 3D imaging.
zh
[CV-104] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation
【速读】:该论文旨在解决基础设施网络(如渠道和道路)在遥感图像中准确映射的问题,特别是在缺乏完整或高质量真实标注数据的情况下。其关键解决方案是提出一种名为IGraSS的迭代框架,该框架结合了语义分割模块与基于图的真值优化模块,通过融合RGB影像及额外模态(如NDWI、DEM)进行分割,并将整个基础设施网络视为图结构进行优化,从而提升真实标注的质量和网络的完整性。
链接: https://arxiv.org/abs/2506.08137
作者: Oishee Bintey Hoque,Abhijin Adiga,Aniruddha Adiga,Siddharth Chaudhary,Madhav V. Marathe,S. S. Ravi,Kirti Rajagopalan,Amanda Wilson,Samarth Swarup
机构: Biocomplexity Institute, University of Virginia; Department of Computer Science, University of Virginia; Department of Biomedical Systems Engineering, Washington State University; Earth System Science Center, University of Alabama in Huntsville
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties such as reachability to a source (like canals) or connectivity (roads) that can be leveraged to improve the existing ground truth. This paper develops a novel iterative framework, IGraSS, combining a semantic segmentation module (incorporating RGB and additional modalities such as NDWI and DEM) with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire dataset, viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.
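The reachability constraint at the heart of the refinement module can be illustrated with a simple flood fill: canal pixels that cannot be reached from a water source are pruned from the noisy ground truth. This shows only the graph-theoretic core, not the full IGraSS loop:

```python
import numpy as np
from collections import deque

def prune_unreachable(mask, sources):
    """Keep only canal pixels reachable (4-connectivity) from a water source.
    mask: binary HxW prediction; sources: list of (row, col) seed pixels."""
    h, w = mask.shape
    keep = np.zeros((h, w), dtype=bool)
    q = deque(s for s in sources if mask[s])
    for s in q:
        keep[s] = True
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and mask[rr, cc] and not keep[rr, cc]:
                keep[rr, cc] = True
                q.append((rr, cc))
    return keep  # refined mask: unreachable canal fragments removed
```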
zh
[CV-105] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
【速读】:该论文试图解决当前主流文本到图像(text-to-image, T2I)系统在训练数据中存在文化代表性不足的问题,尤其是对全球南方文化缺乏充分覆盖。其解决方案的关键在于提出CuRe,一个新颖且可扩展的文化代表性基准测试与评分工具,该工具利用属性规范的边际效用作为人类判断的代理,通过构建基于众包维基百科知识图谱的新型分类层次结构,对300个跨32个文化子类别的文化文物进行评估,从而实现对T2I系统在文化细节上的细致比较与分析。
链接: https://arxiv.org/abs/2506.08071
作者: Aniket Rege,Zinnia Nie,Mahesh Ramesh,Unmesh Raskar,Zhuoran Yu,Aditya Kusupati,Yong Jae Lee,Ramya Korlakai Vinayak
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 22 figures, 17 tables
点击查看摘要
Abstract:Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset’s categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset are open-sourced and available at this https URL.
zh
[CV-106] A Real-time 3D Desktop Display
【速读】:该论文旨在解决从2D图像或视频流生成实时光场(light-field)以实现无眼镜全息显示的问题,特别是在处理来自2D网络摄像头图像或平面视频文件的3D视频流时。解决方案的关键在于使用MiDaS卷积神经网络(CNN)从单个2D图像中提取深度图,从而重建多视角内容,结合人工智能(AI)计算技术提升扩展后的altiro3D库的整体性能。
链接: https://arxiv.org/abs/2506.08064
作者: Livio Tenze,Enrique Canessa
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:A new extended version of the altiro3D C++ Library – initially developed to get glass-free holographic displays starting from 2D images – is introduced here, aiming to deal with 3D video streams from either 2D webcam images or flat video files. These streams are processed in real-time to synthesize light-fields (in Native format) and feed realistic 3D experiences. The core function needed to recreate multiviews consists in the use of the MiDaS Convolutional Neural Network (CNN), which extracts a depth map from a single 2D image. Artificial Intelligence (AI) computing techniques are applied to improve the overall performance of the extended altiro3D Library. Thus, altiro3D can now treat standard images, video streams or screen portions of a Desktop where other apps may also be running (like web browsers, video chats, etc) and render them into 3D. To achieve the latter, a screen region needs to be selected in order to feed the output directly into a light-field 3D device such as Looking Glass (LG) Portrait. In order to simplify the acquisition of a Desktop screen area by the user, a multi-platform Graphical User Interface has also been implemented. Sources available at: this https URL
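Depth extraction with MiDaS is available directly through torch.hub; a minimal sketch follows (weights are downloaded on first use, and the small model is substituted here for speed):

```python
import cv2
import torch

# Fetch MiDaS and its matching preprocessing from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transforms.small_transform(img))   # (1, H', W') inverse depth
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False).squeeze()
# `depth` can then drive per-pixel view shifts when synthesizing the multiviews.
```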
zh
[CV-107] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶在罕见和长尾场景下性能显著下降的问题。现有方法虽尝试利用视觉-语言模型(VLM)的丰富世界知识,但面临三个关键限制:预训练数据与真实驾驶数据之间的领域差异、离散语言空间与连续动作空间的维度不匹配,以及模仿学习可能捕捉到次优甚至危险的平均行为。论文提出的解决方案——ReCogDrive,其关键在于将VLM与扩散规划器(diffusion planner)相结合,并采用三阶段训练范式:首先利用大规模驾驶问答数据集训练VLM以缓解领域差异,其次通过基于扩散的规划器执行模仿学习,将语言空间表示映射为连续驾驶动作,最后使用NAVSIM非反应式模拟器进行强化学习微调,从而生成更安全、更接近人类的驾驶轨迹。
链接: https://arxiv.org/abs/2506.08052
作者: Yongkang Li,Kaixin Xiong,Xiangyu Guo,Fang Li,Sixu Yan,Gangwei Xu,Lijun Zhou,Long Chen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Although end-to-end autonomous driving has made remarkable progress, its performance degrades significantly in rare and long-tail scenarios. Recent approaches attempt to address this challenge by leveraging the rich world knowledge of Vision-Language Models (VLMs), but these methods suffer from several limitations: (1) a significant domain gap between the pre-training data of VLMs and real-world driving data, (2) a dimensionality mismatch between the discrete language space and the continuous action space, and (3) imitation learning tends to capture the average behavior present in the dataset, which may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an autonomous driving system that integrates VLMs with a diffusion planner and adopts a three-stage training paradigm. In the first stage, we use large-scale driving question-answering datasets to train the VLMs, mitigating the domain discrepancy between generic content and real-world driving scenarios. In the second stage, we employ a diffusion-based planner to perform imitation learning, mapping representations from the latent language space to continuous driving actions. Finally, we fine-tune the diffusion planner using reinforcement learning with the NAVSIM non-reactive simulator, enabling the model to generate safer, more human-like driving trajectories. We evaluate our approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6 and setting a new state-of-the-art that surpasses the previous vision-only SOTA by 5.6 PDMS.
zh
[CV-108] owards Reliable AR-Guided Surgical Navigation: Interactive Deformation Modeling with Data-Driven Biomechanics and Prompts
【速读】:该论文旨在解决增强现实(AR)引导手术导航中预手术器官模型与术中动态解剖结构之间对齐不准确的问题,特别是在面对如气腹或韧带分离等大范围解剖变化时,传统算法难以保持精确的解剖对应关系,从而影响AR导航的可靠性。其解决方案的关键在于提出一种数据驱动的生物力学算法,该算法在保持有限元方法(FEM)级精度的同时提升了计算效率,并引入了一种人机协同机制,使外科医生能够通过交互式提示纠正解剖错位,从而结合临床经验,提升模型对复杂手术场景的适应能力。
链接: https://arxiv.org/abs/2506.08048
作者: Zheng Han,Jun Zhou,Jialun Pei,Jing Qin,Yingfang Fan,Qi Dou
机构: The Chinese University of Hong Kong, HKSAR, China; Hong Kong Polytechnic University, HKSAR, China; Southern Medical University, Guangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:In augmented reality (AR)-guided surgical navigation, preoperative organ models are superimposed onto the patient’s intraoperative anatomy to visualize critical structures such as vessels and tumors. Accurate deformation modeling is essential to maintain the reliability of AR overlays by ensuring alignment between preoperative models and the dynamically changing anatomy. Although the finite element method (FEM) offers physically plausible modeling, its high computational cost limits intraoperative applicability. Moreover, existing algorithms often fail to handle large anatomical changes, such as those induced by pneumoperitoneum or ligament dissection, leading to inaccurate anatomical correspondences and compromised AR guidance. To address these challenges, we propose a data-driven biomechanics algorithm that preserves FEM-level accuracy while improving computational efficiency. In addition, we introduce a novel human-in-the-loop mechanism into the deformation modeling process. This enables surgeons to interactively provide prompts to correct anatomical misalignments, thereby incorporating clinical expertise and allowing the model to adapt dynamically to complex surgical scenarios. Experiments on a publicly available dataset demonstrate that our algorithm achieves a mean target registration error of 3.42 mm. Incorporating surgeon prompts through the interactive framework further reduces the error to 2.78 mm, surpassing state-of-the-art methods in volumetric accuracy. These results highlight the ability of our framework to deliver efficient and accurate deformation modeling while enhancing surgeon-algorithm collaboration, paving the way for safer and more reliable computer-assisted surgeries.
zh
[CV-109] Neural-Augmented Kelvinlet: Real-Time Soft Tissue Deformation with Multiple Graspers
【速读】:该论文旨在解决手术机器人和医学训练中软组织形变快速且准确模拟的问题,这是实现高保真手术仿真和实时交互的关键挑战。其解决方案的关键在于提出了一种物理信息神经模拟器,该模拟器将基于Kelvinlet的先验知识整合到神经模拟器中,首次利用Kelvinlets进行残差学习和正则化,从而提升了数据驱动的软组织建模的准确性与物理一致性,同时保持了低延迟以满足实时性能需求。
链接: https://arxiv.org/abs/2506.08043
作者: Ashkan Shahbazi,Kyvia Pereira,Jon S. Heiselman,Elaheh Akbari,Annie C. Benson,Sepehr Seifi,Xinyuan Liu,Garrison L. Johnston,Erwin Terpstra,Anne Draaisma,Jan-Jaap Severes,Jie Ying Wu,Nabil Simaan,Michael L.Miga,Soheil Kolouri
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Fast and accurate simulation of soft tissue deformation is a critical factor for surgical robotics and medical training. In this paper, we introduce a novel physics-informed neural simulator that approximates soft tissue deformations in a realistic and real-time manner. Our framework integrates Kelvinlet-based priors into neural simulators, making it the first approach to leverage Kelvinlets for residual learning and regularization in data-driven soft tissue modeling. By incorporating large-scale Finite Element Method (FEM) simulations of both linear and nonlinear soft tissue responses, our method improves neural network predictions across diverse architectures, enhancing accuracy and physical consistency while maintaining low latency for real-time performance. We demonstrate the effectiveness of our approach by performing accurate surgical maneuvers that simulate the use of standard laparoscopic tissue grasping tools with high fidelity. These results establish Kelvinlet-augmented learning as a powerful and efficient strategy for real-time, physics-aware soft tissue simulation in surgical applications.
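For context, the regularized Kelvinlet of de Goes & James (2017), which the paper uses as a prior, gives a closed-form displacement for a point force in an infinite elastic medium. A small sketch with assumed material parameters:

```python
import numpy as np

def kelvinlet(r, f, mu=1.0, nu=0.45, eps=0.05):
    """Regularized Kelvinlet displacement (de Goes & James, 2017) at offset
    r (3,) from a point force f (3,). mu: shear modulus, nu: Poisson ratio,
    eps: regularization radius that removes the singularity at r = 0."""
    a = 1.0 / (4.0 * np.pi * mu)
    b = a / (4.0 * (1.0 - nu))
    r_eps = np.sqrt(np.dot(r, r) + eps ** 2)
    term_iso = (a - b) / r_eps + 0.5 * a * eps ** 2 / r_eps ** 3
    term_dir = b / r_eps ** 3
    return term_iso * f + term_dir * np.dot(r, f) * r

# Displacement 2 cm from a grasper pulling along +x:
print(kelvinlet(np.array([0.02, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])))
```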
zh
[CV-110] Bi-level Unbalanced Optimal Transport for Partial Domain Adaptation
【速读】:该论文试图解决部分领域自适应(Partial Domain Adaptation, PDA)问题,即在跨域样本对齐的同时区分异常类别以实现准确的知识迁移。现有加权框架通过引入与目标域标签分布相似的加权源域来处理异常类别,但其经验建模仅能表征样本级关系,导致对聚类结构探索不足,且权重对预测不准确敏感,易造成异常类别的混淆。该论文的关键解决方案是提出一种双层不平衡最优传输(Bi-level Unbalanced Optimal Transport, BUOT)模型,在统一的传输框架中同时表征样本级和类别级关系。其核心在于引入样本级与类别级传输之间的协作机制,其中样本级传输为类别级知识迁移提供结构信息,而类别级传输则为异常识别提供判别信息,从而提升对齐过程的准确性与效率。
链接: https://arxiv.org/abs/2506.08020
作者: Zi-Ying Chen,Chuan-Xian Ren,Hong Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Partial domain adaptation (PDA) problem requires aligning cross-domain samples while distinguishing the outlier classes for accurate knowledge transfer. The widely used weighting framework tries to address the outlier classes by introducing the reweighed source domain with a similar label distribution to the target domain. However, the empirical modeling of weights can only characterize the sample-wise relations, which leads to insufficient exploration of cluster structures, and the weights could be sensitive to the inaccurate prediction and cause confusion on the outlier classes. To tackle these issues, we propose a Bi-level Unbalanced Optimal Transport (BUOT) model to simultaneously characterize the sample-wise and class-wise relations in a unified transport framework. Specifically, a cooperation mechanism between sample-level and class-level transport is introduced, where the sample-level transport provides essential structure information for the class-level knowledge transfer, while the class-level transport supplies discriminative information for the outlier identification. The bi-level transport plan provides guidance for the alignment process. By incorporating the label-aware transport cost, the local transport structure is ensured and a fast computation formulation is derived to improve the efficiency. Extensive experiments on benchmark datasets validate the competitiveness of BUOT.
zh
[CV-111] Gridding Forced Displacement using Semi-Supervised Learning
【速读】:该论文旨在解决难民统计数据在行政边界层面缺乏空间细化的问题,通过将难民统计信息从宏观的行政区域 disaggregates(分解)到0.5度网格单元,以提高空间分辨率。其解决方案的关键在于采用一种半监督方法,结合联合国难民署(UNHCR)的ProGres注册数据、Google Open Buildings提供的卫星衍生建筑足迹数据以及OpenStreetMap Populated Places的位置坐标,利用标签传播算法实现对超过1000万条难民观测数据的高精度空间定位,平均准确率达到92.9%。
链接: https://arxiv.org/abs/2506.08019
作者: Andrew Wells,Geraldine Henningsen,Brice Bolane Tchinde Kengne
机构: UNHCR(联合国难民署); UNHCR(联合国难民署); UNHCR(联合国难民署)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:We present a semi-supervised approach that disaggregates refugee statistics from administrative boundaries to 0.5-degree grid cells across 25 Sub-Saharan African countries. By integrating UNHCR’s ProGres registration data with satellite-derived building footprints from Google Open Buildings and location coordinates from OpenStreetMap Populated Places, our label spreading algorithm creates spatially explicit refugee statistics at high resolution. The methodology achieves 92.9% average accuracy in placing over 10 million refugee observations into appropriate grid cells, enabling the identification of localized displacement patterns previously obscured in broader regional and national statistics. The resulting high-resolution dataset provides a foundation for a deeper understanding of displacement drivers.
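Label spreading itself is available off the shelf, e.g. in scikit-learn. The sketch below uses random stand-in features for illustration, whereas the paper derives them from building footprints and populated-place coordinates:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# X: per-grid-cell features (e.g. building density, distance to populated
# places); y: grid-cell labels, with -1 marking unlabeled cells.
rng = np.random.default_rng(0)
X = rng.random((1000, 3))                 # stand-in features
y = np.full(1000, -1)
y[:50] = rng.integers(0, 2, 50)           # a few cells with known assignments

model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
model.fit(X, y)
cell_labels = model.transduction_         # propagated label per grid cell
```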
zh
[CV-112] Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval CVPR2025
【速读】:该论文试图解决文本到视频检索(text-to-video retrieval, T2VR)的问题。其解决方案的关键在于提出一种名为Video-ColBERT的方法,该方法通过引入细粒度的空间和时间逐标记交互、查询与视觉扩展以及训练过程中的双sigmoid损失,实现查询与视频之间的细粒度相似性评估。这种交互与训练范式能够生成强而兼容的视频内容表示,从而在常见的文本到视频检索基准测试中提升性能。
链接: https://arxiv.org/abs/2503.19009
作者: Arun Reddy,Alexander Martin,Eugene Yang,Andrew Yates,Kate Sanders,Kenton Murray,Reno Kriz,Celso M. de Melo,Benjamin Van Durme,Rama Chellappa
机构: Johns Hopkins Applied Physics Laboratory (约翰霍普金斯应用物理实验室); Johns Hopkins University (约翰霍普金斯大学); Human Language Technology Center of Excellence (人类语言技术卓越中心); DEVCOM Army Research Laboratory (DEVCOM陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted at CVPR 2025. 13 pages, 4 figures. Approved for public release: distribution unlimited
点击查看摘要
Abstract:In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
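The late-interaction core is the ColBERT-style MaxSim operator; a minimal sketch is below. Video-ColBERT adds spatial and temporal variants, query and visual expansions, and a dual sigmoid training loss on top of this mechanism:

```python
import torch

def late_interaction_score(q_tok, v_tok):
    """ColBERT-style MaxSim: for each query token, take its best-matching
    video token, then sum. q_tok: (Nq, d), v_tok: (Nv, d), L2-normalized."""
    sim = q_tok @ v_tok.T               # (Nq, Nv) cosine similarities
    return sim.max(dim=1).values.sum()  # sum of per-query-token maxima
```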
zh
[CV-113] Enhancing Synthetic CT from CBCT via Multimodal Fusion: A Study on the Impact of CBCT Quality and Alignment
【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)在术中实时成像中因伪影较多而导致的图像质量较低的问题。其解决方案的关键在于通过多模态学习,将术中CBCT与术前CT数据进行融合,从而生成更高质量的合成CT(synthetic CT, sCT),以改善图像的视觉质量和诊断可靠性。
链接: https://arxiv.org/abs/2506.08716
作者: Maximilian Tschuchnig,Lukas Lamminger,Philipp Steininger,Michael Gadermayr
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Data is open source. Code will be provided on acceptance. Paper currently under review
点击查看摘要
Abstract:Cone-Beam Computed Tomography (CBCT) is widely used for real-time intraoperative imaging due to its low radiation dose and high acquisition speed. However, despite its high resolution, CBCT suffers from significant artifacts and thereby lower visual quality, compared to conventional Computed Tomography (CT). A recent approach to mitigate these artifacts is synthetic CT (sCT) generation, translating CBCT volumes into the CT domain. In this work, we enhance sCT generation through multimodal learning, integrating intraoperative CBCT with preoperative CT. Beyond validation on two real-world datasets, we use a versatile synthetic dataset, to analyze how CBCT-CT alignment and CBCT quality affect sCT quality. The results demonstrate that multimodal sCT consistently outperform unimodal baselines, with the most significant gains observed in well-aligned, low-quality CBCT-CT cases. Finally, we demonstrate that these findings are highly reproducible in real-world clinical datasets.
zh
[CV-114] MAMBO: High-Resolution Generative Approach for Mammography Images
【速读】:该论文旨在解决乳腺癌筛查中因隐私和伦理限制导致的高质量、多样化训练数据获取困难的问题,从而提升基于人工智能(AI)的乳腺X线摄影辅助诊断系统的性能。其解决方案的关键在于提出一种名为MAMBO的新型基于块的扩散方法,该方法通过集成多个扩散模型以捕捉局部和全局上下文信息,并利用这些上下文信息增强噪声去除过程,从而生成高分辨率(最高达3840x3840像素)且高度逼真的乳腺X线图像,进而支持分类模型训练和异常检测任务。
链接: https://arxiv.org/abs/2506.08677
作者: Milica Škipina,Nikola Jovišić,Nicola Dall’Asen,Vanja Švenda,Anil Osman Tur,Slobodan Ilić,Elisa Ricci,Dubravko Ćulibrk
机构: The Institute for Artificial Intelligence Research and Development of Serbia; University of Trento; Faculty of Technical Sciences, University of Novi Sad; University of Pisa; Fondazione Bruno Kessler; University of Verona
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 14 figures, 7 tables
点击查看摘要
Abstract:Mammography is the gold standard for the detection and diagnosis of breast cancer. This procedure can be significantly enhanced with Artificial Intelligence (AI)-based software, which assists radiologists in identifying abnormalities. However, training AI systems requires large and diverse datasets, which are often difficult to obtain due to privacy and ethical constraints. To address this issue, the paper introduces MAMmography ensemBle mOdel (MAMBO), a novel patch-based diffusion approach designed to generate full-resolution mammograms. Diffusion models have shown breakthrough results in realistic image generation, yet few studies have focused on mammograms, and none have successfully generated high-resolution outputs required to capture fine-grained features of small lesions. To achieve this, MAMBO integrates separate diffusion models to capture both local and global (image-level) contexts. The contextual information is then fed into the final patch-based model, significantly aiding the noise removal process. This thoughtful design enables MAMBO to generate highly realistic mammograms of up to 3840x3840 pixels. Importantly, this approach can be used to enhance the training of classification models and can be extended to anomaly detection. Experiments, spanning both numerical evaluation and radiologist validation, assess MAMBO’s capabilities in image generation, super-resolution, and anomaly detection, highlighting its potential to enhance mammography analysis for more accurate diagnoses and earlier lesion detection.
zh
[CV-115] Biologically Inspired Deep Learning Approaches for Fetal Ultrasound Image Classification
【速读】:该论文旨在解决第二孕期胎儿超声图像准确分类的问题,该问题由于图像质量低、类内变异大以及类别不平衡而具有挑战性。其解决方案的关键在于提出一种简单但强大的生物启发式深度学习集成框架,该框架通过模仿生物视觉系统的分层模块化结构,将两个互补的分支(一个用于粗略、低分辨率线索的“浅层”路径和一个用于精细、高分辨率特征的“详细”路径)进行堆叠,并融合其输出以实现最终预测,从而同时区分16种胎儿结构。
链接: https://arxiv.org/abs/2506.08623
作者: Rinat Prochii,Elizaveta Dakhova,Pavel Birulin,Maxim Sharaev
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Accurate classification of second-trimester fetal ultrasound images remains challenging due to low image quality, high intra-class variability, and significant class imbalance. In this work, we introduce a simple yet powerful, biologically inspired deep learning ensemble framework that, unlike prior studies focused on only a handful of anatomical targets, simultaneously distinguishes 16 fetal structures. Drawing on the hierarchical, modular organization of biological vision systems, our model stacks two complementary branches (a “shallow” path for coarse, low-resolution cues and a “detailed” path for fine, high-resolution features), concatenating their outputs for final prediction. To our knowledge, no existing method has addressed such a large number of classes with a comparably lightweight architecture. We trained and evaluated on 5,298 routinely acquired clinical images (annotated by three experts and reconciled via Dawid-Skene), reflecting real-world noise and variability rather than a “cleaned” dataset. Despite this complexity, our ensemble (EfficientNet-B0 + EfficientNet-B6 with LDAM-Focal loss) identifies 90% of organs with accuracy above 0.75 and 75% of organs with accuracy above 0.85, performance competitive with more elaborate models applied to far fewer categories. These results demonstrate that biologically inspired modular stacking can yield robust, scalable fetal anatomy recognition in challenging clinical settings.
zh
[CV-116] DCD: A Semantic Segmentation Model for Fetal Ultrasound Four-Chamber View
【速读】:该论文旨在解决胎儿超声心动图中心尖四腔观(apical four-chamber, A4C)图像中解剖结构精确分割的问题,该问题对于先天性心脏病(congenital heart disease, CHD)的早期诊断和产前评估至关重要。由于超声伪影、斑点噪声、解剖变异性和不同孕周间的边界模糊性,精确分割仍面临挑战。论文提出的解决方案是DCD模型,其关键在于引入了密集空洞空间金字塔池化(Dense Atrous Spatial Pyramid Pooling, Dense ASPP)模块以实现多尺度特征提取,并结合卷积块注意力模块(Convolutional Block Attention Module, CBAM)以增强自适应特征表示,从而有效捕捉局部与全局上下文信息,实现精准且鲁棒的分割。
链接: https://arxiv.org/abs/2506.08534
作者: Donglian Li,Hui Guo,Minglang Chen,Huizhen Chen,Jialing Chen,Bocheng Liang,Pengchen Liang,Ying Tan
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate segmentation of anatomical structures in the apical four-chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workload of sonographers and enhance segmentation accuracy, we propose DCD, an advanced deep learning-based model for automatic segmentation of key anatomical structures in the fetal A4C view. Our model incorporates a Dense Atrous Spatial Pyramid Pooling (Dense ASPP) module, enabling superior multi-scale feature extraction, and a Convolutional Block Attention Module (CBAM) to enhance adaptive feature representation. By effectively capturing both local and global contextual information, DCD achieves precise and robust segmentation, contributing to improved prenatal cardiac assessment.
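For reference, a compact CBAM block in the style of Woo et al. (2018), the attention module the model builds on, is sketched below; the Dense ASPP branch is omitted:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention from pooled
    descriptors, followed by spatial attention from channel statistics."""
    def __init__(self, c, reduction=16, k=7):
        super().__init__()
        self.mlp = nn.Sequential(            # shared MLP for avg/max paths
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
            nn.Conv2d(c // reduction, c, 1))
        self.spatial = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        ca = torch.sigmoid(
            self.mlp(x.mean((2, 3), keepdim=True)) +
            self.mlp(x.amax((2, 3), keepdim=True)))   # channel attention
        x = x * ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                  # spatial attention
```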
zh
[CV-117] Plug-and-Play Linear Attention for Pre-trained Image and Video Restoration Models
【速读】:该论文旨在解决多头自注意力(Multi-head self-attention, MHSA)在现代计算机视觉模型中因输入长度呈二次复杂度而带来的计算瓶颈问题,尤其针对实时和资源受限环境下的效率问题。其解决方案的关键在于提出PnP-Nystra,这是一种基于Nyström方法的自注意力线性近似算法,作为即插即用(plug-and-play, PnP)模块,能够无需重新训练直接集成到预训练的图像和视频修复模型中,从而在多种窗口基础的Transformer架构中实现高效加速。
链接: https://arxiv.org/abs/2506.08520
作者: Srinivasan Kidambi,Pravin Nair
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 1 pseudo-code, 3 figure panels, 2 plot panels, 7 tables, 24 references
点击查看摘要
Abstract:Multi-head self-attention (MHSA) has become a core component in modern computer vision models. However, its quadratic complexity with respect to input length poses a significant computational bottleneck in real-time and resource-constrained environments. We propose PnP-Nystra, a Nyström-based linear approximation of self-attention, developed as a plug-and-play (PnP) module that can be integrated into pre-trained image and video restoration models without retraining. As a drop-in replacement for MHSA, PnP-Nystra enables efficient acceleration in various window-based transformer architectures, including SwinIR, Uformer, and RVRT. Our experiments across diverse image and video restoration tasks, including denoising, deblurring, and super-resolution, demonstrate that PnP-Nystra achieves a 2-4x speed-up on an NVIDIA RTX 4090 GPU and a 2-5x speed-up on CPU inference. Despite these significant gains, the method incurs a maximum PSNR drop of only 1.5 dB across all evaluated tasks. To the best of our knowledge, we are the first to demonstrate linear attention functioning as a training-free substitute for MHSA in restoration models.
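The underlying Nyström approximation (Xiong et al., 2021) can be sketched in a few lines; this shows the core idea only, not PnP-Nystra's adaptation to windowed attention in restoration models. The sequence length is assumed divisible by the number of landmarks:

```python
import torch

def nystrom_attention(q, k, v, m=32):
    """Nystrom approximation of softmax attention. q, k, v: (n, d).
    Landmarks are segment means, giving O(n * m) cost instead of O(n^2)."""
    n, d = q.shape
    q_l = q.reshape(m, n // m, d).mean(1)             # (m, d) query landmarks
    k_l = k.reshape(m, n // m, d).mean(1)             # (m, d) key landmarks
    scale = d ** -0.5
    f = torch.softmax(q @ k_l.T * scale, dim=-1)      # (n, m)
    a = torch.softmax(q_l @ k_l.T * scale, dim=-1)    # (m, m)
    b = torch.softmax(q_l @ k.T * scale, dim=-1) @ v  # (m, d)
    return f @ torch.linalg.pinv(a) @ b               # (n, d) output
```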
zh
[CV-118] Snap-and-tune: combining deep learning and test-time optimization for high-fidelity cardiovascular volumetric meshing
【速读】:该论文试图解决从医学图像中生成高质量体积网格(volumetric mesh)的问题,这是个性化医学中基于物理的仿真的一大瓶颈。其解决方案的关键在于提出了一种简单而有效的“snap-and-tune”策略,该策略依次应用深度学习(DL)和测试时优化(test-time optimization),结合了快速初始形状拟合与更精细的样本特定网格修正,从而在保持自动化的同时显著提升了空间精度和网格质量。
链接: https://arxiv.org/abs/2506.08280
作者: Daniel H. Pak,Shubh Thaker,Kyle Baylous,Xiaoran Zhang,Danny Bluestein,James S. Duncan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:High-quality volumetric meshing from medical images is a key bottleneck for physics-based simulations in personalized medicine. For volumetric meshing of complex medical structures, recent studies have often utilized deep learning (DL)-based template deformation approaches to enable fast test-time generation with high spatial accuracy. However, these approaches still exhibit limitations, such as limited flexibility at high-curvature areas and unrealistic inter-part distances. In this study, we introduce a simple yet effective snap-and-tune strategy that sequentially applies DL and test-time optimization, which combines fast initial shape fitting with more detailed sample-specific mesh corrections. Our method provides significant improvements in both spatial accuracy and mesh quality, while being fully automated and requiring no additional training labels. Finally, we demonstrate the versatility and usefulness of our newly generated meshes via solid mechanics simulations in two different software platforms. Our code is available at this https URL.
zh
[CV-119] A System for Accurate Tracking and Video Recordings of Rodent Eye Movements using Convolutional Neural Networks for Biomedical Image Segmentation
【速读】:该论文试图解决在啮齿类动物眼动追踪中,现有技术主要针对人类眼睛设计,无法有效处理啮齿类动物眼睛图像特有的问题,如眼参数的变异性、周围毛发的丰富性以及眼睛尺寸的小型化。解决方案的关键在于提出一种灵活、鲁棒且高精度的模型,用于啮齿类动物瞳孔和角膜反射的识别,并支持增量训练以适应实际应用中的眼参数变化。该方法结合自动红外视频眼动记录系统,提供了当前啮齿类动物眼动追踪领域的最先进技术。
链接: https://arxiv.org/abs/2506.08183
作者: Isha Puri,David Cox
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Research in neuroscience and vision science relies heavily on careful measurements of animal subjects’ gaze direction. Rodents are the most widely studied animal subjects for such research because of their economic advantage and hardiness. Recently, video-based eye trackers that use image processing techniques have become a popular option for gaze tracking because they are easy to use and are completely noninvasive. Although significant progress has been made in improving the accuracy and robustness of eye tracking algorithms, unfortunately, almost all of the techniques have focused on human eyes and thus do not account for the unique characteristics of rodent eye images, e.g., variability in eye parameters, abundance of surrounding hair, and their small size. To overcome these unique challenges, this work presents a flexible, robust, and highly accurate model for pupil and corneal reflection identification in rodent gaze determination that can be incrementally trained to account for variability in eye parameters encountered in the field. To the best of our knowledge, this is the first paper that demonstrates a highly accurate and practical biomedical image segmentation-based convolutional neural network architecture for pupil and corneal reflection identification in eye images. This new method, in conjunction with our automated infrared video-based eye recording system, offers state-of-the-art eye-tracking technology for neuroscience and vision science research on rodents.
zh
[CV-120] Aligning Proteins and Language: A Foundation Model for Protein Retrieval CVPR2025
【速读】:该论文旨在从大规模蛋白质数据集中检索具有相似结构和语义的蛋白质,从而促进通过冷冻电子显微镜(cryo-EM)等结构确定方法获得的蛋白质结构的功能解释。解决方案的关键在于受视觉-语言模型(VLMs)近期进展启发,提出了一种类似CLIP的框架,利用对比学习将3D蛋白质结构与功能注释对齐,并构建了一个包含约20万对蛋白质-描述符的大型数据集用于模型训练。
链接: https://arxiv.org/abs/2506.08023
作者: Qifeng Wu,Zhengzhe Liu,Han Zhu,Yizhou Zhao,Daisuke Kihara,Min Xu
机构: Carnegie Mellon University (卡内基梅隆大学); Purdue University (普渡大学)
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 4 pages for body, 3 pages for appendix, 11 figures. Accepted to CVPR 2025 Workshop on Multimodal Foundation Models for Biomedicine: Challenges and Opportunities(MMFM-BIOMED)
点击查看摘要
Abstract:This paper aims to retrieve proteins with similar structures and semantics from large-scale protein dataset, facilitating the functional interpretation of protein structures derived by structural determination methods like cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of vision-language models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations using contrastive learning. For model training, we propose a large-scale dataset of approximately 200,000 protein-caption pairs with rich functional descriptors. We evaluate our model in both in-domain and more challenging cross-database retrieval on the Protein Data Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In both cases, our approach demonstrates promising zero-shot retrieval performance, highlighting the potential of multimodal foundation models for structure-function understanding in protein biology.
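The CLIP-style objective is the standard symmetric InfoNCE loss; a minimal sketch over a batch of structure and caption embeddings is given below (the encoders themselves are omitted):

```python
import torch
import torch.nn.functional as F

def clip_loss(struct_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (protein structure, caption) pairs;
    matched pairs sit on the diagonal of the similarity matrix."""
    s = F.normalize(struct_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature
    targets = torch.arange(len(s), device=s.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```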
zh
人工智能
[AI-0] ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
【速读】:该论文试图解决AI系统在处理诸如包裹配送路径规划、机组排班、工厂生产计划和电网平衡等硬优化问题时的算法工程性能评估问题。其解决方案的关键是引入ALE-Bench,这是一个基于得分的算法编程竞赛新基准,旨在评估AI系统在计算上困难且无已知精确解的优化问题上的表现。ALE-Bench通过鼓励长期时间跨度内的迭代解决方案优化,并支持交互式代理架构以利用测试运行反馈和可视化,从而推动AI在跨问题一致性和长周期问题求解能力方面的进步。
链接: https://arxiv.org/abs/2506.09050
作者: Yuki Imajuku,Kohki Horie,Yoichi Iwata,Kensho Aoki,Naohiro Takahashi,Takuya Akiba
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages
点击查看摘要
Abstract:How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.
zh
[AI-1] Agent ic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation
【速读】:该论文试图解决多大型语言模型(Large Language Models, LLMs)在处理复杂、高维任务时,因依赖静态、手动设计的多智能体配置而受限的问题。其解决方案的关键在于提出一种称为“智能体神经网络”(Agentic Neural Network, ANN)的框架,该框架将多智能体协作建模为分层神经网络结构,通过动态任务分解与迭代反馈优化,实现智能体角色、提示和协调的自适应演化,从而提升系统的准确性与适应性。
链接: https://arxiv.org/abs/2506.09046
作者: Xiaowen Ma,Chenyang Lin,Yao Zhang,Volker Tresp,Yunpu Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Leveraging multiple Large Language Models(LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network(ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative “team” focused on a specific subtask. Agentic Neural Network follows a two-phase optimization strategy: (1) Forward Phase-Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase-Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables ANN to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across four benchmark datasets, ANN surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements. Our findings indicate that ANN provides a scalable, data-driven framework for multi-agent systems, combining the collaborative capabilities of LLMs with the efficiency and flexibility of neural network principles. We plan to open-source the entire framework.
zh
[AI-2] AbstentionBench: Reasoning LLM s Fail on Unanswerable Questions
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在面对不确定、不明确或无法回答的问题时,如何有效进行自我约束(abstention)的问题。现有模型在复杂问题求解方面表现优异,但在处理不确定性时仍存在显著不足,且模型规模的扩大对此问题的改善作用有限。解决方案的关键在于构建一个全面评估模型 abstention 能力的基准测试框架——AbstentionBench,该框架覆盖20个多样化数据集,涵盖未知答案、描述不充分、错误前提、主观解释和过时信息等问题类型,旨在推动对 LLM 可靠性的研究与改进。
链接: https://arxiv.org/abs/2506.09038
作者: Polina Kirichenko,Mark Ibrahim,Kamalika Chaudhuri,Samuel J. Bell
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which can be underspecified, ill-posed, or fundamentally unanswerable, require LLMs to reason about uncertainty and selectively abstain – i.e., refuse to answer definitively. However, abstention remains understudied, without a systematic evaluation framework for modern LLMs. In this work, we introduce AbstentionBench, a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information. Evaluating 20 frontier LLMs reveals abstention is an unsolved problem, and one where scaling models is of little use. While recent reasoning LLMs have shown impressive results in complex problem solving, surprisingly, we find that reasoning fine-tuning degrades abstention (by 24% on average), even for math and science domains on which reasoning models are explicitly trained. We find that while a carefully crafted system prompt can boost abstention in practice, it does not resolve models’ fundamental inability to reason about uncertainty. We release AbstentionBench to foster research into advancing LLM reliability.
zh
[AI-3] FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)微调过程中面临的GPU内存瓶颈问题,即使用一阶优化器如Adam进行反向传播会显著增加内存消耗。为了解决这一问题,研究者提出了FZOO(Fast Zeroth-Order Optimizer),其关键在于通过采用分批单边估计来减少达到收敛所需的前向传递次数,并根据批次损失的标准差自适应调整步长,同时利用Rademacher随机向量扰动和CUDA并行处理加速每批次的计算。这种设计使得FZOO在保持接近Adam优化器速度的同时,显著降低了内存消耗和前向传递次数。
链接: https://arxiv.org/abs/2506.09034
作者: Sizhe Dang,Yangyang Guo,Yanjun Zhao,Haishan Ye,Xiaodong Zheng,Guang Dai,Ivor Tsang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually require many more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer toward Adam-Scale Speed. FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step sizes based on the standard deviation of batch losses. It also accelerates per-batch computation through the use of Rademacher random vector perturbations coupled with CUDA’s parallel processing. Extensive experiments on diverse models, including RoBERTa-large, OPT (350M-66B), Phi-2, and Llama3, across 11 tasks validate FZOO’s effectiveness. On average, FZOO outperforms MeZO by 3 percent in accuracy while requiring 3 times fewer forward passes. For RoBERTa-large, FZOO achieves average improvements of 5.6 percent in accuracy and an 18 times reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO’s formal equivalence to a normalized-SGD update rule and its convergence guarantees. FZOO integrates smoothly into PEFT techniques, enabling even larger memory savings. Overall, our results make single-GPU, high-speed, full-parameter fine-tuning practical and point toward future work on memory-efficient pre-training.
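A rough sketch of the zeroth-order update described above follows. The batched one-sided Rademacher estimate matches the abstract; the exact step-size schedule is the paper's, so the adaptation rule below is a stand-in for illustration:

```python
import torch

@torch.no_grad()
def fzoo_step(params, loss_fn, eps=1e-3, n_dirs=8, base_lr=1e-6):
    """One FZOO-style zeroth-order step: one-sided finite differences along
    Rademacher directions; step size adapted from the std of batch losses
    (a stand-in rule; see the paper for the exact schedule)."""
    flat = torch.cat([p.reshape(-1) for p in params])
    loss0 = loss_fn()                       # baseline forward pass
    grad, losses = torch.zeros_like(flat), []
    for _ in range(n_dirs):
        u = torch.randint(0, 2, flat.shape, device=flat.device).float() * 2 - 1
        _assign(params, flat + eps * u)
        loss_u = loss_fn()                  # one extra forward pass per direction
        losses.append(loss_u)
        grad += (loss_u - loss0) / eps * u  # one-sided estimate
    grad /= n_dirs
    lr = base_lr / (torch.stack(losses).std() + 1e-8)  # adaptive step size
    _assign(params, flat - lr * grad)

def _assign(params, flat):
    i = 0
    for p in params:
        p.copy_(flat[i:i + p.numel()].view_as(p))
        i += p.numel()
```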
zh
[AI-4] Edit Flows: Flow Matching with Edit Operations
【速读】:该论文试图解决非自回归生成模型在生成变长序列时的局限性,这类模型通常需要强制固定的、基于词元的结构,而自回归模型则能自然地生成变长序列。解决方案的关键在于提出Edit Flows,这是一种通过插入、删除和替换等编辑操作在序列空间上定义离散流的非自回归模型,其通过在序列空间上的连续时间马尔可夫链建模这些操作,实现了更灵活、与序列数据结构更一致的位置相关生成。
链接: https://arxiv.org/abs/2506.09018
作者: Marton Havasi,Brian Karrer,Itai Gat,Ricky T. Q. Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations-insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.
zh
[AI-5] Propositional Logic for Probing Generalization in Neural Networks
【速读】:该论文试图解决神经网络在获取和表示符号规则方面的能力问题,特别是其在逻辑推理任务中的泛化行为。研究聚焦于三种关键的神经架构(Transformer、图卷积网络和LSTM)在命题逻辑基础上的结构化任务中的表现。解决方案的关键在于引入了一个平衡的数据集扩展,以消除表面模式并测试未见过的运算符组合,从而更准确地评估模型在分布外数据上的泛化能力。研究发现,尽管所有模型在分布内表现良好,但面对涉及否定等复杂模式时仍存在显著挑战,尤其Transformer在未引入结构偏差的情况下无法进行否定的组合性应用。这表明标准架构在学习系统性逻辑运算符表示方面存在持续局限,需要更强的归纳偏置来支持鲁棒的基于规则的推理。
链接: https://arxiv.org/abs/2506.08978
作者: Anna Langedijk,Jaap Jumelet,Willem Zuidema
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The extent to which neural networks are able to acquire and represent symbolic rules remains a key topic of research and debate. Much current work focuses on the impressive capabilities of large language models, as well as their often ill-understood failures on a wide range of reasoning tasks. In this paper, in contrast, we investigate the generalization behavior of three key neural architectures (Transformers, Graph Convolution Networks and LSTMs) in a controlled task rooted in propositional logic. The task requires models to generate satisfying assignments for logical formulas, making it a structured and interpretable setting for studying compositionality. We introduce a balanced extension of an existing dataset to eliminate superficial patterns and enable testing on unseen operator combinations. Using this dataset, we evaluate the ability of the three architectures to generalize beyond the training distribution. While all models perform well in-distribution, we find that generalization to unseen patterns, particularly those involving negation, remains a significant challenge. Transformers fail to apply negation compositionally, unless structural biases are introduced. Our findings highlight persistent limitations in the ability of standard architectures to learn systematic representations of logical operators, suggesting the need for stronger inductive biases to support robust rule-based reasoning.
zh
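论文的任务设定(给定命题公式,生成使其为真的赋值)本身可以用一个与论文数据集无关的玩具示意来说明:

```python
import itertools

def satisfying_assignments(formula, variables):
    """枚举所有使公式为真的赋值;formula 是接收赋值字典的布尔函数。"""
    sols = []
    for values in itertools.product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, values))
        if formula(assign):
            sols.append(assign)
    return sols

# 示例公式:(A ∧ ¬B) ∨ C
f = lambda a: (a["A"] and not a["B"]) or a["C"]
for s in satisfying_assignments(f, ["A", "B", "C"]):
    print(s)
```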
[AI-6] Tailored Architectures for Time Series Forecasting: Evaluating Deep Learning Models on Gaussian Process-Generated Data IJCNN25
【速读】:该论文试图解决时间序列预测中模型性能与数据特征之间关联性不明确的问题,即如何将特定的时间序列特征与模型架构的优势相匹配。其解决方案的关键在于引入一个基于高斯过程生成的新型数据集,该数据集具有明确且已知的时间序列特性,以支持对模型适应性的针对性评估,同时提出了一种模块化架构的TimeFlex模型,旨在有效处理多样的时间动态,包括趋势和周期性模式。
链接: https://arxiv.org/abs/2506.08977
作者: Victoria Hankemeier,Malte Schilling
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IJCNN25, Code: this https URL
点击查看摘要
Abstract:Developments in Deep Learning have significantly improved time series forecasting by enabling more accurate modeling of complex temporal dependencies inherent in sequential data. The effectiveness of such models is often demonstrated on limited sets of specific real-world data. Although this allows for comparative analysis, it still does not demonstrate how specific data characteristics align with the architectural strengths of individual models. Our research aims at uncovering clear connections between time series characteristics and particular models. We introduce a novel dataset generated using Gaussian Processes, specifically designed to display distinct, known characteristics for targeted evaluations of model adaptability to them. Furthermore, we present TimeFlex, a new model that incorporates a modular architecture tailored to handle diverse temporal dynamics, including trends and periodic patterns. This model is compared to current state-of-the-art models, offering a deeper understanding of how models perform under varied time series conditions.
zh
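论文数据集的构造思路(从核函数已知的高斯过程先验中采样,从而得到趋势、周期等特性完全可控的序列)大致如下;核函数与超参数均为示意性假设:

```python
import numpy as np

def sample_gp(kernel, t, n_samples=1, jitter=1e-6, rng=None):
    """从零均值高斯过程先验采样:f ~ N(0, K(t, t))。"""
    rng = np.random.default_rng(0) if rng is None else rng
    K = kernel(t[:, None], t[None, :]) + jitter * np.eye(len(t))
    L = np.linalg.cholesky(K)                  # K = L Lᵀ
    return L @ rng.standard_normal((len(t), n_samples))

t = np.linspace(0.0, 10.0, 100)
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)                              # 平滑成分
per = lambda a, b: np.exp(-2.0 * np.sin(np.pi * np.abs(a - b) / 2.0) ** 2)  # 周期成分
series = sample_gp(lambda a, b: rbf(a, b) + per(a, b), t, n_samples=3)
print(series.shape)   # (100, 3):三条具有已知“趋势 + 周期”特性的序列
```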
[AI-7] A Survey of Link Prediction in N-ary Knowledge Graphs
【速读】:该论文试图解决N-ary Knowledge Graphs (NKGs) 中的链接预测问题,即预测n元事实中缺失的元素,以完善NKG并提升下游应用的性能。解决方案的关键在于系统地分类和分析现有的链接预测方法,并总结其性能与适用场景,从而为未来的研究提供方向。
链接: https://arxiv.org/abs/2506.08970
作者: Jiyao Wei,Saiping Guan,Da Li,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:N-ary Knowledge Graphs (NKGs) are a specialized type of knowledge graph designed to efficiently represent complex real-world facts. Unlike traditional knowledge graphs, where a fact typically involves two entities, NKGs can capture n-ary facts containing more than two entities. Link prediction in NKGs aims to predict missing elements within these n-ary facts, which is essential for completing NKGs and improving the performance of downstream applications. This task has recently gained significant attention. In this paper, we present the first comprehensive survey of link prediction in NKGs, providing an overview of the field, systematically categorizing existing methods, and analyzing their performance and application scenarios. We also outline promising directions for future research.
zh
[AI-8] GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO
【速读】:该论文旨在解决在少量样本情况下训练高性能奖励模型(reward model)的挑战,以提升基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)的效率和可扩展性。传统方法如直接偏好优化(Direct Preference Optimization, DPO)受限于样本配对效率低和数据多样性不足的问题。该研究提出的关键解决方案是引入偏好精炼(preference refinement),结合思维链(Chain-of-Thought, CoT)采样以挖掘多样且高质量的偏好关系,并采用基于困惑度的评分机制来赋予更细致的偏好等级,同时利用多级直接偏好优化(Multi-level Direct Preference Optimization, M-DPO)捕捉样本间的细微偏好差异。
链接: https://arxiv.org/abs/2506.08965
作者: Yiyang Zhao,Huiyu Bai,Xuejiao Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture finer-grained preference differences between samples. Experimental results demonstrate that the proposed method significantly enhances data efficiency and model performance, enabling reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets. This study underscores the potential of data-efficient strategies in advancing reward model optimization, offering a robust solution for low-resource RLHF applications.
zh
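其中“基于困惑度的偏好分级”一步可用如下草图说明(阈值与逐 token 对数概率均为假设数据,实际 logp 需由语言模型给出):

```python
import math

def perplexity(token_logprobs):
    """由逐 token 对数概率计算困惑度:ppl = exp(-mean(logp))。"""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def preference_level(ppl, thresholds=(1.5, 3.0, 6.0)):
    """按困惑度划分细粒度偏好等级(0 为最优);阈值为假设值。"""
    for level, th in enumerate(thresholds):
        if ppl <= th:
            return level
    return len(thresholds)

# 两条 CoT 采样回答的逐 token logp(示意数据)
good = [-0.1, -0.2, -0.15, -0.05]
bad = [-1.2, -2.0, -1.5, -1.8]
for lp in (good, bad):
    p = perplexity(lp)
    print(round(p, 3), "-> level", preference_level(p))
```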
[AI-9] Evaluating Generative Vehicle Trajectory Models for Traffic Intersection Dynamics
【速读】:该论文试图解决现有深度生成模型在交通动力学建模中无法有效评估交通工程相关指标的问题,特别是在实时微观仿真场景下对模型性能的评估不足。当前模型主要依赖轨迹重建误差等计算指标,未能充分考虑如闯红灯、非法停车等交通规则违规行为。解决方案的关键在于提出一套全面的分析工具,用于训练、运行和评估模型,并引入新的评价指标以从交通工程角度更准确地反映模型性能。
链接: https://arxiv.org/abs/2506.08963
作者: Yash Ranjan,Rahul Sengupta,Anand Rangarajan,Sanjay Ranka
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Traffic Intersections are vital to urban road networks as they regulate the movement of people and goods. However, they are regions of conflicting trajectories and are prone to accidents. Deep Generative models of traffic dynamics at signalized intersections can greatly help traffic authorities better understand the efficiency and safety aspects. At present, models are evaluated on computational metrics that primarily look at trajectory reconstruction errors. They are not evaluated online in a “live” microsimulation scenario. Further, these metrics do not adequately consider traffic engineering-specific concerns such as red-light violations, unallowed stoppage, etc. In this work, we provide a comprehensive analytics tool to train, run, and evaluate models with metrics that give better insights into model performance from a traffic engineering point of view. We train a state-of-the-art multi-vehicle trajectory forecasting model on a large dataset collected by running a calibrated scenario of a real-world urban intersection. We then evaluate the performance of the prediction models, online in a microsimulator, under unseen traffic conditions. We show that despite using ideally-behaved trajectories as input, and achieving low trajectory reconstruction errors, the generated trajectories show behaviors that break traffic rules. We introduce new metrics to evaluate such undesired behaviors and present our results.
zh
[AI-10] WIP: Large Language Model-Enhanced Smart Tutor for Undergraduate Circuit Analysis
【速读】:该论文旨在解决本科生电路分析课程中作业评估与反馈效率不足的问题,其解决方案的关键在于开发一种基于人工智能(Artificial Intelligence, AI)的智能辅导系统。该系统具备开放性问题回答和作业反馈生成能力,并通过精心设计的提示(prompts)优化不同问题的响应质量。系统部署在Microsoft Azure平台上,能够提供个性化指导并收集学生互动数据,为教师提供实时的学生学习困难洞察,从而支持更精准的课堂教学。
链接: https://arxiv.org/abs/2506.08962
作者: Liangliang Chen,Huiru Xie,Jacqueline Rohde,Ying Zhang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to 2025 Frontiers in Education (FIE) Conference
点击查看摘要
Abstract:This research-to-practice work-in-progress (WIP) paper presents an AI-enabled smart tutor designed to provide homework assessment and feedback for students in an undergraduate circuit analysis course. We detail the tutor’s design philosophy and core components, including open-ended question answering and homework feedback generation. The prompts are carefully crafted to optimize responses across different problems. The smart tutor was deployed on the Microsoft Azure platform and is currently in use in an undergraduate circuit analysis course at the School of Electrical and Computer Engineering in a large, public, research-intensive institution in the Southeastern United States. Beyond offering personalized instruction and feedback, the tutor collects student interaction data, which is summarized and shared with the course instructor. To evaluate its effectiveness, we collected student feedback, with 90.9% of responses indicating satisfaction with the tutor. Additionally, we analyze a subset of collected data on preliminary circuit analysis topics to assess tutor usage frequency for each problem and identify frequently asked questions. These insights help instructors gain real-time awareness of student difficulties, enabling more targeted classroom instruction. In future work, we will release a full analysis once the complete dataset is available after the Spring 2025 semester. We also explore the potential applications of this smart tutor across a broader range of engineering disciplines by developing improved prompts, diagram-recognition methods, and database management strategies, which remain ongoing areas of research.
zh
[AI-11] Towards Robust Deep Reinforcement Learning against Environmental State Perturbation
【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning, DRL)在环境状态扰动下的鲁棒性问题,此类扰动在具身场景中是自然存在的,但此前研究较少关注。解决方案的关键在于提出一种名为Boosted Adversarial Training (BAT)的防御框架,该框架首先通过监督学习对智能体进行调优以避免灾难性失败,随后结合强化学习进行对抗训练,从而显著提升智能体在多种情境下对环境状态扰动的鲁棒性。
链接: https://arxiv.org/abs/2506.08961
作者: Chenxu Wang,Huaping Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Adversarial attacks and robustness in Deep Reinforcement Learning (DRL) have been widely studied in various threat models; however, few consider environmental state perturbations, which are natural in embodied scenarios. To improve the robustness of DRL agents, we formulate the problem of environmental state perturbation, introducing a preliminary non-targeted attack method as a calibration adversary, and then propose a defense framework, named Boosted Adversarial Training (BAT), which first tunes the agents via supervised learning to avoid catastrophic failure and subsequently adversarially trains the agent with reinforcement learning. Extensive experimental results substantiate the vulnerability of mainstream agents under environmental state perturbations and the effectiveness of our proposed attack. The defense results demonstrate that while existing robust reinforcement learning algorithms may not be suitable, our BAT framework can significantly enhance the robustness of agents against environmental state perturbations across various situations.
zh
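摘要中的“环境状态扰动”威胁模型可以用一个简单的环境包装器来示意(接口与扰动形式均为假设;BAT 的两阶段训练即先在干净数据上监督预热策略,再在此类被扰动的环境中做对抗 RL 训练):

```python
import numpy as np

class StatePerturbationWrapper:
    """对环境返回的观测状态施加有界随机扰动的示意包装器。"""
    def __init__(self, env, eps=0.05, rng=None):
        self.env, self.eps = env, eps
        self.rng = np.random.default_rng(0) if rng is None else rng

    def _perturb(self, s):
        return np.asarray(s) + self.rng.uniform(-self.eps, self.eps, np.shape(s))

    def reset(self):
        return self._perturb(self.env.reset())

    def step(self, action):
        s, r, done, info = self.env.step(action)
        return self._perturb(s), r, done, info

class DummyEnv:                       # 最小假设环境,仅用于演示
    def reset(self): return np.zeros(3)
    def step(self, a): return np.ones(3), 1.0, True, {}

env = StatePerturbationWrapper(DummyEnv(), eps=0.1)
print(env.reset())                    # 零状态被叠加了 ±0.1 以内的扰动
```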
[AI-12] IntTrajSim: Trajectory Prediction for Simulating Multi-Vehicle driving at Signalized Intersections
【速读】:该论文试图解决传统基于规则的交通仿真工具在模拟真实驾驶行为方面的局限性,特别是在交通交叉口处的宏观和微观统计特性方面。其关键解决方案是提出与交通工程相关的评估指标,并构建一个“仿真闭环”流程来评估生成式轨迹预测模型;同时,提出一种基于多头自注意力机制的轨迹预测模型,该模型整合了信号信息,在评估指标上优于之前的方法。
链接: https://arxiv.org/abs/2506.08957
作者: Yash Ranjan,Rahul Sengupta,Anand Rangarajan,Sanjay Ranka
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Traffic simulators are widely used to study the operational efficiency of road infrastructure, but their rule-based approach limits their ability to mimic real-world driving behavior. Traffic intersections are critical components of the road infrastructure, both in terms of safety risk (nearly 28% of fatal crashes and 58% of nonfatal crashes happen at intersections) as well as the operational efficiency of a road corridor. This raises an important question: can we create a data-driven simulator that can mimic the macro- and micro-statistics of the driving behavior at a traffic intersection? Deep Generative Modeling-based trajectory prediction models provide a good starting point to model the complex dynamics of vehicles at an intersection. But they are not tested in a “live” micro-simulation scenario and are not evaluated on traffic engineering-related metrics. In this study, we propose traffic engineering-related metrics to evaluate generative trajectory prediction models and provide a simulation-in-the-loop pipeline to do so. We also provide a multi-headed self-attention-based trajectory prediction model that incorporates the signal information, which outperforms our previous models on the evaluation metrics.
zh
[AI-13] Intention-Conditioned Flow Occupancy Models
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中预训练大规模模型所面临的根本性挑战,即如何有效建模具有长期依赖性的动作序列,以提升样本效率和鲁棒性。其解决方案的关键在于构建一个基于流匹配(flow matching)的概率模型,用于预测智能体在遥远未来可能访问的状态分布(即占用度量),同时引入潜在变量来捕捉用户意图,从而增强模型的表达能力并支持广义策略改进。该方法被称为意图条件流占用模型(Intention-conditioned Flow Occupancy Models, InFOM)。
链接: https://arxiv.org/abs/2506.08902
作者: Chongyi Zheng,Seohong Park,Sergey Levine,Benjamin Eysenbach
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Comparing with alternative methods for pre-training, our experiments on 36 state-based and 4 image-based benchmark tasks demonstrate that the proposed method achieves a 1.8× median improvement in returns and increases success rates by 36%. Website: this https URL Code: this https URL
zh
[AI-14] Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation
【速读】:该论文旨在解决多目标组合优化问题(Multi-Objective Combinatorial Optimization Problems, MOCOPs)中现有深度强化学习方法因平等处理所有子问题而限制解空间有效探索的问题,从而导致性能不佳。其解决方案的关键在于提出POCCO框架,该框架通过自适应选择子问题的模型结构,并基于偏好信号而非显式奖励值进行优化,实现了更高效的求解。具体而言,POCCO设计了一个条件计算模块,将子问题路由至专用的神经网络架构,并引入一种基于偏好的优化算法,以学习优胜与失败解之间的成对偏好。
链接: https://arxiv.org/abs/2506.08898
作者: Mingfeng Fan,Jianan Zhou,Yifeng Zhang,Yaoxin Wu,Jinbiao Chen,Guillaume Adrien Sartoretti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures, under review
点击查看摘要
Abstract:Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance. To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values. Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures. Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions. We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs. Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization.
zh
[AI-15] SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
【速读】:该论文旨在解决长推理模型在解码过程中计算效率与注意力机制稀疏性之间的矛盾问题。其解决方案的关键在于提出SeerAttention-R,这是一个针对长解码任务优化的稀疏注意力框架,通过保留自蒸馏门控机制以学习注意力稀疏性,并移除查询池化以适应自回归解码,同时采用轻量级插件式门控结构,使其能够无缝集成到现有预训练模型中而无需修改原始参数。
链接: https://arxiv.org/abs/2506.08889
作者: Yizhao Gao,Shuming Guo,Shijie Cao,Yuqing Xia,Yu Cheng,Lei Wang,Lingxiao Ma,Yutao Sun,Tianzhu Ye,Li Dong,Hayden Kwok-Hay So,Yu Hua,Ting Cao,Fan Yang,Mao Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained model without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with 4K token budget in AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on H100 GPU at 90% sparsity. Code is available at: this https URL.
zh
[AI-16] Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task
【速读】:该论文试图解决使用大型语言模型(Large Language Model, LLM)辅助写作对神经机制和行为表现的影响问题。其解决方案的关键在于通过对比实验设计,将参与者分为LLM、搜索引擎和纯脑力组,并在不同阶段重新分配组别,结合脑电图(EEG)和自然语言处理(NLP)技术,分析认知负荷、语言模式及行为表现,以评估外部工具使用对认知活动和学习效果的影响。
链接: https://arxiv.org/abs/2506.08872
作者: Nataliya Kosmyna,Eugene Hauptmann,Ye Tong Yuan,Jessica Situ,Xian-Hao Liao,Ashly Vivian Beresnitzky,Iris Braunstein,Pattie Maes
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 206 pages, 92 figures, 4 tables and appendix
点击查看摘要
Abstract:This study explores the neural and behavioral consequences of LLM-assisted essay writing. Participants were divided into three groups: LLM, Search Engine, and Brain-only (no tools). Each completed three sessions under the same condition. In a fourth session, LLM users were reassigned to Brain-only group (LLM-to-Brain), and Brain-only users were reassigned to LLM condition (Brain-to-LLM). A total of 54 participants took part in Sessions 1-3, with 18 completing session 4. We used electroencephalography (EEG) to assess cognitive load during essay writing, and analyzed essays using NLP, as well as scoring essays with the help from human teachers and an AI judge. Across groups, NERs, n-gram patterns, and topic ontology showed within-group homogeneity. EEG revealed significant differences in brain connectivity: Brain-only participants exhibited the strongest, most distributed networks; Search Engine users showed moderate engagement; and LLM users displayed the weakest connectivity. Cognitive activity scaled down in relation to external tool use. In session 4, LLM-to-Brain participants showed reduced alpha and beta connectivity, indicating under-engagement. Brain-to-LLM users exhibited higher memory recall and activation of occipito-parietal and prefrontal areas, similar to Search Engine users. Self-reported ownership of essays was the lowest in the LLM group and the highest in the Brain-only group. LLM users also struggled to accurately quote their own work. While LLMs offer immediate convenience, our findings highlight potential cognitive costs. Over four months, LLM users consistently underperformed at neural, linguistic, and behavioral levels. These results raise concerns about the long-term educational implications of LLM reliance and underscore the need for deeper inquiry into AI’s role in learning.
zh
[AI-17] On The Impact of Merge Request Deviations on Code Review Practices
【速读】:该论文试图解决工业环境中Merge Request (MR)工作流偏离标准化代码审查流程的问题,这些偏离的MR(如草稿、变基或依赖更新)会干扰代码审查分析的准确性。解决方案的关键在于提出一种少量样本学习检测方法,能够有效识别这些偏离情况,从而提升基于机器学习的审查分析模型的性能,减少偏差并提高特征重要性的可靠性。
链接: https://arxiv.org/abs/2506.08860
作者: Samah Kansab,Francis Bordeleau,Ali Tizghadam
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Code review is a key practice in software engineering, ensuring quality and collaboration. However, industrial Merge Request (MR) workflows often deviate from standardized review processes, with many MRs serving non-review purposes (e.g., drafts, rebases, or dependency updates). We term these cases deviations and hypothesize that ignoring them biases analytics and undermines ML models for review analysis. We identify seven deviation categories, occurring in 37.02% of MRs, and propose a few-shot learning detection method (91% accuracy). By excluding deviations, ML models predicting review completion time improve performance in 53.33% of cases (up to 2.25x) and exhibit significant shifts in feature importance (47% overall, 60% top-k). Our contributions include: (1) a taxonomy of MR deviations, (2) an AI-driven detection approach, and (3) empirical evidence of their impact on ML-based review analytics. This work aids practitioners in optimizing review efforts and ensuring reliable insights.
zh
[AI-18] FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency
【速读】:该论文旨在解决基于生成模型的视觉-运动策略在实时机器人系统中因多步采样带来的高推理成本问题。其关键解决方案是提出FreqPolicy,通过在流模型基础上引入频率一致性约束,使动作模型能够有效捕捉时间结构,并支持高效、高质量的一步动作生成,同时通过自适应一致性损失来建模机器人操作任务中的结构性时间变化。
链接: https://arxiv.org/abs/2506.08822
作者: Yifei Su,Ning Liu,Dong Chen,Zhen Zhao,Kun Wu,Meng Li,Zhiyuan Xu,Zhengping Che,Jian Tang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation attributed to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. To address this issue, existing approaches accelerate the sampling process in generative modeling-based visuomotor policies by adapting acceleration techniques originally developed for image generation. Despite this progress, a major distinction remains: image generation typically involves producing independent samples without temporal dependencies, whereas robotic manipulation involves generating time-series action trajectories that require continuity and temporal coherence. To effectively exploit temporal information in robotic manipulation, we propose FreqPolicy, a novel approach that first imposes frequency consistency constraints on flow-based visuomotor policies. Our work enables the action model to capture temporal structure effectively while supporting efficient, high-quality one-step action generation. We introduce a frequency consistency constraint that enforces alignment of frequency-domain action features across different timesteps along the flow, thereby promoting convergence of one-step action generation toward the target distribution. In addition, we design an adaptive consistency loss to capture structural temporal variations inherent in robotic manipulation tasks. We assess FreqPolicy on 53 tasks across 3 simulation benchmarks, proving its superiority over existing one-step action generators. We further integrate FreqPolicy into the vision-language-action (VLA) model and achieve acceleration without performance degradation on the 40 tasks of Libero. Besides, we show efficiency and effectiveness in real-world robotic scenarios with an inference frequency 93.5Hz. The code will be publicly available.
zh
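其中的“频域一致性约束”在代码层面大致是对不同流时间步生成的动作轨迹对齐幅度谱,如下草图所示(张量形状与损失形式均为示意性假设):

```python
import torch

def frequency_consistency_loss(actions_a, actions_b):
    """对两段动作轨迹 (B, T, D) 沿时间维做 rFFT,
    对齐其幅度谱,鼓励不同时间步生成的动作具有一致的时间结构。"""
    spec_a = torch.fft.rfft(actions_a, dim=1).abs()
    spec_b = torch.fft.rfft(actions_b, dim=1).abs()
    return torch.mean((spec_a - spec_b) ** 2)

# 用法示意:同一条流上相邻时间步预测出的两条动作轨迹
traj_t = torch.randn(4, 16, 7)            # batch=4, horizon=16, 7 维动作
traj_t2 = traj_t + 0.01 * torch.randn_like(traj_t)
print(frequency_consistency_loss(traj_t, traj_t2).item())
```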
[AI-19] Towards Biosignals-Free Autonomous Prosthetic Hand Control via Imitation Learning
【速读】:该论文试图解决传统表面肌电(sEMG)和半自主控制方法在假肢手操作中需要用户主动生成肌电信号所带来的身体和心理负担问题。其解决方案的关键是开发一种完全自主的控制系统,该系统通过腕部安装的摄像头识别物体并自动执行抓取与释放动作,无需用户主动生成控制信号,从而显著降低使用过程中的认知和生理负荷。
链接: https://arxiv.org/abs/2506.08795
作者: Kaijie Shi,Wanglong Lu,Hanli Zhao,Vinicius Prado da Fonseca,Ting Zou,Xianta Jiang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Limb loss affects millions globally, impairing physical function and reducing quality of life. Most traditional surface electromyographic (sEMG) and semi-autonomous methods require users to generate myoelectric signals for each control, imposing physically and mentally taxing demands. This study aims to develop a fully autonomous control system that enables a prosthetic hand to automatically grasp and release objects of various shapes using only a camera attached to the wrist. By placing the hand near an object, the system will automatically execute grasping actions with a proper grip force in response to the hand’s movements and the environment. To release the object being grasped, just naturally place the object close to the table and the system will automatically open the hand. Such a system would provide individuals with limb loss with a very easy-to-use prosthetic control interface and greatly reduce mental effort while using. To achieve this goal, we developed a teleoperation system to collect human demonstration data for training the prosthetic hand control model using imitation learning, which mimics the prosthetic hand actions from human. Through training the model using only a few objects’ data from one single participant, we have shown that the imitation learning algorithm can achieve high success rates, generalizing to more individuals and unseen objects with a variation of weights. The demonstrations are available at this https URL.
zh
[AI-20] Do Generative AI Tools Ensure Green Code? An Investigative Study ICSE’24
【速读】:该论文试图解决AI生成代码在可持续性方面的不足问题,特别是其在采用可持续编码实践方面的环境友好性。研究指出,当前主流的生成式 AI 工具(如 ChatGPT、BARD 和 Copilot)在默认状态下表现出非绿色行为,缺乏对资源优化和环境影响的充分考虑。解决方案的关键在于深入分析AI生成代码的可持续性缺陷,并探索有效的改进策略,以推动更加环保和可持续的软件开发实践。
链接: https://arxiv.org/abs/2506.08790
作者: Samarth Sikand,Rohit Mehra,Vibhu Saujanya Sharma,Vikrant Kaulgud,Sanjay Podder,Adam P. Burden
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 4 pages. To be published in the proceedings of 2nd International Workshop on Responsible AI Engineering (RAIE '24), co-located with ICSE '24, Lisbon, Portugal
点击查看摘要
Abstract:Software sustainability is emerging as a primary concern, aiming to optimize resource utilization, minimize environmental impact, and promote a greener, more resilient digital ecosystem. The sustainability or “greenness” of software is typically determined by the adoption of sustainable coding practices. With a maturing ecosystem around generative AI, many software developers now rely on these tools to generate code using natural language prompts. Despite their potential advantages, there is a significant lack of studies on the sustainability aspects of AI-generated code. Specifically, how environmentally friendly is the AI-generated code based upon its adoption of sustainable coding practices? In this paper, we present the results of an early investigation into the sustainability aspects of AI-generated code across three popular generative AI tools - ChatGPT, BARD, and Copilot. The results highlight the default non-green behavior of tools for generating code, across multiple rules and scenarios. It underscores the need for further in-depth investigations and effective remediation strategies.
zh
[AI-21] POLARON: Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration
【速读】:该论文旨在解决AI模型复杂性增加所带来的硬件灵活性不足问题,特别是在能量受限的边缘平台中对多样化精度格式支持的需求。其解决方案的关键在于提出PARV-CE架构,该架构采用SIMD增强的多精度乘加(MAC)引擎,通过统一的数据路径支持4/8/16位定点、浮点和posit格式,并结合层自适应精度策略,以匹配计算精度与工作负载敏感性,从而优化性能和能耗。此外,PARV-CE通过量化感知执行与可重构SIMD流水线的集成,实现了高吞吐量处理,同时通过软硬件协同设计降低开销。
链接: https://arxiv.org/abs/2506.08785
作者: Mukul Lokhande,Santosh Kumar Vishvakarma
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The increasing complexity of AI models requires flexible hardware capable of supporting diverse precision formats, particularly for energy-constrained edge platforms. This work presents PARV-CE, a SIMD-enabled, multi-precision MAC engine that performs efficient multiply-accumulate operations using a unified data-path for 4/8/16-bit fixed-point, floating point, and posit formats. The architecture incorporates a layer-adaptive precision strategy to align computational accuracy with workload sensitivity, optimizing both performance and energy usage. PARV-CE integrates quantization-aware execution with a reconfigurable SIMD pipeline, enabling high-throughput processing with minimal overhead through hardware-software co-design. The results demonstrate up to 2x improvement in PDP and 3x reduction in resource usage compared to SoTA designs, while retaining accuracy within 1.8% of the FP32 baseline. The architecture supports both on-device training and inference across a range of workloads, including DNNs, RNNs, RL, and Transformer models. The empirical analysis establishes the PARV-CE-incorporated POLARON as a scalable and energy-efficient solution for precision-adaptive AI acceleration at the edge.
zh
[AI-22] Multimodal Representation Alignment for Cross-modal Information Retrieval
【速读】:该论文试图解决跨模态检索中的特征对齐问题(feature alignment),即在不同模态之间找到对应的表示,例如根据文本编码检索语义匹配的图像或反之。其解决方案的关键在于分析视觉与文本嵌入之间的几何关系,并通过多种相似性度量(包括四种标准度量和两种学习得到的度量)进行特征对齐。研究发现,Wasserstein距离可以作为模态间隙的有效度量,而余弦相似度在特征对齐任务中表现优于其他度量方法,同时传统架构如多层感知机不足以捕捉图像与文本表示间的复杂交互。
链接: https://arxiv.org/abs/2506.08774
作者: Fan Xu,Luis A. Leiva
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem. For example, given a sentence encoded by a language model, retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can serve as an informative measure of the modality gap, while cosine similarity consistently outperforms alternative metrics in feature alignment tasks. Furthermore, we observe that conventional architectures such as multilayer perceptrons are insufficient for capturing the complex interactions between image and text representations. Our study offers novel insights and practical considerations for researchers working in multimodal information retrieval, particularly in real-world, cross-modal applications.
zh
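文中两类关键度量可以用如下草图复现其计算方式(嵌入为随机构造的示意数据;论文中 Wasserstein 距离的具体形式未必是这里“逐维一维距离取平均”的近似):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
txt = rng.normal(size=(100, 64))                     # “文本”嵌入(示意数据)
img = txt + 0.5 + 0.3 * rng.normal(size=(100, 64))   # 配对的“图像”嵌入,带模态偏移

# 1) 余弦相似度做跨模态检索:每条文本取相似度最高的图像
t = txt / np.linalg.norm(txt, axis=1, keepdims=True)
i = img / np.linalg.norm(img, axis=1, keepdims=True)
top1 = (t @ i.T).argmax(axis=1)
print("Recall@1:", (top1 == np.arange(100)).mean())

# 2) 逐维一维 Wasserstein 距离取平均,近似刻画“模态间隙”
gap = np.mean([wasserstein_distance(txt[:, d], img[:, d]) for d in range(64)])
print("approx. modality gap:", round(gap, 3))
```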
[AI-23] Bayesian Inverse Physics for Neuro-Symbolic Robot Learning
【速读】:该论文旨在解决现实世界机器人应用中面临的适应性、可解释性和数据效率问题,尤其是在未知和动态环境中的高效可靠运行难题。其解决方案的关键在于构建一种融合数据驱动学习与有意识、结构化推理的混合神经符号架构,具体包括利用可微分物理进行高效世界建模、基于贝叶斯推断的不确定性感知决策以及元学习以实现对新任务的快速适应,从而在神经模型中嵌入物理符号推理,使机器人能够超越训练数据进行泛化、处理新情境并持续扩展知识。
链接: https://arxiv.org/abs/2506.08756
作者: Octavio Arriaga,Rebecca Adam,Melvin Laux,Lisa Gutzeit,Marco Ragni,Jan Peters,Frank Kirchner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Real-world robotic applications, from autonomous exploration to assistive technologies, require adaptive, interpretable, and data-efficient learning paradigms. While deep learning architectures and foundation models have driven significant advances in diverse robotic applications, they remain limited in their ability to operate efficiently and reliably in unknown and dynamic environments. In this position paper, we critically assess these limitations and introduce a conceptual framework for combining data-driven learning with deliberate, structured reasoning. Specifically, we propose leveraging differentiable physics for efficient world modeling, Bayesian inference for uncertainty-aware decision-making, and meta-learning for rapid adaptation to new tasks. By embedding physical symbolic reasoning within neural models, robots could generalize beyond their training data, reason about novel situations, and continuously expand their knowledge. We argue that such hybrid neuro-symbolic architectures are essential for the next generation of autonomous systems, and to this end, we provide a research roadmap to guide and accelerate their development.
zh
[AI-24] A Sample Efficient Conditional Independence Test in the Presence of Discretization
【速读】:该论文试图解决在离散化数据上直接应用条件独立(Conditional Independence, CI)检验会导致错误结论的问题,以及现有通过二值化观测数据推断潜在变量间正确CI关系的方法因信息丢失而性能下降的问题。解决方案的关键在于提出一种无需依赖二值化过程的样本高效CI检验方法,通过解决广义矩方法(Generalized Method of Moments, GMM)中的过度识别约束问题,建立潜在连续变量间的独立关系,并基于节点回归推导出合适的检验统计量,从而正确反映CI的渐近分布。
链接: https://arxiv.org/abs/2506.08747
作者: Boyang Sun,Yu Yao,Xinshuai Dong,Zongfang Liu,Tongliang Liu,Yumou Qiu,Kun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:In many real-world scenarios, variables of interest are often represented as discretized values due to measurement limitations. Applying Conditional Independence (CI) tests directly to such discretized data, however, can lead to incorrect conclusions. To address this, recent advancements have sought to infer the correct CI relationship between the latent variables through binarizing observed data. However, this process inevitably results in a loss of information, which degrades the test’s performance. Motivated by this, this paper introduces a sample-efficient CI test that does not rely on the binarization process. We find that the independence relationships of latent continuous variables can be established by addressing an over-identifying restriction problem with the Generalized Method of Moments (GMM). Based on this insight, we derive an appropriate test statistic and establish its asymptotic distribution correctly reflecting CI by leveraging nodewise regression. Theoretical findings and empirical results across various datasets demonstrate the superiority and effectiveness of our proposed test. Our code implementation is provided in this https URL
zh
[AI-25] Bridging RDF Knowledge Graphs with Graph Neural Networks for Semantically-Rich Recommender Systems DASFAA2025
【速读】:该论文试图解决在基于图神经网络(Graph Neural Networks, GNNs)的推荐系统中,尚未充分利用W3C标准RDF知识图谱(Knowledge Graphs, KGs)中的丰富语义信息的问题。解决方案的关键在于将RDF知识图谱全面集成到GNN中,同时利用RDF对象属性提供的拓扑信息和数据类型属性提供的内容信息,以深入评估不同GNN模型在推荐任务中的性能,并分析语义特征初始化方式及图结构异质性类型的影响。
链接: https://arxiv.org/abs/2506.08743
作者: Michael Färber,David Lamprecht,Yuni Susanti
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注: Accepted at DASFAA 2025
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have substantially advanced the field of recommender systems. However, despite the creation of more than a thousand knowledge graphs (KGs) under the W3C standard RDF, their rich semantic information has not yet been fully leveraged in GNN-based recommender systems. To address this gap, we propose a comprehensive integration of RDF KGs with GNNs that utilizes both the topological information from RDF object properties and the content information from RDF datatype properties. Our main focus is an in-depth evaluation of various GNNs, analyzing how different semantic feature initializations and types of graph structure heterogeneity influence their performance in recommendation tasks. Through experiments across multiple recommendation scenarios involving multi-million-node RDF graphs, we demonstrate that harnessing the semantic richness of RDF KGs significantly improves recommender systems and lays the groundwork for GNN-based recommender systems for the Linked Open Data cloud. The code and data are available on our GitHub repository: this https URL
zh
[AI-26] Exploration by Random Reward Perturbation
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中探索不足的问题,特别是在稀疏和密集奖励场景下,智能体容易陷入局部最优且样本效率低下的问题。解决方案的关键在于提出一种名为随机奖励扰动(Random Reward Perturbation, RRP)的新型探索策略,通过向环境奖励添加零均值噪声来增强训练过程中的策略多样性,从而扩大探索范围。RRP具有通用性、轻量级特点,并能与基于动作扰动的探索策略兼容,实现对探索效果的附加提升。
链接: https://arxiv.org/abs/2506.08737
作者: Haozhe Ma,Guoji Fu,Zhengding Luo,Jiele Wu,Tze-Yun Leong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce Random Reward Perturbation (RRP), a novel exploration strategy for reinforcement learning (RL). Our theoretical analyses demonstrate that adding zero-mean noise to environmental rewards effectively enhances policy diversity during training, thereby expanding the range of exploration. RRP is fully compatible with the action-perturbation-based exploration strategies, such as ε-greedy, stochastic policies, and entropy regularization, providing additive improvements to exploration effects. It is general, lightweight, and can be integrated into existing RL algorithms with minimal implementation effort and negligible computational overhead. RRP establishes a theoretical connection between reward shaping and noise-driven exploration, highlighting their complementary potential. Experiments show that RRP significantly boosts the performance of Proximal Policy Optimization and Soft Actor-Critic, achieving higher sample efficiency and escaping local optima across various tasks, under both sparse and dense reward scenarios.
zh
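RRP 的机制本身非常轻量,大致等价于下面这个包装器(接口为假设;噪声零均值,因而期望回报不变):

```python
import numpy as np

class RRPWrapper:
    """给环境奖励叠加零均值高斯噪声的示意包装器:r' = r + N(0, σ²)。
    可与 ε-greedy、随机策略、熵正则等基于动作扰动的探索策略叠加使用。"""
    def __init__(self, env, sigma=0.1, rng=None):
        self.env, self.sigma = env, sigma
        self.rng = np.random.default_rng(0) if rng is None else rng

    def reset(self):
        return self.env.reset()

    def step(self, action):
        s, r, done, info = self.env.step(action)
        return s, r + self.rng.normal(0.0, self.sigma), done, info
```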
[AI-27] Breaking the ICE: Exploring promises and challenges of benchmarks for Inference Carbon Energy estimation for LLM s ICSE2025
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在能源消耗和碳排放方面的估算问题,这一问题可能阻碍组织的可持续发展目标。现有工具在监测和估算能源消耗及碳排放方面存在输入数据点过多、侵入性强、误差率高等缺点。论文提出的解决方案关键在于利用新兴的LLM基准测试和相关数据点,以提高碳排放估算的准确性并减少对系统的干扰,其核心框架R-ICE通过借鉴最先进的(State-of-the-Art, SOTA)基准来实现提示级别推理的碳排放估算,为动态LLM路由、碳核算等新兴应用场景提供了一种更实用且非侵入性的方法。
链接: https://arxiv.org/abs/2506.08727
作者: Samarth Sikand,Rohit Mehra,Priyavanshi Pathania,Nikhil Bamby,Vibhu Saujanya Sharma,Vikrant Kaulgud,Sanjay Podder,Adam P. Burden
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
备注: 5 pages. To be published in the proceedings of 9th International Workshop on Green and Sustainable Software (GREENS '25), April 29, 2025, Ottawa, Canada (Co-located with ICSE 2025)
点击查看摘要
Abstract:While Generative AI stands to be one of the fastest adopted technologies ever, studies have made evident that the usage of Large Language Models (LLMs) puts significant burden on energy grids and our environment. It may prove a hindrance to the Sustainability goals of any organization. A crucial step in any Sustainability strategy is monitoring or estimating the energy consumption of various components. While there exist multiple tools for monitoring energy consumption, there is a dearth of tools/frameworks for estimating the consumption or carbon emissions. Current drawbacks of both monitoring and estimation tools include high input data points, intrusive nature, high error margin, etc. We posit that leveraging emerging LLM benchmarks and related data points can help overcome aforementioned challenges while balancing accuracy of the emission estimations. To that extent, we discuss the challenges of current approaches and present our evolving framework, R-ICE, which estimates prompt-level inference carbon emissions by leveraging existing state-of-the-art (SOTA) benchmarks. This direction provides a more practical and non-intrusive way to enable emerging use-cases like dynamic LLM routing, carbon accounting, etc. Our promising validation results suggest that benchmark-based modelling holds great potential for inference emission estimation and warrants further exploration from the scientific community.
zh
[AI-28] Variational Autoencoder-Based Approach to Latent Feature Analysis on Efficient Representation of Power Load Monitoring Data
【速读】:该论文旨在解决智能电网中高维且不完整(High-Dimensional and Incomplete, HDI)电力负荷监测(Power Load Monitoring, PLM)数据对电力负荷预测(Power Load Forecasting, PLF)模型性能的挑战。其解决方案的关键在于提出一种基于变分自编码器(Variational Autoencoder, VAE)的潜在表征模型VAE-LF,通过编码器-解码器结构学习数据的低维潜在表示,并将HDI PLM数据分割为向量序列输入模型以生成补全数据,从而实现高效的数据补全。
链接: https://arxiv.org/abs/2506.08698
作者: Boyu Xie,Tangtang Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures
点击查看摘要
Abstract:With the development of smart grids, High-Dimensional and Incomplete (HDI) Power Load Monitoring (PLM) data challenges the performance of Power Load Forecasting (PLF) models. In this paper, we propose a latent-representation model, VAE-LF, based on the Variational Autoencoder (VAE) for efficiently representing and completing missing PLM data. VAE-LF learns a low-dimensional latent representation of the data using an Encoder-Decoder structure by splitting the HDI PLM data into vectors and feeding them sequentially into the VAE-LF model, and generates the completed data. Experiments on the UK-DALE dataset show that VAE-LF outperforms other benchmark models in both 5% and 10% sparsity test cases, with significantly lower RMSE and MAE, and especially outperforms on low-sparsity-ratio data. The method provides an efficient data-completion solution for electric load management in smart grids.
zh
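下面用 PyTorch 给出一个最小的 VAE 草图(网络结构、维度与掩码方式均为示意性假设),演示“编码到低维潜在空间、重建、仅在观测位置计损失”的补全思路:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """极简 VAE:把高维负荷向量编码到低维潜在空间再重建,用于补全缺失值。"""
    def __init__(self, dim=96, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, latent), nn.Linear(64, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # 重参数化采样
        return self.dec(z), mu, logvar

def masked_elbo(x, mask, model, beta=1.0):
    """仅在观测位置(mask=1)上计算重建误差,加 KL 正则。"""
    recon, mu, logvar = model(x * mask)
    rec = torch.sum(mask * (recon - x) ** 2) / mask.sum()
    kld = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
    return rec + beta * kld

model = VAE()
x = torch.rand(32, 96)                      # 32 条 96 维负荷向量(示意数据)
mask = (torch.rand_like(x) > 0.1).float()   # 约 10% 缺失
loss = masked_elbo(x, mask, model)
loss.backward()
print(loss.item())
```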
[AI-29] Enhancing Reasoning Capabilities of Small Language Models with Blueprints and Prompt Template Search ICML’25
【速读】:该论文试图解决小型语言模型(Small Language Models, SLMs)在推理能力上的局限性以及对提示变化的敏感性问题。其解决方案的关键在于提出一种新颖的框架,通过大型语言模型(Large Language Models, LLMs)生成的蓝图(blueprints)来增强SLMs的推理能力,这些蓝图提供了结构化的高层次推理指导,使SLMs能够系统化地处理相关问题,同时引入提示模板搜索机制以降低对提示变化的敏感性。
链接: https://arxiv.org/abs/2506.08669
作者: Dongge Han,Menglin Xia,Daniel Madrigal Diaz,Samuel Kessler,Ankur Mallick,Xuchao Zhang,Mirian Del Carmen Hipolito Garcia,Jin Xu,Victor Rühle,Saravan Rajmohan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: TTODLer-FM Workshop@ICML’25 (Tiny Titans: The next wave of On-Device Learning for Foundational Models)
点击查看摘要
Abstract:Small language models (SLMs) offer promising and efficient alternatives to large language models (LLMs). However, SLMs’ limited capacity restricts their reasoning capabilities and makes them sensitive to prompt variations. To address these challenges, we propose a novel framework that enhances SLM reasoning capabilities through LLM generated blueprints. The blueprints provide structured, high-level reasoning guides that help SLMs systematically tackle related problems. Furthermore, our framework integrates a prompt template search mechanism to mitigate the SLMs’ sensitivity to prompt variations. Our framework demonstrates improved SLM performance across various tasks, including math (GSM8K), coding (MBPP), and logic reasoning (BBH). Our approach improves the reasoning capabilities of SLMs without increasing model size or requiring additional training, offering a lightweight and deployment-friendly solution for on-device or resource-constrained environments.
zh
[AI-30] Optimizing Learned Image Compression on Scalar and Entropy-Constraint Quantization ICIP2024
【速读】:该论文试图解决在使用变分自编码器进行图像压缩时,量化过程在训练过程中难以建模的问题,因为量化操作在大多数位置产生零梯度,需要通过可微分近似来实现端到端优化,但现有方法无法正确模拟量化噪声,导致网络性能不佳。解决方案的关键在于引入一个额外的微调训练步骤:在传统端到端训练之后,利用推理阶段获得的量化潜在表示对网络的部分结构进行重新训练,从而提升编码增益,尤其在均匀标量和熵约束量化中效果显著,且不增加推理复杂度。
链接: https://arxiv.org/abs/2506.08662
作者: Florian Borzechowski,Michael Schäfer,Heiko Schwarz,Jonathan Pfaff,Detlev Marpe,Thomas Wiegand
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Accepted at ICIP2024, the IEEE International Conference on Image Processing
点击查看摘要
Abstract:The continuous improvements on image compression with variational autoencoders have lead to learned codecs competitive with conventional approaches in terms of rate-distortion efficiency. Nonetheless, taking the quantization into account during the training process remains a problem, since it produces zero derivatives almost everywhere and needs to be replaced with a differentiable approximation which allows end-to-end optimization. Though there are different methods for approximating the quantization, none of them model the quantization noise correctly and thus, result in suboptimal networks. Hence, we propose an additional finetuning training step: After conventional end-to-end training, parts of the network are retrained on quantized latents obtained at the inference stage. For entropy-constraint quantizers like Trellis-Coded Quantization, the impact of the quantizer is particularly difficult to approximate by rounding or adding noise as the quantized latents are interdependently chosen through a trellis search based on both the entropy model and a distortion measure. We show that retraining on correctly quantized data consistently yields additional coding gain for both uniform scalar and especially for entropy-constraint quantization, without increasing inference complexity. For the Kodak test set, we obtain average savings between 1% and 2%, and for the TecNick test set up to 2.2% in terms of Bjøntegaard-Delta bitrate.
zh
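论文提出的“在推理期真实量化的潜在表示上重训部分网络”可以用一个玩具自编码器的两阶段流程来说明(结构与超参数均为假设;阶段 1 用直通估计近似量化做端到端训练,阶段 2 冻结编码器、在硬量化的潜在表示上微调解码器):

```python
import torch
import torch.nn as nn

def ste_round(y):
    """直通估计(STE):前向取整、反向当作恒等,用于近似量化。"""
    return y + (torch.round(y) - y).detach()

enc, dec = nn.Linear(16, 4), nn.Linear(4, 16)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
x = torch.randn(256, 16)

# 阶段 1:常规端到端训练,量化用 STE 近似
for _ in range(200):
    loss = ((dec(ste_round(enc(x))) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 阶段 2:论文的微调步骤——在推理期真正量化得到的潜在表示上重训解码器
y_q = torch.round(enc(x)).detach()          # 推理阶段的“硬”量化潜在表示
opt_dec = torch.optim.Adam(dec.parameters(), lr=1e-3)
for _ in range(100):
    loss = ((dec(y_q) - x) ** 2).mean()
    opt_dec.zero_grad(); loss.backward(); opt_dec.step()
print(loss.item())
```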
[AI-31] Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
【速读】:该论文旨在解决实际时间序列数据中通道依赖性、采样异步性和缺失值等问题,这些问题在现有大多数模型假设(如各通道采样周期相同且测试时输入完全可观测)下无法得到有效处理。其解决方案的关键在于提出ChannelTokenFormer,一种基于Transformer的预测模型,该模型具有灵活架构,能够显式捕捉跨通道交互、适应通道级异步采样并有效处理缺失值。
链接: https://arxiv.org/abs/2506.08660
作者: Jinkwan Jang,Hyungjin Park,Jinmyeong Choi,Taesup Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Real-world time series data are inherently multivariate, often exhibiting complex inter-channel dependencies. Each channel is typically sampled at its own period and is prone to missing values due to various practical and operational constraints. These characteristics pose fundamental challenges related to channel dependency, sampling asynchrony, and missingness, all of which must be addressed to enable robust and reliable forecasting in practical settings. However, most existing architectures are built on oversimplified assumptions, such as identical sampling periods across channels and fully observed inputs at test time, which often do not hold in real-world scenarios. To bridge this gap, we propose ChannelTokenFormer, a Transformer-based forecasting model with a flexible architecture designed to explicitly capture cross-channel interactions, accommodate channel-wise asynchronous sampling, and effectively handle missing values. Extensive experiments on three benchmark datasets modified to reflect practical settings, along with one real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.
zh
[AI-32] JoFormer (Journey-based Transformer): Theory and Empirical Analysis on the Tiny Shakespeare Dataset
【速读】:该论文旨在解决Transformer模型中有效融入位置信息的问题,这是序列建模中的一个挑战性且活跃的研究领域。其解决方案的关键在于提出JoFormer架构,该架构基于一种非交换代数来组合跨位置的变换,并通过可学习的方向变换来表示相对位置,这些变换沿输入序列依次组合,从而扩展并泛化了现有的基于相对位置表示的方法。JoFormer的注意力机制是从基本原理推导得出的,能够涵盖标准方法如旋转变换作为特例。
链接: https://arxiv.org/abs/2506.08652
作者: Mahesh Godavarti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Transformers have demonstrated remarkable success in sequence modeling, yet effectively incorporating positional information remains a challenging and active area of research. In this paper, we introduce JoFormer, a journey-based Transformer architecture grounded in a recently proposed non-commutative algebra for composing transformations across positions. JoFormer represents relative positions through learnable directional transforms that are sequentially composed along the input, thereby extending and generalizing existing approaches based on relative position representations. We derive the JoFormer attention mechanism from first principles and show that it subsumes standard methods such as rotary transformations as special cases. To evaluate its effectiveness, we compare JoFormer to the RoFormer baseline on the Tiny Shakespeare character-level language modeling task. Our results demonstrate that JoFormer consistently achieves lower perplexity and faster convergence, highlighting the advantages of its more expressive, journey-based treatment of position. Notably, the per-token JoFormer is still a primitive, conceptual variant with layer-independent angles, yet it already demonstrates strong performance, underscoring its promise as a proof of concept for more expressive architectures. We conclude by discussing how JoFormer offers a principled approach to integrating positional structure into Transformer architectures. The code used in this work is available at this https URL.
zh
[AI-33] Modular Recurrence in Contextual MDPs for Universal Morphology Control
【速读】:该论文试图解决多机器人控制中的泛化问题,即如何使控制器在面对未见过的机器人动力学、运动学和拓扑结构时仍能保持高效性能。解决方案的关键在于利用机器人上下文信息的模块化结构,并通过交互推断部分可观测的上下文信息,从而提升模型在未见场景中的泛化能力。为此,研究者提出了一种模块化循环架构,并在MuJoCo机器人集合上进行了验证,结果表明该方法在四种不同环境中显著提升了未见过机器人的性能。
链接: https://arxiv.org/abs/2506.08630
作者: Laurens Engwegen,Daan Brinks,Wendelin Böhmer
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:A universal controller for any robot morphology would greatly improve computational and data efficiency. By utilizing contextual information about the properties of individual robots and exploiting their modular structure in the architecture of deep reinforcement learning agents, steps have been made towards multi-robot control. Generalization to new, unseen robots, however, remains a challenge. In this paper we hypothesize that the relevant contextual information is partially observable, but that it can be inferred through interactions for better generalization to contexts that are not seen during training. To this extent, we implement a modular recurrent architecture and evaluate its generalization performance on a large set of MuJoCo robots. The results show a substantial improved performance on robots with unseen dynamics, kinematics, and topologies, in four different environments.
zh
[AI-34] FoldA: Computing Partial-Order Alignments Using Directed Net Unfoldings
【速读】:该论文试图解决过程挖掘中的符合性检查问题,特别是传统对齐方法在处理具有高选择性和并发性的过程模型时所面临的状态空间爆炸问题,以及由于对齐固有的顺序结构而无法充分表示实际过程中并发行为的局限性。解决方案的关键在于提出一种基于有向Petri网展开的在线部分序对齐技术,称为FoldA,该技术通过利用过程模型与日志轨迹的同步产品状态空间的展开结构,有效减少了队列状态数量并提高了对并发行为的表示准确性。
链接: https://arxiv.org/abs/2506.08627
作者: Douwe Geurtjens,Xixi Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conditionally accepted at BPM 2025
点击查看摘要
Abstract:Conformance checking is a fundamental task of process mining, which quantifies the extent to which the observed process executions match a normative process model. The state-of-the-art approaches compute alignments by exploring the state space formed by the synchronous product of the process model and the trace. This often leads to state space explosion, particularly when the model exhibits a high degree of choice and concurrency. Moreover, as alignments inherently impose a sequential structure, they fail to fully represent the concurrent behavior present in many real-world processes. To address these limitations, this paper proposes a new technique for computing partial-order alignments on the fly using directed Petri net unfoldings, named FoldA. We evaluate our technique on 485 synthetic model-log pairs and compare it against Astar- and Dijkstra-alignments on 13 real-life model-log pairs and 6 benchmark pairs. The results show that our unfolding alignment, although it requires more computation time, generally reduces the number of queued states and provides a more accurate representation of concurrency.
zh
[AI-35] Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation
【速读】:该论文旨在解决传统生成式机器学习方法在建模复杂系统行为时,难以显式嵌入物理约束的问题。其解决方案的关键在于提出物理引导的流匹配(Physics-Based Flow Matching, PBFM),通过在流匹配目标中显式融入偏微分方程(PDE)残差和代数关系,确保生成结果符合物理规律。此外,PBFM在训练过程中引入时间展开机制以提高无噪声样本预测的准确性,并通过联合最小化流匹配损失和物理残差损失,避免了对权重超参数的依赖。
链接: https://arxiv.org/abs/2506.08604
作者: Giacomo Baldan,Qiang Liu,Alberto Guardone,Nils Thuerey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
备注:
点击查看摘要
Abstract:Generative machine learning methods, such as diffusion models and flow matching, have shown great potential in modeling complex system behaviors and building efficient surrogate models. However, these methods typically learn the underlying physics implicitly from data. We propose Physics-Based Flow Matching (PBFM), a novel generative framework that explicitly embeds physical constraints, both PDE residuals and algebraic relations, into the flow matching objective. We also introduce temporal unrolling at training time that improves the accuracy of the final, noise-free sample prediction. Our method jointly minimizes the flow matching loss and the physics-based residual loss without requiring hyperparameter tuning of their relative weights. Additionally, we analyze the role of the minimum noise level, σ_min, in the context of physical constraints and evaluate a stochastic sampling strategy that helps to reduce physical residuals. Through extensive benchmarks on three representative PDE problems, we show that our approach yields physical residuals up to 8× more accurate than FM, while clearly outperforming existing algorithms in terms of distributional accuracy. PBFM thus provides a principled and efficient framework for surrogate modeling, uncertainty quantification, and accelerated simulation in physics and engineering applications.
zh
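其联合目标在代码层面大致是“流匹配回归损失与物理残差损失直接相加”,如下草图所示(残差函数与张量形状均为示意;论文强调无需为两项损失调节相对权重):

```python
import torch

def pbfm_loss(v_pred, v_target, x1_pred, residual_fn):
    """PBFM 目标的示意:流匹配损失 + 物理残差损失直接相加。"""
    fm = ((v_pred - v_target) ** 2).mean()        # 标准流匹配回归目标
    phys = (residual_fn(x1_pred) ** 2).mean()     # 对无噪样本预测施加的物理残差
    return fm + phys

# 示意的代数约束:每个样本各分量之和应为 0(如某守恒律的离散形式)
residual = lambda x: x.sum(dim=-1)
v_pred, v_target = torch.randn(8, 32), torch.randn(8, 32)
x1_pred = torch.randn(8, 32, requires_grad=True)
loss = pbfm_loss(v_pred, v_target, x1_pred, residual)
loss.backward()
print(loss.item())
```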
[AI-36] WGLE: Backdoor-free and Multi-bit Black-box Watermarking for Graph Neural Networks
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在知识产权保护中的所有权验证问题,传统方法如指纹识别和黑盒水印存在计算成本高、易受模型后处理影响、无法传递额外信息以及需要多个触发图导致验证成本增加等缺陷。其解决方案的关键在于提出一种名为WGLE的新型黑盒水印范式,该方法通过引入层间边距离差异(Layer-wise Distance Difference on an Edge, LDDE)来嵌入多比特字符串作为所有权信息,无需使用后门机制,从而避免了后门攻击的风险,并实现了高效且准确的所有权验证。
链接: https://arxiv.org/abs/2506.08602
作者: Tingzhi Li,Xuefeng Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) are increasingly deployed in graph-related applications, making ownership verification critical to protect their intellectual property against model theft. Fingerprinting and black-box watermarking are two main methods. However, the former relies on determining model similarity, which is computationally expensive and prone to ownership collisions after model post-processing such as model pruning or fine-tuning. The latter embeds backdoors, exposing watermarked models to the risk of backdoor attacks. Moreover, both methods enable ownership verification but do not convey additional information. As a result, each distributed model requires a unique trigger graph, and all trigger graphs must be used to query the suspect model during verification. Multiple queries increase the financial cost and the risk of detection. To address these challenges, this paper proposes WGLE, a novel black-box watermarking paradigm for GNNs that enables embedding the multi-bit string as the ownership information without using backdoors. WGLE builds on a key insight we term Layer-wise Distance Difference on an Edge (LDDE), which quantifies the difference between the feature distance and the prediction distance of two connected nodes. By predefining positive or negative LDDE values for multiple selected edges, WGLE embeds the watermark encoding the intended information without introducing incorrect mappings that compromise the primary task. WGLE is evaluated on six public datasets and six mainstream GNN architectures along with state-of-the-art methods. The results show that WGLE achieves 100% ownership verification accuracy, an average fidelity degradation of 0.85%, comparable robustness against potential attacks, and low embedding overhead. The code is available in the repository.
zh
[AI-37] HGFormer: A Hierarchical Graph Transformer Framework for Two-Stage Colonel Blotto Games via Reinforcement Learning
【Quick Read】: This paper tackles the adversarial resource-allocation problem in two-stage Colonel Blotto games, in which two opposing agents sequentially deploy resources and then make dynamic reallocation adjustments over a network with graph topology; traditional methods struggle to attain a globally optimal strategy. The key to the solution is HGformer, a hierarchical graph Transformer framework that combines an enhanced graph Transformer encoder with structural biases and a two-agent hierarchical decision model for efficient policy generation in large-scale adversarial environments, together with a layer-by-layer feedback reinforcement learning algorithm that feeds the long-term returns of lower-level decisions back into higher-level policy optimization, bridging the coordination gap between the two decision-making stages.
Link: https://arxiv.org/abs/2506.08580
Authors: Yang Lv, Jinlong Lei, Peng Yi
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Two-stage Colonel Blotto game represents a typical adversarial resource allocation problem, in which two opposing agents sequentially allocate resources in a network topology across two phases: an initial resource deployment followed by multiple rounds of dynamic reallocation adjustments. The sequential dependency between game stages and the complex constraints imposed by the graph topology make it difficult for traditional approaches to attain a globally optimal strategy. To address these challenges, we propose a hierarchical graph Transformer framework called HGformer. By incorporating an enhanced graph Transformer encoder with structural biases and a two-agent hierarchical decision model, our approach enables efficient policy generation in large-scale adversarial environments. Moreover, we design a layer-by-layer feedback reinforcement learning algorithm that feeds the long-term returns from lower-level decisions back into the optimization of the higher-level strategy, thus bridging the coordination gap between the two decision-making stages. Experimental results demonstrate that, compared to existing hierarchical decision-making or graph neural network methods, HGformer significantly improves resource allocation efficiency and adversarial payoff, achieving superior overall performance in complex dynamic game scenarios.
zh
[AI-38] Diffusion-based Time Series Forecasting for Sewerage Systems
【Quick Read】: This paper addresses the accuracy of context-aware forecasting in sewerage systems, in particular the degradation of predictive performance under extreme weather. The key to the solution is a diffusion-based deep learning method that handles multivariate time series, capturing the complex correlations among diverse environmental signals, combined with a tailored conformal inference technique that calibrates the predictions so the resulting intervals are statistically reliable at a given confidence level.
Link: https://arxiv.org/abs/2506.08577
Authors: Nicholas A. Pearson, Francesca Cairoli, Luca Bortolussi, Davide Russo, Francesca Zanello
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the 13th Urban Drainage Modelling Conference, Innsbruck (Austria), September 2025
Click to view abstract
Abstract:We introduce a novel deep learning approach that harnesses the power of generative artificial intelligence to enhance the accuracy of contextual forecasting in sewerage systems. By developing a diffusion-based model that processes multivariate time series data, our system excels at capturing complex correlations across diverse environmental signals, enabling robust predictions even during extreme weather events. To strengthen the model’s reliability, we further calibrate its predictions with a conformal inference technique, tailored for probabilistic time series data, ensuring that the resulting prediction intervals are statistically reliable and cover the true target values with a desired confidence level. Our empirical tests on real sewerage system data confirm the model’s exceptional capability to deliver reliable contextual predictions, maintaining accuracy even under severe weather conditions.
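For the calibration step, a generic split-conformal sketch conveys the idea; the paper uses a conformal method tailored to probabilistic time series, which this simple absolute-residual version only approximates:

```python
import numpy as np

def conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """residuals_cal: |y - y_hat| on a held-out calibration set."""
    n = len(residuals_cal)
    # Finite-sample-corrected quantile of the nonconformity scores.
    q = np.quantile(residuals_cal, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return y_pred_test - q, y_pred_test + q

# Usage: lo, hi = conformal_interval(np.abs(y_cal - yhat_cal), yhat_test)
# gives ~90% marginal coverage under exchangeability.
```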
zh
[AI-39] Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
【Quick Read】: This paper addresses how to evaluate the impact of different modeling paradigms on text-to-music generation models, and in particular how to attribute the effect of individual design choices on a fair basis. The key to the solution is a systematic empirical analysis in which all models are trained from scratch on identical datasets, with identical training configurations and similar backbone architectures, isolating the independent effect of the modeling paradigm and revealing its trade-offs and characteristics in generation quality, robustness, scalability, adherence to conditioning, and audio-inpainting ability.
Link: https://arxiv.org/abs/2506.08570
Authors: Or Tal, Felix Kreuk, Yossi Adi
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:
Click to view abstract
Abstract:Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio sampled examples are available at: this https URL
zh
[AI-40] KP-PINNs: Kernel Packet Accelerated Physics Informed Neural Networks IJCAI2025
【Quick Read】: This paper targets the inaccurate and unstable numerical solutions that Physics Informed Neural Networks (PINNs) can produce when solving complex differential equations. The key to the solution is a new framework called Kernel Packet accelerated PINNs (KP-PINNs), which redefines the loss function using a reproducing kernel Hilbert space (RKHS) norm and applies the Kernel Packet (KP) method to accelerate the computation, improving both the stability and efficiency of the solver.
Link: https://arxiv.org/abs/2506.08563
Authors: Siyuan Yang, Cheng Song, Zhilu Lai, Wenjia Wang
Institution: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Computational Physics (physics.comp-ph)
Comments: Accepted to IJCAI 2025
Click to view abstract
Abstract:Differential equations are involved in modeling many engineering problems. Many efforts have been devoted to solving differential equations. Due to the flexibility of neural networks, Physics Informed Neural Networks (PINNs) have recently been proposed to solve complex differential equations and have demonstrated superior performance in many applications. While the L2 loss function is usually a default choice in PINNs, it has been shown that the corresponding numerical solution is incorrect and unstable for some complex equations. In this work, we propose a new PINNs framework named Kernel Packet accelerated PINNs (KP-PINNs), which gives a new expression of the loss function using the reproducing kernel Hilbert space (RKHS) norm and uses the Kernel Packet (KP) method to accelerate the computation. Theoretical results show that KP-PINNs can be stable across various differential equations. Numerical experiments illustrate that KP-PINNs can solve differential equations effectively and efficiently. This framework provides a promising direction for improving the stability and accuracy of PINNs-based solvers in scientific computing.
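A sketch of a PINN residual loss measured in an RKHS norm: with residuals r evaluated at collocation points X and kernel Gram matrix K, the empirical RKHS norm is r^T K^{-1} r. The Kernel Packet acceleration (sparse bases for Matérn kernels) is replaced here by a dense solve, and the Gaussian kernel is an assumption made only for brevity:

```python
import torch

def rkhs_residual_loss(residuals, X, lengthscale=0.5, jitter=1e-6):
    """residuals: [n] PDE residuals at the n collocation points X: [n, d]."""
    d2 = torch.cdist(X, X).pow(2)
    K = torch.exp(-0.5 * d2 / lengthscale**2)            # assumed Gaussian kernel
    K = K + jitter * torch.eye(len(X), device=X.device)  # numerical stabilizer
    alpha = torch.linalg.solve(K, residuals)
    return residuals @ alpha                             # r^T K^{-1} r
```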
zh
[AI-41] Robust Evolutionary Multi-Objective Network Architecture Search for Reinforcement Learning (EMNAS-RL)
【Quick Read】: This paper addresses optimizing neural network architectures in large-scale reinforcement learning (RL) to improve the performance of autonomous driving (AD) systems. The key to the solution is Evolutionary Multi-Objective Network Architecture Search (EMNAS), which automates network design with genetic algorithms, aiming to increase rewards and reduce model size without sacrificing performance. In addition, parallelization techniques and teacher-student methodologies are employed to accelerate the search and enable scalable optimization, effectively improving learning efficiency and stability.
Link: https://arxiv.org/abs/2506.08533
Authors: Nihal Acharya Adde, Alexandra Gianzina, Hanno Gottschalk, Andreas Ebert
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Published at ESANN 2025 Conference
Click to view abstract
Abstract:This paper introduces Evolutionary Multi-Objective Network Architecture Search (EMNAS) for the first time to optimize neural network architectures in large-scale Reinforcement Learning (RL) for Autonomous Driving (AD). EMNAS uses genetic algorithms to automate network design, tailored to enhance rewards and reduce model size without compromising performance. Additionally, parallelization techniques are employed to accelerate the search, and teacher-student methodologies are implemented to ensure scalable optimization. This research underscores the potential of transfer learning as a robust framework for optimizing performance across iterative learning processes by effectively leveraging knowledge from earlier generations to enhance learning efficiency and stability in subsequent generations. Experimental results demonstrate that tailored EMNAS outperforms manually designed models, achieving higher rewards with fewer parameters. These findings strengthen the case for EMNAS in RL for autonomous driving, advancing the field toward better-performing networks suitable for real-world scenarios.
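As a toy illustration of the evolutionary search pattern, the loop below evolves layer widths to trade off reward against parameter count. The encoding, the scalarized fitness (the paper is genuinely multi-objective), and the mutation scheme are all assumptions:

```python
import random

def evolve(eval_reward, generations=20, pop_size=16):
    """eval_reward(arch) -> float; arch is a list of layer widths (assumed encoding)."""
    widths = [64, 128, 256]
    pop = [[random.choice(widths) for _ in range(3)] for _ in range(pop_size)]
    fitness = lambda a: eval_reward(a) - 1e-6 * sum(a)   # reward vs. model size
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                   # truncation selection
        children = []
        for p in parents:
            child = p[:]
            child[random.randrange(len(child))] = random.choice(widths)  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```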
zh
[AI-42] Safe and Economical UAV Trajectory Planning in Low-Altitude Airspace: A Hybrid DRL-LLM Approach with Compliance Awareness
【Quick Read】: This paper tackles the challenges of UAV trajectory planning in complex urban environments under the low-altitude economy, where existing studies overlook key factors such as urban airspace constraints and economic efficiency. The key to the solution is a novel UAV trajectory-planning framework that combines deep reinforcement learning (DRL) with large language model (LLM) reasoning to achieve safe, compliant, and economically viable path planning.
Link: https://arxiv.org/abs/2506.08532
Authors: Yanwei Gong, Xiaolin Chang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:The rapid growth of the low-altitude economy has driven the widespread adoption of unmanned aerial vehicles (UAVs). This growing deployment presents new challenges for UAV trajectory planning in complex urban environments. However, existing studies often overlook key factors, such as urban airspace constraints and economic efficiency, which are essential in low-altitude economy contexts. Deep reinforcement learning (DRL) is regarded as a promising solution to these issues, while its practical adoption remains limited by low learning efficiency. To overcome this limitation, we propose a novel UAV trajectory planning framework that combines DRL with large language model (LLM) reasoning to enable safe, compliant, and economically viable path planning. Experimental results demonstrate that our method significantly outperforms existing baselines across multiple metrics, including data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency. These results validate the effectiveness of our approach in addressing the key challenges of UAV trajectory planning under low-altitude economy networking constraints.
zh
[AI-43] Teaching Physical Awareness to LLMs through Sounds ICML2025
【Quick Read】: This paper addresses the lack of physical awareness in Large Language Models (LLMs), i.e., their limited understanding of real-world physical phenomena. The key to the solution is endowing LLMs with physical awareness through sound: a physics-based simulator generates diverse training data, and an audio encoder that processes both magnitude and phase information is built, enabling effective understanding and modeling of physical phenomena in both simulated and real-world tasks.
Link: https://arxiv.org/abs/2506.08524
Authors: Weiguo Wang, Andy Nie, Wenrui Zhou, Yi Kai, Chengchen Hu
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
Comments: ICML 2025
Click to view abstract
Abstract:Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness--understanding of real-world physical phenomena. In this work, we present ACORN, a framework that teaches LLMs physical awareness through sound, focusing on fundamental physical phenomena like the Doppler effect, multipath effect, and spatial relationships. To overcome data scarcity, ACORN introduces a physics-based simulator combining real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for enabling LLMs to understand the physical world.
zh
[AI-44] MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning
【Quick Read】: This paper addresses the human bias and limited autonomy introduced by traditional multi-agent system (Mas) construction methods that rely on manually crafted interaction mechanisms or heuristic rules. The key to the solution is MasHost, a reinforcement learning (RL)-based framework that formulates Mas construction as a graph search problem and jointly samples agent roles and their interactions through a unified probabilistic sampling mechanism, achieving autonomous and query-adaptive Mas design. It further introduces component rationality as a novel design principle and adopts Hierarchical Relative Policy Optimization (HRPO) for multi-objective optimization.
Link: https://arxiv.org/abs/2506.08507
Authors: Kuo Yang, Xingjie Yang, Linhui Yu, Qing Xu, Yan Fang, Xu Wang, Zhengyang Zhou, Yang Wang
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Large Language Model (LLM)-driven Multi-agent systems (Mas) have recently emerged as a powerful paradigm for tackling complex real-world tasks. However, existing Mas construction methods typically rely on manually crafted interaction mechanisms or heuristic rules, introducing human biases and constraining the autonomous ability. Even with recent advances in adaptive Mas construction, existing systems largely remain within the paradigm of semi-autonomous patterns. In this work, we propose MasHost, a Reinforcement Learning (RL)-based framework for autonomous and query-adaptive Mas design. By formulating Mas construction as a graph search problem, our proposed MasHost jointly samples agent roles and their interactions through a unified probabilistic sampling mechanism. Beyond the accuracy and efficiency objectives pursued in prior works, we introduce component rationality as an additional and novel design principle in Mas. To achieve this multi-objective optimization, we propose Hierarchical Relative Policy Optimization (HRPO), a novel RL strategy that collaboratively integrates group-relative advantages and action-wise rewards. To our knowledge, our proposed MasHost is the first RL-driven framework for autonomous Mas graph construction. Extensive experiments on six benchmarks demonstrate that MasHost consistently outperforms most competitive baselines, validating its effectiveness, efficiency, and structure rationality.
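A rough sketch of a group-relative advantage in the spirit of HRPO: trajectory-level returns are normalized within a group sampled for the same query, and a per-action reward term is blended in. The additive combination rule and the beta weight below are assumptions; the paper's exact formulation is not reproduced:

```python
import torch

def hrpo_advantages(group_returns, action_rewards, beta=0.5):
    """group_returns: [G] one return per sampled Mas graph for the same query;
    action_rewards: [G, T] per-action rewards along each construction trajectory."""
    adv_group = (group_returns - group_returns.mean()) / (group_returns.std() + 1e-8)
    adv_action = (action_rewards - action_rewards.mean(0)) / (action_rewards.std(0) + 1e-8)
    return adv_group.unsqueeze(1) + beta * adv_action   # [G, T] combined advantages
```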
zh
[AI-45] Explaining Fast and Slow: Abstraction and Refinement of Provable Explanations ICML2025
【Quick Read】: This paper addresses two problems in post-hoc explainability for neural networks: most current methods rely on heuristics without formal guarantees, and the existing methods that do compute formally guaranteed explanations face scalability challenges. The key to the solution is a novel abstraction-refinement technique that abstracts the original network into a substantially reduced one, such that a sufficient explanation of the reduced network is provably sufficient for the original network, greatly accelerating verification; if the explanation of the reduced network is insufficient, the network size is gradually increased until convergence, yielding efficient, formally guaranteed explanations of neural network predictions.
Link: https://arxiv.org/abs/2506.08505
Authors: Shahaf Bassan, Yizhak Yisrael Elboher, Tobias Ladner, Matthias Althoff, Guy Katz
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: To appear in ICML 2025
Click to view abstract
Abstract:Despite significant advancements in post-hoc explainability techniques for neural networks, many current methods rely on heuristics and do not provide formally provable guarantees over the explanations provided. Recent work has shown that it is possible to obtain explanations with formal guarantees by identifying subsets of input features that are sufficient to determine that predictions remain unchanged using neural network verification techniques. Despite the appeal of these explanations, their computation faces significant scalability challenges. In this work, we address this gap by proposing a novel abstraction-refinement technique for efficiently computing provably sufficient explanations of neural network predictions. Our method abstracts the original large neural network by constructing a substantially reduced network, where a sufficient explanation of the reduced network is also provably sufficient for the original network, hence significantly speeding up the verification process. If the explanation is insufficient on the reduced network, we iteratively refine the network size by gradually increasing it until convergence. Our experiments demonstrate that our approach enhances the efficiency of obtaining provably sufficient explanations for neural network predictions while additionally providing a fine-grained interpretation of the network's predictions across different abstraction levels.
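The control flow of the abstraction-refinement loop can be sketched as below; `reduce_net`, `find_expl`, and `is_sufficient` are caller-supplied stand-ins for the paper's network-reduction operator and verification-backed explanation routines:

```python
def explain_with_refinement(network, x, sizes, reduce_net, find_expl, is_sufficient):
    for size in sorted(sizes):                # progressively larger abstractions
        small = reduce_net(network, size)     # abstract the original network
        expl = find_expl(small, x)            # candidate sufficient feature subset
        if is_sufficient(small, x, expl):
            # Sufficiency on the reduced network provably transfers to the
            # original network, so verification can stop early.
            return expl
    return find_expl(network, x)              # fall back to the full network
```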
zh
[AI-46] RHealthTwin: Towards Responsible and Multimodal Digital Twins for Personalized Well-being
【Quick Read】: This paper addresses the hallucination, bias, lack of transparency, and ethical misuse that arise when deploying large language model (LLM)-based digital twin systems in consumer health settings. The key to the solution is a responsible framework called RHealthTwin, whose core is the Responsible Prompt Engine (RPE): it dynamically extracts predefined slots to structure the inputs, guiding the language model to produce context-aware, personalized, fair, reliable, and explainable responses that improve the trustworthiness and safety of well-being assistance.
Link: https://arxiv.org/abs/2506.08486
Authors: Rahatara Ferdousi, M Anwar Hossain
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures, IEEE EMBS JBHI
Click to view abstract
Abstract:The rise of large language models (LLMs) has created new possibilities for digital twins in healthcare. However, the deployment of such systems in consumer health contexts raises significant concerns related to hallucination, bias, lack of transparency, and ethical misuse. In response to recommendations from health authorities such as the World Health Organization (WHO), we propose Responsible Health Twin (RHealthTwin), a principled framework for building and governing AI-powered digital twins for well-being assistance. RHealthTwin processes multimodal inputs that guide a health-focused LLM to produce safe, relevant, and explainable responses. At the core of RHealthTwin is the Responsible Prompt Engine (RPE), which addresses the limitations of traditional LLM configuration. Conventionally, users input an unstructured prompt and a system instruction to configure the LLM, which increases the risk of hallucination. In contrast, RPE extracts predefined slots dynamically to structure both inputs. This guides the language model to generate responses that are context aware, personalized, fair, reliable, and explainable for well-being assistance. The framework further adapts over time through a feedback loop that updates the prompt structure based on user satisfaction. We evaluate RHealthTwin across four consumer health domains including mental support, symptom triage, nutrition planning, and activity coaching. RPE achieves state-of-the-art results with BLEU = 0.41, ROUGE-L = 0.63, and BERTScore = 0.89 on benchmark datasets. Also, we achieve over 90% in ethical compliance and instruction-following metrics using LLM-as-judge evaluation, outperforming baseline strategies. We envision RHealthTwin as a forward-looking foundation for responsible LLM-based applications in health and well-being.
zh
[AI-47] How to Provably Improve Return Conditioned Supervised Learning?
【Quick Read】: This paper addresses the lack of the stitching property in Return-Conditioned Supervised Learning (RCSL) for offline decision-making tasks, whose performance is inherently limited by the quality of the policy that generated the dataset. The key innovation of the proposed Reinforced RCSL framework is the notion of the in-distribution optimal return-to-go: using the current state, it identifies the best achievable in-dataset future return, avoiding the need for complex return-augmentation techniques.
Link: https://arxiv.org/abs/2506.08463
Authors: Zhishuai Liu, Yu Yang, Ruhan Wang, Pan Xu, Dongruo Zhou
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 25 pages, 4 figures, 12 tables
Click to view abstract
Abstract:In sequential decision-making problems, Return-Conditioned Supervised Learning (RCSL) has gained increasing recognition for its simplicity and stability in modern decision-making tasks. Unlike traditional offline reinforcement learning (RL) algorithms, RCSL frames policy learning as a supervised learning problem by taking both the state and return as input. This approach eliminates the instability often associated with temporal difference (TD) learning in offline RL. However, RCSL has been criticized for lacking the stitching property, meaning its performance is inherently limited by the quality of the policy used to generate the offline dataset. To address this limitation, we propose a principled and simple framework called Reinforced RCSL. The key innovation of our framework is the introduction of a concept we call the in-distribution optimal return-to-go. This mechanism leverages our policy to identify the best achievable in-dataset future return based on the current state, avoiding the need for complex return augmentation techniques. Our theoretical analysis demonstrates that Reinforced RCSL can consistently outperform the standard RCSL approach. Empirical results further validate our claims, showing significant performance improvements across a range of benchmarks.
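A minimal sketch of conditioning RCSL on an "in-distribution optimal return-to-go": at state s the policy is conditioned on the best future return observed in the dataset from (approximately) that state, rather than on a user-chosen target. The nearest-neighbour lookup is an illustrative stand-in for the learned estimator described in the paper:

```python
import numpy as np

def optimal_rtg(state, data_states, data_rtgs, k=16):
    """data_states: [N, d] dataset states; data_rtgs: [N] return-to-go labels."""
    d = np.linalg.norm(data_states - state, axis=1)
    nn = np.argsort(d)[:k]                 # k closest in-dataset states
    return data_rtgs[nn].max()             # best achievable in-dataset return

# Acting: a = policy(state, optimal_rtg(state, data_states, data_rtgs))
```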
zh
[AI-48] Hybrid Reasoning for Perception Explanation and Autonomous Action in Manufacturing
【Quick Read】: This paper addresses the limited robustness and adaptability of control systems in industrial processes, where environments and tasks are unpredictable and operational errors are costly and hard to detect; conventional supervised-learning-based AI control depends on extensive labeled data and generalizes poorly in data-scarce industrial settings. The key to the solution is CIPHER, a vision-language-action (VLA) model framework that integrates a process expert, a regression model for the quantitative characterization of system states required by engineering tasks, together with retrieval-augmented generation for accessing external expert knowledge and supporting physics-informed chain-of-thought reasoning, achieving strong generalization to out-of-distribution tasks.
Link: https://arxiv.org/abs/2506.08462
Authors: Christos Margadji, Sebastian W. Pattinson
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:
Click to view abstract
Abstract:Industrial processes must be robust and adaptable, as environments and tasks are often unpredictable, while operational errors remain costly and difficult to detect. AI-based control systems offer a path forward, yet typically depend on supervised learning with extensive labelled datasets, which limits their ability to generalize across variable and data-scarce industrial settings. Foundation models could enable broader reasoning and knowledge integration, but rarely deliver the quantitative precision demanded by engineering applications. Here, we introduce Control and Interpretation of Production via Hybrid Expertise and Reasoning (CIPHER): a vision-language-action (VLA) model framework aiming to replicate human-like reasoning for industrial control, instantiated in a commercial-grade 3D printer. It integrates a process expert, a regression model enabling quantitative characterization of system states required for engineering tasks. CIPHER also incorporates retrieval-augmented generation to access external expert knowledge and support physics-informed, chain-of-thought reasoning. This hybrid architecture exhibits strong generalization to out-of-distribution tasks. It interprets visual or textual inputs from process monitoring, explains its decisions, and autonomously generates precise machine instructions, without requiring explicit annotations. CIPHER thus lays the foundations for autonomous systems that act with precision, reason with context, and communicate decisions transparently, supporting safe and trusted deployment in industrial settings.
zh
[AI-49] MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
【Quick Read】: This paper addresses off-dynamics offline reinforcement learning, i.e., learning a policy from offline datasets whose source and target domains have mismatched transition probabilities. Existing methods either filter source transitions that resemble the target domain or apply reward augmentation to source data, but both are constrained by the limited target-domain data, so the learned policy cannot explore effectively beyond the offline dataset. The key of the proposed MOBODY algorithm is to learn a dynamics model of the target domain and generate synthetic transitions that enable target-domain exploration; its core innovation is learning a shared latent representation of states and transitions from both source and target data, which avoids biasing the model toward source dynamics, combined with a Q-weighted behavior cloning loss that stabilizes policy training.
Link: https://arxiv.org/abs/2506.08460
Authors: Yihong Guo, Yu Yang, Pan Xu, Anqi Liu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Click to view abstract
Abstract:We study the off-dynamics offline reinforcement learning problem, where the goal is to learn a policy from offline datasets collected from source and target domains with mismatched transition. Existing off-dynamics offline RL methods typically either filter source transitions that resemble those of the target domain or apply reward augmentation to source data, both constrained by the limited transitions available from the target domain. As a result, the learned policy is unable to explore target domain beyond the offline datasets. We propose MOBODY, a Model-Based Off-Dynamics offline RL algorithm that addresses this limitation by enabling exploration of the target domain via learned dynamics. MOBODY generates new synthetic transitions in the target domain through model rollouts, which are used as data augmentation during offline policy learning. Unlike existing model-based methods that learn dynamics from a single domain, MOBODY tackles the challenge of mismatched dynamics by leveraging both source and target datasets. Directly merging these datasets can bias the learned model toward source dynamics. Instead, MOBODY learns target dynamics by discovering a shared latent representation of states and transitions across domains through representation learning. To stabilize training, MOBODY incorporates a behavior cloning loss that regularizes the policy. Specifically, we introduce a Q-weighted behavior cloning loss that regularizes the policy toward actions with high target-domain Q-values, rather than uniformly imitating all actions in the dataset. These Q-values are learned from an enhanced target dataset composed of offline target data, augmented source data, and rollout data from the learned target dynamics. We evaluate MOBODY on MuJoCo benchmarks and show that it significantly outperforms state-of-the-art baselines, with especially pronounced improvements in challenging scenarios.
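A sketch of the Q-weighted behavior cloning regularizer: actions with higher target-domain Q-values receive larger imitation weight, rather than all dataset actions being imitated uniformly. The softmax weighting, the temperature, and the `policy.log_prob` interface are assumptions in the spirit of advantage-weighted methods:

```python
import torch

def q_weighted_bc_loss(policy, q_net, states, actions, temperature=1.0):
    """states: [B, d_s]; actions: [B, d_a]; q_net(s, a) -> [B] Q estimates."""
    with torch.no_grad():
        q = q_net(states, actions)
        w = torch.softmax(q / temperature, dim=0)   # favor high-Q actions
    log_prob = policy.log_prob(states, actions)     # assumed policy interface
    return -(w * log_prob).sum()
```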
zh
[AI-50] Diffusion Models for Safety Validation of Autonomous Driving Systems
【Quick Read】: This paper addresses the difficulty of safety validation for autonomous driving systems, which is complicated by the high risk and cost of real-world testing and by the rarity and diversity of potential failures. The key to the solution is training a denoising diffusion model that, given any initial traffic state, generates potential failure cases of the autonomous vehicle. The method requires no external training dataset, can perform training and inference with modest computing resources, and assumes no prior knowledge of the system under test, making it applicable to safety validation at traffic intersections.
Link: https://arxiv.org/abs/2506.08459
Authors: Juanran Wang, Marc R. Schlichting, Harrison Delecki, Mykel J. Kochenderfer
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Safety validation of autonomous driving systems is extremely challenging due to the high risks and costs of real-world testing as well as the rarity and diversity of potential failures. To address these challenges, we train a denoising diffusion model to generate potential failure cases of an autonomous vehicle given any initial traffic state. Experiments on a four-way intersection problem show that in a variety of scenarios, the diffusion model can generate realistic failure samples while capturing a wide variety of potential failures. Our model does not require any external training dataset, can perform training and inference with modest computing resources, and does not assume any prior knowledge of the system under test, with applicability to safety validation for traffic intersections.
zh
[AI-51] Time-Aware World Model for Adaptive Prediction and Control ICML2025
【Quick Read】: This paper addresses how to model temporal dynamics effectively in control tasks so as to improve model performance and data efficiency. Conventional approaches sample at a fixed time step, without accounting for how the system dynamics determine the optimal sampling rate. The key to the solution is the Time-Aware World Model (TAWM), which conditions on the time-step size Δt and trains over a diverse range of Δt values, learning both high- and low-frequency task dynamics across control problems based on an information-theoretic insight, thereby improving performance and data efficiency.
Link: https://arxiv.org/abs/2506.08441
Authors: Anh N. Nhu, Sanghyun Son, Ming Lin
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Paper accepted to ICML 2025
Click to view abstract
Abstract:In this work, we introduce the Time-Aware World Model (TAWM), a model-based approach that explicitly incorporates temporal dynamics. By conditioning on the time-step size, \Delta t , and training over a diverse range of \Delta t values – rather than sampling at a fixed time-step – TAWM learns both high- and low-frequency task dynamics across diverse control problems. Grounded in the information-theoretic insight that the optimal sampling rate depends on a system's underlying dynamics, this time-aware formulation improves both performance and data efficiency. Empirical evaluations show that TAWM consistently outperforms conventional models across varying observation rates in a variety of control tasks, using the same number of training samples and iterations. Our code can be found online at: this http URL.
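A minimal sketch of Δt-conditioned world-model training: the time-step size is sampled per batch and fed to the dynamics model as an explicit input, so a single model covers both high- and low-frequency dynamics. The log-uniform Δt range and the `next_states_fn` environment interface are assumptions:

```python
import torch

def tawm_loss(model, states, actions, next_states_fn, dt_range=(1e-3, 1e-1)):
    b = states.shape[0]
    log_lo = torch.log(torch.tensor(dt_range[0]))
    log_hi = torch.log(torch.tensor(dt_range[1]))
    dt = torch.exp(log_lo + (log_hi - log_lo) * torch.rand(b, 1))  # sample Δt
    pred = model(states, actions, dt)               # Δt is an explicit model input
    target = next_states_fn(states, actions, dt)    # environment stepped by Δt
    return ((pred - target) ** 2).mean()
```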
zh
[AI-52] HASFL: Heterogeneity-aware Split Federated Learning over Edge Computing Systems
【Quick Read】: This paper addresses the straggler effect in federated learning (FL) on edge devices caused by heterogeneous device resources. The key to the solution is adaptively controlling the batch size (BS) and model splitting (MS) so as to balance communication/computation latency against training convergence, thereby improving overall learning performance. The paper first derives a tight convergence bound for SFL and, based on it, proposes the Heterogeneity-Aware SFL (HASFL) framework that dynamically adjusts BS and MS.
Link: https://arxiv.org/abs/2506.08426
Authors: Zheng Lin, Zhe Chen, Xianhao Chen, Wei Ni, Yue Gao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 16 pages, 11 figures. arXiv admin note: text overlap with arXiv:2403.13101
Click to view abstract
Abstract:Split federated learning (SFL) has emerged as a promising paradigm to democratize machine learning (ML) on edge devices by enabling layer-wise model partitioning. However, existing SFL approaches suffer significantly from the straggler effect due to the heterogeneous capabilities of edge devices. To address the fundamental challenge, we propose adaptively controlling batch sizes (BSs) and model splitting (MS) for edge devices to overcome resource heterogeneity. We first derive a tight convergence bound of SFL that quantifies the impact of varied BSs and MS on learning performance. Based on the convergence bound, we propose HASFL, a heterogeneity-aware SFL framework capable of adaptively controlling BS and MS to balance communication-computing latency and training convergence in heterogeneous edge networks. Extensive experiments with various datasets validate the effectiveness of HASFL and demonstrate its superiority over state-of-the-art benchmarks.
zh
[AI-53] SHIELD: Multi-task Multi-distribution Vehicle Routing Solver with Sparsity and Hierarchy ICML
【Quick Read】: This paper argues that traditional foundation models for the Vehicle Routing Problem (VRP) ignore the complex customer distributions of the real world, and proposes the more realistic Multi-Task Multi-Distribution VRP (MTMDVRP) setting. The key to the solution is SHIELD, a model that combines the principles of sparsity and hierarchy: it integrates the Mixture-of-Depths (MoD) technique into a deeper decoder architecture for dynamic node selection, improving efficiency and generalization, and designs a context-based clustering layer that exploits the hierarchical structure present in the problems to produce better local representations, strengthening the model's ability to learn task- and distribution-specific as well as shared representations.
Link: https://arxiv.org/abs/2506.08424
Authors: Yong Liang Goh, Zhiguang Cao, Yining Ma, Jianan Zhou, Mohammad Haroon Dupty, Wee Sun Lee
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted in the 42nd International Conference of Machine Learning (ICML)
Click to view abstract
Abstract:Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity. This improves both efficiency and generalization by allowing the model to dynamically select nodes to use or skip each decoder layer, providing the needed capacity to adaptively allocate computation for learning the task/distribution specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each.
zh
[AI-54] Transforming Expert Knowledge into Scalable Ontology via Large Language Models
【Quick Read】: This paper addresses the unification and coherence of knowledge representation in domain-specific applications, in particular how to align different terminologies with the relationships among their underlying concepts efficiently and accurately. Traditional manual approaches rely on expert review of concept pairs, which is expensive, time-consuming, and inconsistent due to subjective interpretation; existing automated methods struggle with nuanced semantic relationships and cross-domain consistency. The key to the solution is combining large language models (LLMs) with expert calibration and iterative prompt optimization: expert-labeled examples, multi-stage prompt engineering, and human validation guide the LLMs to generate taxonomy linkages together with supporting rationales, achieving high-quality automated alignment.
Link: https://arxiv.org/abs/2506.08422
Authors: Ikkei Itoku, David Theil, Evelyn Eichelsdoerfer Uehara, Sreyoshi Bhaduri, Junnosuke Kuroda, Toshi Yumoto, Alex Gil, Natalie Perez, Rajesh Cherukuri, Naumaan Nayyar
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Having a unified, coherent taxonomy is essential for effective knowledge representation in domain-specific applications as diverse terminologies need to be mapped to underlying concepts. Traditional manual approaches to taxonomy alignment rely on expert review of concept pairs, but this becomes prohibitively expensive and time-consuming at scale, while subjective interpretations often lead to expert disagreements. Existing automated methods for taxonomy alignment have shown promise but face limitations in handling nuanced semantic relationships and maintaining consistency across different domains. These approaches often struggle with context-dependent concept mappings and lack transparent reasoning processes. We propose a novel framework that combines large language models (LLMs) with expert calibration and iterative prompt optimization to automate taxonomy alignment. Our method integrates expert-labeled examples, multi-stage prompt engineering, and human validation to guide LLMs in generating both taxonomy linkages and supporting rationales. In evaluating our framework on a domain-specific mapping task of concept essentiality, we achieved an F1-score of 0.97, substantially exceeding the human benchmark of 0.68. These results demonstrate the effectiveness of our approach in scaling taxonomy alignment while maintaining high-quality mappings and preserving expert oversight for ambiguous cases.
zh
[AI-55] Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood ICLR2025
【Quick Read】: This paper addresses the Q-value overestimation caused by distributional shift in offline reinforcement learning (RL), in particular the inaccurate Q-value estimates for out-of-distribution (OOD) actions. Existing methods mitigate the issue by imposing constraints but are often overly conservative when evaluating OOD regions, which limits the generalization of the Q-function. The key to the solution is a Smooth Bellman Operator (SBO) built on the Convex Hull and its Neighborhood (CHN): it smooths OOD Q-values with neighboring in-sample Q-values, improving the Q-function's generalization in OOD regions. Theoretical analysis shows that SBO approximates the true Q-values within the CHN, enabling more accurate Q-value estimation.
Link: https://arxiv.org/abs/2506.08417
Authors: Qingmao Yao, Zhichao Lei, Tianyuan Chen, Ziyue Yuan, Xuefan Chen, Jianxiang Liu, Faguo Wu, Xiao Zhang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2025
Click to view abstract
Abstract:Offline Reinforcement Learning (RL) struggles with distributional shifts, leading to the Q-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains the Q-function generalization. This over-constraint issue results in poor Q-value estimation and hinders policy improvement. In this paper, we introduce a novel approach to achieve better Q-value estimation by enhancing Q-function generalization in OOD regions within Convex Hull and its Neighborhood (CHN). Under the safety generalization guarantees of the CHN, we propose the Smooth Bellman Operator (SBO), which updates OOD Q-values by smoothing them with neighboring in-sample Q-values. We theoretically show that SBO approximates true Q-values for both in-sample and OOD actions within the CHN. Our practical algorithm, Smooth Q-function OOD Generalization (SQOG), empirically alleviates the over-constraint issue, achieving near-accurate Q-value estimation. On the D4RL benchmarks, SQOG outperforms existing state-of-the-art methods in both performance and computational efficiency.
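A sketch of the smoothing idea behind SBO: the target for an OOD action is formed by smoothing the Q-values of neighbouring in-sample actions, instead of being pessimistically clamped. The Gaussian weighting over the k nearest dataset actions is an illustrative choice, not the paper's exact operator:

```python
import torch

def smoothed_ood_q(a_ood, data_actions, data_q, k=8, bandwidth=0.1):
    """a_ood: [d] OOD action; data_actions: [N, d]; data_q: [N] in-sample Q values."""
    dist = torch.norm(data_actions - a_ood, dim=1)
    nn = torch.topk(-dist, k).indices                   # k nearest in-sample actions
    w = torch.softmax(-dist[nn] ** 2 / bandwidth, dim=0)
    return (w * data_q[nn]).sum()                       # smoothed target Q-value
```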
zh
[AI-56] Single-Node Trigger Backdoor Attacks in Graph-Based Recommendation Systems
【Quick Read】: This paper addresses the vulnerability of graph recommendation systems to attacks, in particular the low stealth and high destructiveness of prevailing shilling attack methods. The key to the solution is a novel graph backdoor attack that designs a single-node trigger generator: by inserting only one fake user node, it can effectively raise the exposure of multiple target items to the target user, while constraints between target nodes and unrelated nodes are introduced to reduce the fake node's impact on the recommendation system's performance.
Link: https://arxiv.org/abs/2506.08401
Authors: Runze Li, Di Jin, Xiaobao Wang, Dongxiao He, Bingdao Feng, Zhen Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Graph recommendation systems have been widely studied due to their ability to effectively capture the complex interactions between users and items. However, these systems also exhibit certain vulnerabilities when faced with attacks. The prevailing shilling attack methods typically manipulate recommendation results by injecting a large number of fake nodes and edges. However, such attack strategies face two primary challenges: low stealth and high destructiveness. To address these challenges, this paper proposes a novel graph backdoor attack method that aims to enhance the exposure of target items to the target user in a covert manner, without affecting other unrelated nodes. Specifically, we design a single-node trigger generator, which can effectively expose multiple target items to the target user by inserting only one fake user node. Additionally, we introduce constraint conditions between the target nodes and irrelevant nodes to mitigate the impact of fake nodes on the recommendation system’s performance. Experimental results show that the exposure of the target items reaches no less than 50% in 99% of the target users, while the impact on the recommendation system’s performance is controlled within approximately 5%.
zh
[AI-57] SafeCoT: Improving VLM Safety with Minimal Reasoning
【Quick Read】: This paper addresses the failure of vision-language models (VLMs) to generate safe and appropriate responses in high-risk or ambiguous scenarios. The key to the solution is SafeCoT, a framework that uses rule-based chain-of-thought (CoT) supervision to improve the model's reasoning about safety risks with minimal supervision and to enable context-aware refusal behavior.
Link: https://arxiv.org/abs/2506.08399
Authors: Jiachen Ma, Zhanhui Zhou, Chao Yang, Chaochao Lu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Ensuring safe and appropriate responses from vision-language models (VLMs) remains a critical challenge, particularly in high-risk or ambiguous scenarios. We introduce SafeCoT, a lightweight, interpretable framework that leverages rule-based chain-of-thought (CoT) supervision to improve refusal behavior in VLMs. Unlike prior methods that rely on large-scale safety annotations or complex modeling, SafeCoT uses minimal supervision to help models reason about safety risks and make context-aware refusals. Experiments across multiple benchmarks show that SafeCoT significantly reduces overrefusal and enhances generalization, even with limited training data. Our approach offers a scalable solution for aligning VLMs with safety-critical objectives.
zh
[AI-58] Spatiotemporal deep learning models for detection of rapid intensification in cyclones
【Quick Read】: This paper addresses the class imbalance in detecting cyclone rapid intensification and the limitations of conventional machine learning models on such extreme events. The key to the solution is a deep learning and data augmentation framework: deep learning models generate synthetic data with spatiotemporal features to supplement the rare samples, and a classification module distinguishes rapid-intensification from non-rapid-intensification events, improving detection performance.
Link: https://arxiv.org/abs/2506.08397
Authors: Vamshika Sutar, Amandeep Singh, Rohitash Chandra
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Click to view abstract
Abstract:Cyclone rapid intensification is the rapid increase in cyclone wind intensity, exceeding a threshold of 30 knots, within 24 hours. Rapid intensification is considered an extreme event during a cyclone, and its occurrence is relatively rare, contributing to a class imbalance in the dataset. A diverse array of factors influences the likelihood of a cyclone undergoing rapid intensification, further complicating the task for conventional machine learning models. In this paper, we evaluate deep learning, ensemble learning and data augmentation frameworks to detect cyclone rapid intensification based on wind intensity and spatial coordinates. We note that conventional data augmentation methods cannot be utilised for generating spatiotemporal patterns replicating cyclones that undergo rapid intensification. Therefore, our framework employs deep learning models to generate spatial coordinates and wind intensity that replicate cyclones to address the class imbalance problem of rapid intensification. We also use a deep learning model for the classification module within the data augmentation framework to differentiate between rapid and non-rapid intensification events during a cyclone. Our results show that data augmentation improves the results for rapid intensification detection in cyclones, and spatial coordinates play a critical role as input features to the given models. This paves the way for research in synthetic data generation for spatiotemporal data with extreme events.
zh
[AI-59] On Reasoning Strength Planning in Large Reasoning Models
【Quick Read】: This paper investigates the mechanism behind large reasoning models (LRMs) automatically allocating more reasoning strength (i.e., more reasoning tokens) when handling harder problems, that is, how to understand the difficulty-awareness of LRMs and its contribution to task performance. The key to the solution is explaining this phenomenon from the perspective of model activations: LRMs pre-plan the reasoning strength in their activations before generation begins, and the reasoning strength is causally controlled by the magnitude of a pre-allocated directional vector, which enables both prediction and control of the number of reasoning tokens.
Link: https://arxiv.org/abs/2506.08390
Authors: Leheng Sheng, An Zhang, Zijian Wu, Weixiang Zhao, Changshuo Shen, Yi Zhang, Xiang Wang, Tat-Seng Chua
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Recent studies empirically reveal that large reasoning models (LRMs) can automatically allocate more reasoning strengths (i.e., the number of reasoning tokens) for harder problems, exhibiting difficulty-awareness for better task performance. While this automatic reasoning strength allocation phenomenon has been widely observed, its underlying mechanism remains largely unexplored. To this end, we provide explanations for this phenomenon from the perspective of model activations. We find evidence that LRMs pre-plan the reasoning strengths in their activations even before generation, with this reasoning strength causally controlled by the magnitude of a pre-allocated directional vector. Specifically, we show that the number of reasoning tokens is predictable solely based on the question activations using linear probes, indicating that LRMs estimate the required reasoning strength in advance. We then uncover that LRMs encode this reasoning strength through a pre-allocated directional vector embedded in the activations of the model, where the vector's magnitude modulates the reasoning strength. Subtracting this vector can lead to reduced reasoning token number and performance, while adding this vector can lead to increased reasoning token number and even improved performance. We further reveal that this direction vector consistently yields positive reasoning length prediction, and it modifies the logits of the end-of-reasoning token </think> to affect the reasoning length. Finally, we demonstrate two potential applications of our findings: overthinking behavior detection and enabling efficient reasoning on simple problems. Our work provides new insights into the internal mechanisms of reasoning in LRMs and offers practical tools for controlling their reasoning behaviors. Our code is available at this https URL.
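A sketch of steering reasoning strength with such a direction vector: adding alpha * v to the residual-stream activations lengthens reasoning, subtracting it shortens reasoning. Obtaining v (e.g., from a linear probe on question activations) is outside this snippet, and the hook layout is an assumption about the model's module structure:

```python
import torch

def add_steering_hook(block, v, alpha):
    """block: a transformer layer; v: [hidden] unit direction; alpha: strength."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * v                # shift every position along v
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return block.register_forward_hook(hook)

# alpha > 0: more reasoning tokens; alpha < 0: fewer. Undo with handle.remove().
```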
zh
[AI-60] FloorplanMAE: A self-supervised framework for complete floorplan generation from partial inputs
【Quick Read】: This paper addresses predicting a complete floorplan from a partially missing one, a capability of significant value in the architectural design process that can improve design efficiency and reduce the workload of repeated revisions. The key to the solution is FloorplanMAE, a self-supervised learning framework whose core is the Masked Autoencoders (MAE) technique: sections of the floorplan are masked and a lightweight Vision Transformer (ViT) is trained to recover the missing parts, thereby reconstructing incomplete floorplans.
Link: https://arxiv.org/abs/2506.08363
Authors: Jun Yin, Jing Zhong, Pengyu Zeng, Peilin Li, Miao Zhang, Ran Luo, Shuai Lu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:In the architectural design process, floorplan design is often a dynamic and iterative process. Architects progressively draw various parts of the floorplan according to their ideas and requirements, continuously adjusting and refining throughout the design process. Therefore, the ability to predict a complete floorplan from a partial one holds significant value in the design process. Such prediction can help architects quickly generate preliminary designs, improve design efficiency, and reduce the workload associated with repeated modifications. To address this need, we propose FloorplanMAE, a self-supervised learning framework for restoring incomplete floor plans into complete ones. First, we developed a floor plan reconstruction dataset, FloorplanNet, specifically trained on architectural floor plans. Secondly, we propose a floor plan reconstruction method based on Masked Autoencoders (MAE), which reconstructs missing parts by masking sections of the floor plan and training a lightweight Vision Transformer (ViT). We evaluated the reconstruction accuracy of FloorplanMAE and compared it with state-of-the-art benchmarks. Additionally, we validated the model using real sketches from the early stages of architectural design. Experimental results show that the FloorplanMAE model can generate high-quality complete floor plans from incomplete partial plans. This framework provides a scalable solution for floor plan generation, with broad application prospects.
zh
[AI-61] MD-ViSCo: A Unified Model for Multi-Directional Vital Sign Waveform Conversion
【Quick Read】: This paper addresses the limitation that existing deep learning models work only for a specific source-target waveform pair, requiring distinct model architectures, optimization procedures, and preprocessing pipelines, which limits usability in clinical settings. The key to the solution is the Multi-Directional Vital-Sign Converter (MD-ViSCo), a unified framework that can generate any target waveform, such as electrocardiogram (ECG), photoplethysmogram (PPG), or arterial blood pressure (ABP), from any single input waveform with one model. MD-ViSCo employs a shallow 1-D U-Net integrated with a Swin Transformer and uses Adaptive Instance Normalization (AdaIN) to capture distinct waveform styles.
Link: https://arxiv.org/abs/2506.08357
Authors: Franck Meyer, Kyunghoon Hur, Edward Choi
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Main paper (16 pages, 5 figures). Paper submitted for review. Code available at this https URL
Click to view abstract
Abstract:Despite the remarkable progress of deep-learning methods generating a target vital sign waveform from a source vital sign waveform, most existing models are designed exclusively for a specific source-to-target pair. This requires distinct model architectures, optimization procedures, and pre-processing pipelines, resulting in multiple models that hinder usability in clinical settings. To address this limitation, we propose the Multi-Directional Vital-Sign Converter (MD-ViSCo), a unified framework capable of generating any target waveform such as electrocardiogram (ECG), photoplethysmogram (PPG), or arterial blood pressure (ABP) from any single input waveform with a single model. MD-ViSCo employs a shallow 1-Dimensional U-Net integrated with a Swin Transformer that leverages Adaptive Instance Normalization (AdaIN) to capture distinct waveform styles. To evaluate the efficacy of MD-ViSCo, we conduct multi-directional waveform generation on two publicly available datasets. Our framework surpasses state-of-the-art baselines (NabNet & PPG2ABP) on average across all waveform types, lowering mean absolute error (MAE) by 8.8% and improving Pearson correlation (PC) by 4.9% over two datasets. In addition, the generated ABP waveforms satisfy the Association for the Advancement of Medical Instrumentation (AAMI) criterion and achieve Grade B on the British Hypertension Society (BHS) standard, outperforming all baselines. By eliminating the need for developing a distinct model for each task, we believe that this work offers a unified framework that can deal with any kind of vital sign waveform with a single model in healthcare monitoring.
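AdaIN itself is a standard operation: the content features are re-normalized to the channel-wise statistics of the style features. A minimal sketch, assuming feature maps shaped for a 1-D convolutional pipeline as used here:

```python
import torch

def adain(content, style, eps=1e-5):
    """content, style: [B, C, T] feature maps from the 1-D U-Net encoder."""
    c_mean, c_std = content.mean(-1, keepdim=True), content.std(-1, keepdim=True)
    s_mean, s_std = style.mean(-1, keepdim=True), style.std(-1, keepdim=True)
    # Remove the content's own statistics, then impose the style's.
    return s_std * (content - c_mean) / (c_std + eps) + s_mean
```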
zh
[AI-62] Re4MPC: Reactive Nonlinear MPC for Multi-model Motion Planning via Deep Reinforcement Learning
【Quick Read】: This paper addresses the prohibitive computational cost of traditional motion planning for high-degree-of-freedom robots (such as mobile manipulators) in real-world settings. The key to the solution is Re4MPC, a multi-model motion-planning pipeline that computes trajectories with nonlinear model predictive control (NMPC) and uses a deep reinforcement learning (DRL) framework to reactively select the model, cost, and constraints of the NMPC problem depending on task complexity and robot state, optimizing computational efficiency across tasks of varying difficulty.
Link: https://arxiv.org/abs/2506.08344
Authors: Neşet Ünver Akmandor, Sarvesh Prajapati, Mark Zolotas, Taşkın Padır
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted to the 2025 IEEE International Conference on Automation Science and Engineering (CASE)
Click to view abstract
Abstract:Traditional motion planning methods for robots with many degrees-of-freedom, such as mobile manipulators, are often computationally prohibitive for real-world settings. In this paper, we propose a novel multi-model motion planning pipeline, termed Re4MPC, which computes trajectories using Nonlinear Model Predictive Control (NMPC). Re4MPC generates trajectories in a computationally efficient manner by reactively selecting the model, cost, and constraints of the NMPC problem depending on the complexity of the task and robot state. The policy for this reactive decision-making is learned via a Deep Reinforcement Learning (DRL) framework. We introduce a mathematical formulation to integrate NMPC into this DRL framework. To validate our methodology and design choices, we evaluate DRL training and test outcomes in a physics-based simulation involving a mobile manipulator. Experimental results demonstrate that Re4MPC is more computationally efficient and achieves higher success rates in reaching end-effector goals than the NMPC baseline, which computes whole-body trajectories without our learning mechanism.
zh
[AI-63] Your Agent Can Defend Itself against Backdoor Attacks
【Quick Read】: This paper addresses a critical security risk that large language model (LLM)-powered agents face during training and fine-tuning: backdoor attacks that make an agent execute malicious operations when specific triggers appear. The key of the proposed defense, ReAgent, is exploiting the agent's own behavioral inconsistencies to detect potential backdoors via two levels of checks: at the execution level, it verifies consistency between the agent's thoughts and actions; at the planning level, it uses the agent's ability to reconstruct the instruction from its thought trajectory, checking consistency between the reconstructed instruction and the original user instruction.
Link: https://arxiv.org/abs/2506.08336
Authors: Li Changjiang, Liang Jiacheng, Cao Bochuan, Chen Jinghui, Wang Ting
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Despite their growing adoption across domains, large language model (LLM)-powered agents face significant security risks from backdoor attacks during training and fine-tuning. These compromised agents can subsequently be manipulated to execute malicious operations when presented with specific triggers in their inputs or environments. To address this pressing risk, we present ReAgent, a novel defense against a range of backdoor attacks on LLM-based agents. Intuitively, backdoor attacks often result in inconsistencies among the user’s instruction, the agent’s planning, and its execution. Drawing on this insight, ReAgent employs a two-level approach to detect potential backdoors. At the execution level, ReAgent verifies consistency between the agent’s thoughts and actions; at the planning level, ReAgent leverages the agent’s capability to reconstruct the instruction based on its thought trajectory, checking for consistency between the reconstructed instruction and the user’s instruction. Extensive evaluation demonstrates ReAgent’s effectiveness against various backdoor attacks across tasks. For instance, ReAgent reduces the attack success rate by up to 90% in database operation tasks, outperforming existing defenses by large margins. This work reveals the potential of utilizing compromised agents themselves to mitigate backdoor risks.
zh
[AI-64] ORFS-agent: Tool-Using Agents for Chip Design Optimization
【Quick Read】: This paper addresses the complexity of parameter tuning in integrated-circuit design flows, where traditional approaches such as Bayesian optimization fall short in resource efficiency and design metrics for high-dimensional optimization tasks. The key to the solution is ORFS-agent, a large language model (LLM)-based iterative optimization agent that adaptively explores parameter configurations for more efficient tuning, delivering clear improvements in routed wirelength and effective clock period across multiple technology nodes and circuit benchmarks while using fewer optimization iterations.
Link: https://arxiv.org/abs/2506.08332
Authors: Amur Ghose, Andrew B. Kahng, Sayak Kundu, Zhiang Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Machine learning has been widely used to optimize complex engineering workflows across numerous domains. In the context of integrated circuit design, modern flows (e.g., going from a register-transfer level netlist to physical layouts) involve extensive configuration via thousands of parameters, and small changes to these parameters can have large downstream impacts on desired outcomes - namely design performance, power, and area. Recent advances in Large Language Models (LLMs) offer new opportunities for learning and reasoning within such high-dimensional optimization tasks. In this work, we introduce ORFS-agent, an LLM-based iterative optimization agent that automates parameter tuning in an open-source hardware design flow. ORFS-agent adaptively explores parameter configurations, demonstrating clear improvements over standard Bayesian optimization approaches in terms of resource efficiency and final design metrics. Our empirical evaluations on two different technology nodes and a range of circuit benchmarks indicate that ORFS-agent can improve both routed wirelength and effective clock period by over 13%, all while using 40% fewer optimization iterations. Moreover, by following natural language objectives to trade off certain metrics for others, ORFS-agent demonstrates a flexible and interpretable framework for multi-objective optimization. Crucially, ORFS-agent is modular and model-agnostic, and can be plugged into any frontier LLM without any further fine-tuning.
zh
[AI-65] Graph Prompting for Graph Learning Models: Recent Advances and Future Directions KDD2025
【Quick Read】: This paper presents a systematic review of recent advances in graph prompting for graph learning models, addressing how to improve performance on downstream tasks by learning trainable prompts while keeping the pre-trained graph learning model unchanged. The key lies in designing effective learnable prompting mechanisms that strengthen task-specific expressiveness during the adaptation phase while preserving the general feature representations acquired during pre-training.
Link: https://arxiv.org/abs/2506.08326
Authors: Xingbo Fu, Zehong Wang, Zihan Chen, Jiazheng Li, Yaochen Zhu, Zhenyu Lei, Cong Shen, Yanfang Ye, Chuxu Zhang, Jundong Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2025 Tutorial/Survey Track
Click to view abstract
Abstract:Graph learning models have demonstrated great prowess in learning expressive representations from large-scale graph data in a wide variety of real-world scenarios. As a prevalent strategy for training powerful graph learning models, the “pre-training, adaptation” scheme first pre-trains graph learning models on unlabeled graph data in a self-supervised manner and then adapts them to specific downstream tasks. During the adaptation phase, graph prompting emerges as a promising approach that learns trainable prompts while keeping the pre-trained graph learning models unchanged. In this paper, we present a systematic review of recent advancements in graph prompting. First, we introduce representative graph pre-training methods that serve as the foundation step of graph prompting. Next, we review mainstream techniques in graph prompting and elaborate on how they design learnable prompts for graph prompting. Furthermore, we summarize the real-world applications of graph prompting from different domains. Finally, we discuss several open challenges in existing studies with promising future directions in this field.
zh
[AI-66] LeanTutor: A Formally-Verified AI Tutor for Mathematical Proofs
【Quick Read】: This paper addresses automated tutoring and feedback generation in teaching mathematical proofs, specifically building an LLM-based system that interacts with students in natural language, verifies mathematical proofs, generates correct next-step tactics, and provides appropriate pedagogical guidance. The key to the solution lies in LeanTutor's three modules: an autoformalizer/proof-checker, a next-step generator, and a natural language feedback generator, where the autoformalization module translates the student's informal proof into Lean and verifies its accuracy, providing the basis for subsequent step generation and feedback.
Link: https://arxiv.org/abs/2506.08321
Authors: Manooshree Patel, Rayna Bhattacharyya, Thomas Lu, Arnav Mehta, Niels Voss, Narges Norouzi, Gireeja Ranade
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Click to view abstract
Abstract:We present LeanTutor, a Large Language Model (LLM)-based tutoring system for math proofs. LeanTutor interacts with the student in natural language, formally verifies student-written math proofs in Lean, generates correct next steps, and provides the appropriate instructional guidance. LeanTutor is composed of three modules: (i) an autoformalizer/proof-checker, (ii) a next-step generator, and (iii) a natural language feedback generator. The first module faithfully autoformalizes student proofs into Lean and verifies proof accuracy via successful code compilation. If the proof has an error, the incorrect step is identified. The next-step generator module outputs a valid next Lean tactic for incorrect proofs via LLM-based candidate generation and proof search. The feedback generator module leverages Lean data to produce a pedagogically-motivated natural language hint for the student user. To evaluate our system, we introduce PeanoBench, a human-written dataset derived from the Natural Numbers Game, consisting of 371 Peano Arithmetic proofs, where each natural language proof step is paired with the corresponding logically equivalent tactic in Lean. The Autoformalizer correctly formalizes 57% of tactics in correct proofs and accurately identifies the incorrect step in 30% of incorrect proofs. In generating natural language hints for erroneous proofs, LeanTutor outperforms a simple baseline on accuracy and relevance metrics.
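To give a flavour of the kind of natural-language-step-to-tactic pairing PeanoBench describes, here is a small Lean 4 proof of a Peano-arithmetic lemma in the spirit of the Natural Numbers Game, with one tactic per informal proof step; this example is assumed for illustration and is not taken from PeanoBench itself:

```lean
-- Each tactic line would correspond to one natural-language proof step.
theorem succ_add (a b : Nat) : Nat.succ a + b = Nat.succ (a + b) := by
  induction b with
  | zero => rfl                                      -- "succ a + 0 = succ a"
  | succ n ih => rw [Nat.add_succ, ih, Nat.add_succ] -- unfold, apply IH, refold
```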
zh
[AI-67] How Good LLM-Generated Password Policies Are?
【Quick Read】: This paper addresses the security risks posed by the inconsistency and unpredictability of large language model (LLM) outputs when generative AI (Generative AI) is applied to cybersecurity access-control systems, in particular the consistency and accuracy of LLM-generated password policies. The key of the solution is evaluating the quality of LLM-generated configuration files under two approaches: generating configuration files from natural-language prompts alone, and generating them with the official documentation provided as an informative baseline, then systematically assessing the soundness, accuracy, and consistency of the generated configurations.
Link: https://arxiv.org/abs/2506.08320
Authors: Vivek Vaidya, Aditya Patwardhan, Ashish Kundu
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 Tables, 9 figures, 3 Algorithms
Click to view abstract
Abstract:Generative AI technologies, particularly Large Language Models (LLMs), are rapidly being adopted across industry, academia, and government sectors, owing to their remarkable capabilities in natural language processing. However, despite their strengths, the inconsistency and unpredictability of LLM outputs present substantial challenges, especially in security-critical domains such as access control. One critical issue that emerges prominently is the consistency of LLM-generated responses, which is paramount for ensuring secure and reliable operations. In this paper, we study the application of LLMs within the context of Cybersecurity Access Control Systems. Specifically, we investigate the consistency and accuracy of LLM-generated password policies, translating natural language prompts into executable this http URL configuration files. Our experimental methodology adopts two distinct approaches: firstly, we utilize pre-trained LLMs to generate configuration files purely from natural language prompts without additional guidance. Secondly, we provide these models with official this http URL documentation to serve as an informative baseline. We systematically assess the soundness, accuracy, and consistency of these AI-generated configurations. Our findings underscore significant challenges in the current generation of LLMs and contribute valuable insights into refining the deployment of LLMs in Access Control Systems.
zh
[AI-68] Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study
Quick Read: This paper addresses the poorly understood behavior of software engineering (SWE) agents that automate software tasks: the lack of insight into their internal decision-making limits agent reliability and efficiency. The key to the solution is a systematic study of execution traces. The authors propose the first taxonomy of decision-making pathways across five representative agents, identify the core components behind agent success (bug localization, patch generation, and reproduction test generation), and analyze how test generation influences successful patch production and which strategies make test generation effective.
Link: https://arxiv.org/abs/2506.08311
Authors: Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, Baishakhi Ray
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the advent of large language models (LLMs), software engineering agents (SWE agents) have emerged as a powerful paradigm for automating a range of software tasks – from code generation and repair to test case synthesis. These agents operate autonomously by interpreting user input and responding to environmental feedback. While various agent architectures have demonstrated strong empirical performance, the internal decision-making workflows that drive their behavior remain poorly understood. Deeper insight into these workflows holds promise for improving both agent reliability and efficiency. In this work, we present the first systematic study of SWE agent behavior through the lens of execution traces. Our contributions are as follows: (1) we propose the first taxonomy of decision-making pathways across five representative agents; (2) using this taxonomy, we identify three core components essential to agent success – bug localization, patch generation, and reproduction test generation – and study each in depth; (3) we study the impact of test generation on successful patch production; and analyze strategies that can lead to successful test generation; (4) we further conduct the first large-scale code clone analysis comparing agent-generated and developer-written patches and provide a qualitative study revealing structural and stylistic differences in patch content. Together, these findings offer novel insights into agent design and open avenues for building agents that are both more effective and more aligned with human development practices.
[AI-69] Learnable Spatial-Temporal Positional Encoding for Link Prediction ICML2025
Quick Read: This paper tackles the limitations of positional encoding in graph deep learning: most existing methods use predefined, fixed functions that adapt poorly to complex attributed graphs; the few learnable encodings consider only structural information, ignoring real-world time-evolving topology and features; and most rely on transformer attention, whose dense or relational form is prohibitively expensive on large-scale structured data. The key to the solution is an effective and efficient learnable spatial-temporal positional encoding, realized in the L-STEP model: the proposed positional learning scheme provably preserves graph properties from a spatio-temporal spectral viewpoint, MLPs are shown to fully exploit the encoding and match transformer performance, and robustness is demonstrated across varying initial positional-encoding inputs, datasets, and sampling strategies.
Link: https://arxiv.org/abs/2506.08309
Authors: Katherine Tieu, Dongqi Fu, Zihao Li, Ross Maciejewski, Jingrui He
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2025. 28 pages, 1 figure, 22 tables
Abstract:Accurate predictions rely on the expressive power of graph deep learning frameworks like graph neural networks and graph transformers, where a positional encoding mechanism has become much more indispensable in recent state-of-the-art works to record the canonical position information. However, the current positional encoding is limited in three aspects: (1) most positional encoding methods use pre-defined, fixed functions, which are inadequate to adapt to the complex attributed graphs; (2) a few pioneering works proposed the learnable positional encoding but are still limited to the structural information, not considering the real-world time-evolving topological and feature information; (3) most positional encoding methods are equipped with transformers’ attention mechanism to fully leverage their capabilities, where the dense or relational attention is often unaffordable on large-scale structured data. Hence, we aim to develop Learnable Spatial-Temporal Positional Encoding in an effective and efficient manner and propose a simple temporal link prediction model named L-STEP. Briefly, for L-STEP, we (1) prove the proposed positional learning scheme can preserve the graph property from the spatial-temporal spectral viewpoint, (2) verify that MLPs can fully exploit the expressiveness and reach transformers’ performance on that encoding, (3) change different initial positional encoding inputs to show robustness, (4) analyze the theoretical complexity and obtain less empirical running time than SOTA, and (5) demonstrate its superior temporal link prediction performance on 13 classic datasets against 10 algorithms in both transductive and inductive settings using 3 different sampling strategies. Also, L-STEP obtains the leading performance in the newest large-scale TGB benchmark. Our code is available at this https URL.
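As a rough illustration of the interface such an encoding exposes (not the paper's spectral-motivated update rule; the module name and the additive time-MLP design below are assumptions), a learnable per-node positional table combined with a temporal feature might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class SpatioTemporalPE(nn.Module):
    """Hypothetical sketch: a trainable per-node positional table plus a learned
    temporal term, consumed by an MLP link predictor instead of attention."""
    def __init__(self, num_nodes: int, dim: int):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(num_nodes, dim) * 0.02)  # learnable node positions
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, node_ids: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # node_ids: (B,), timestamps: (B, 1) event times
        return self.pos[node_ids] + self.time_mlp(timestamps)

pe = SpatioTemporalPE(num_nodes=1000, dim=64)
emb = pe(torch.tensor([3, 17]), torch.tensor([[0.5], [0.9]]))  # shape (2, 64)
```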
[AI-70] AstroCompress: A benchmark dataset for multi-purpose compression of astronomical data ICLR2025
Quick Read: This paper addresses the bottleneck that bandwidth-limited data transmission places on astronomical observatories, which restricts how much data can be acquired and thus slows scientific progress. The key to the solution is neural data compression: learning compression algorithms end-to-end so as to exploit the spatial, temporal, and wavelength structures unique to astronomical images and outperform traditional lossless compression methods.
Link: https://arxiv.org/abs/2506.08306
Authors: Tuan Truong, Rithwik Sudharsan, Yibo Yang, Peter Xiangyuan Ma, Ruihan Yang, Stephan Mandt, Joshua S. Bloom
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: ICLR 2025 conference paper. See reviews at this https URL
Abstract:The site conditions that make astronomical observatories in space and on the ground so desirable – cold and dark – demand a physical remoteness that leads to limited data transmission capabilities. Such transmission limitations directly bottleneck the amount of data acquired, and in an era of costly modern observatories, any improvement in lossless data compression has the potential to scale to billions of dollars’ worth of additional science that can be accomplished on the same instrument. Traditional lossless methods for compressing astrophysical data are manually designed. Neural data compression, on the other hand, holds the promise of learning compression algorithms end-to-end from data and outperforming classical techniques by leveraging the unique spatial, temporal, and wavelength structures of astronomical images. This paper introduces AstroCompress: a neural compression challenge for astrophysics data, featuring four new datasets (and one legacy dataset) with 16-bit unsigned integer imaging data in various modes: space-based, ground-based, multi-wavelength, and time-series imaging. We provide code to easily access the data and benchmark seven lossless compression methods (three neural and four non-neural, including all practical state-of-the-art algorithms). Our results on lossless compression indicate that lossless neural compression techniques can enhance data collection at observatories, and provide guidance on the adoption of neural compression in scientific applications. Though the scope of this paper is restricted to lossless compression, we also comment on the potential exploration of lossy compression methods in future studies.
[AI-71] Sparse Interpretable Deep Learning with LIES Networks for Symbolic Regression
Quick Read: This paper targets the scalability and symbolic-consistency problems of existing symbolic regression (SR) methods, which typically rely on population-based search or autoregressive modeling and struggle to handle large-scale data while keeping expressions well-formed. The key to the solution is the LIES (Logarithm, Identity, Exponential, Sine) architecture: a fixed neural network with interpretable primitive activations optimized to model symbolic expressions. The framework trains with a suitable oversampling strategy and a tailored loss function to promote sparsity and prevent gradient instability, then applies post-training pruning to further simplify the learned expressions into compact, accurate formulae.
Link: https://arxiv.org/abs/2506.08267
Authors: Mansooreh Montazerin, Majd Al Aawar, Antonio Ortega, Ajitesh Srivastava
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Symbolic regression (SR) aims to discover closed-form mathematical expressions that accurately describe data, offering interpretability and analytical insight beyond standard black-box models. Existing SR methods often rely on population-based search or autoregressive modeling, which struggle with scalability and symbolic consistency. We introduce LIES (Logarithm, Identity, Exponential, Sine), a fixed neural network architecture with interpretable primitive activations that are optimized to model symbolic expressions. We develop a framework to extract compact formulae from LIES networks by training with an appropriate oversampling strategy and a tailored loss function to promote sparsity and to prevent gradient instability. After training, it applies additional pruning strategies to further simplify the learned expressions into compact formulae. Our experiments on SR benchmarks show that the LIES framework consistently produces sparse and accurate symbolic formulae, outperforming all baselines. We also demonstrate the importance of each design component through ablation studies.
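A minimal sketch of the core idea, a fixed layer of Logarithm/Identity/Exponential/Sine primitives with an L1 sparsity penalty, is below; the numerical guards and the exact loss are assumptions, not the paper's formulation:

```python
import torch
import torch.nn as nn

class LIESLayer(nn.Module):
    """Units apply the interpretable primitives log, identity, exp, and sin
    to linear combinations of the inputs."""
    def __init__(self, in_dim: int, units_per_primitive: int = 2):
        super().__init__()
        self.k = units_per_primitive
        self.linear = nn.Linear(in_dim, 4 * units_per_primitive)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        log_u, id_u, exp_u, sin_u = torch.split(self.linear(x), self.k, dim=-1)
        return torch.cat([
            torch.log(log_u.abs() + 1e-6),     # guarded log (assumed guard)
            id_u,                              # identity
            torch.exp(exp_u.clamp(max=10.0)),  # clamped exp for gradient stability
            torch.sin(sin_u),
        ], dim=-1)

def l1_sparsity(model: nn.Module, lam: float = 1e-3) -> torch.Tensor:
    """L1 penalty on weights so post-training pruning yields compact formulae."""
    return lam * sum(p.abs().sum() for p in model.parameters())
```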
[AI-72] SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense
Quick Read: This paper addresses two problems that conventional deep networks face in continual learning: catastrophic forgetting and vulnerability to adversarial attacks. The key to the solution is SHIELD (Secure Hypernetworks for Incremental Expansion and Learning Defense), which combines hypernetwork-based continual learning with interval arithmetic: the hypernetwork maps trainable task embedding vectors to the weights of a target model, dynamically generating a separate network per subtask while aggregating and analyzing information across all tasks, and hypercube predictions over interval-valued inputs provide strict security guarantees against potential attacks.
Link: https://arxiv.org/abs/2506.08255
Authors: Patryk Krukowski, Łukasz Gorczyca, Piotr Helm, Kamil Książek, Przemysław Spurek
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Traditional deep neural networks suffer from several limitations, including catastrophic forgetting. When models are adapted to new datasets, they tend to quickly forget previously learned knowledge. Another significant issue is the lack of robustness to even small perturbations in the input data. In practice, we can often easily perform adversarial attacks and change the network’s predictions, adding minimal noise to the input. Dedicated architectures and training procedures can solve each of the above problems separately. Unfortunately, currently, no model can simultaneously address both catastrophic forgetting and vulnerability to adversarial attacks. We introduce SHIELD (Secure Hypernetworks for Incremental Expansion and Learning Defense), a novel approach that integrates a hypernetwork-based continual learning approach with interval arithmetic. SHIELD uses the hypernetwork to transfer trainable task embedding vectors into the weights of a target model dedicated to specific data. This paradigm allows for the dynamic generation of separate networks for each subtask, while the hypernetwork aggregates and analyzes information across all tasks. The target model takes as input a data sample with a defined interval range and, by creating a hypercube, produces a prediction for the given range. Therefore, such target models provide strict guarantees against all possible attacks for data samples within the interval range. Our approach enhances security without sacrificing network adaptability, addressing the overlooked challenge of safety in continual learning.
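The interval-arithmetic half of the idea is standard and can be sketched independently of the hypernetwork: a sound propagation of an input interval through one linear layer by splitting the weights by sign (a generic interval bound propagation step, not SHIELD's specific code):

```python
import torch

def interval_linear(lower, upper, weight, bias):
    """Sound interval propagation through y = xW^T + b: each output bound is
    exact because positive and negative weights are handled separately."""
    w_pos, w_neg = weight.clamp(min=0.0), weight.clamp(max=0.0)
    new_lower = lower @ w_pos.T + upper @ w_neg.T + bias
    new_upper = upper @ w_pos.T + lower @ w_neg.T + bias
    return new_lower, new_upper

x = torch.randn(4, 8)
eps = 0.1  # interval radius around the input
lo, hi = interval_linear(x - eps, x + eps, torch.randn(16, 8), torch.randn(16))
assert torch.all(lo <= hi)  # every point in the input box maps inside [lo, hi]
```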
[AI-73] Parameter-free approximate equivariance for tasks with finite group symmetry
Quick Read: This paper addresses the fact that existing equivariant neural networks are computationally expensive, parameter-heavy, and usually tied to a specific architecture. The key to the solution is a zero-parameter method that imposes approximate equivariance for a finite group on the latent representation via an additional term in the loss function, modeling symmetry without adding any parameters.
Link: https://arxiv.org/abs/2506.08244
Authors: Riccardo Ali, Pietro Liò, Jamie Vicary
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Equivariant neural networks incorporate symmetries through group actions, embedding them as an inductive bias to improve performance on a wide variety of tasks. However, existing equivariant methods can be computationally intensive, with high parameter counts, and are often tied to a specific architecture. We propose a simple zero-parameter approach that imposes approximate equivariance for a finite group in the latent representation, as an additional term in the loss function. We conduct experiments which allow the network to learn a group representation on the latent space, and show that in every case it prefers to learn the regular representation. Fixing this action on the latent space yields a simple method to impose approximate equivariance as an additional loss penalty. We benchmark our approach on three datasets and compare it against several existing equivariant methods, showing that in many cases it achieves similar or better performance for a fraction of the parameters.
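A sketch of such a loss term, assuming the latent group action has already been fixed (the paper first lets the network learn it and observes it converges to the regular representation); the function names are illustrative:

```python
import torch.nn.functional as F

def equivariance_penalty(encoder, x, input_transforms, latent_actions):
    """Zero-parameter penalty: for each group element g, encoder(g . x) should
    match the fixed latent action of g applied to encoder(x)."""
    z = encoder(x)
    terms = [F.mse_loss(encoder(g_in(x)), g_lat(z))
             for g_in, g_lat in zip(input_transforms, latent_actions)]
    return sum(terms) / len(terms)
```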
[AI-74] Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework
Quick Read: This paper addresses the reliability, accuracy, and fairness concerns that arise when generative AI extracts clinical data from electronic health records (EHRs), issues that are essential for research, regulatory, and clinical applications; existing quality-assurance frameworks do not adequately cover the distinctive error modes and complexity of LLM-extracted data. The key to the solution is a comprehensive evaluation framework that integrates variable-level performance benchmarking, automated verification checks, and replication analyses to identify variables most in need of improvement, systematically detect latent errors, and confirm dataset fitness-for-purpose, while stratified metrics support bias assessment. This provides a rigorous and transparent method for evaluating LLM-extracted real-world data (RWD).
Link: https://arxiv.org/abs/2506.08231
Authors: Melissa Estevez, Nisha Singh, Lauren Dyson, Blythe Adamson, Qianyu Yuan, Megan W. Hildner, Erin Fidyk, Olive Mbah, Farhad Khan, Kathi Seidl-Rathkopf, Aaron B. Cohen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 18 pages, 3 tables, 1 figure
Abstract:Large language models (LLMs) are increasingly used to extract clinical data from electronic health records (EHRs), offering significant improvements in scalability and efficiency for real-world data (RWD) curation in oncology. However, the adoption of LLMs introduces new challenges in ensuring the reliability, accuracy, and fairness of extracted data, which are essential for research, regulatory, and clinical applications. Existing quality assurance frameworks for RWD and artificial intelligence do not fully address the unique error modes and complexities associated with LLM-extracted data. In this paper, we propose a comprehensive framework for evaluating the quality of clinical data extracted by LLMs. The framework integrates variable-level performance benchmarking against expert human abstraction, automated verification checks for internal consistency and plausibility, and replication analyses comparing LLM-extracted data to human-abstracted datasets or external standards. This multidimensional approach enables the identification of variables most in need of improvement, systematic detection of latent errors, and confirmation of dataset fitness-for-purpose in real-world research. Additionally, the framework supports bias assessment by stratifying metrics across demographic subgroups. By providing a rigorous and transparent method for assessing LLM-extracted RWD, this framework advances industry standards and supports the trustworthy use of AI-powered evidence generation in oncology research and practice.
[AI-75] Scaling Laws of Motion Forecasting and Planning – A Technical Report
Quick Read: This paper studies how to improve joint motion forecasting and planning models for autonomous driving by characterizing empirical scaling laws across model size, training data, and compute, so as to optimize both training- and inference-time efficiency. The key findings are that performance improves as a power law of the compute budget and that there is an optimal split between model parameters and training data: as the training compute budget grows, model size should grow 1.5x as fast as dataset size. In addition, smaller models with output sampling and clustering can match larger models up to a crossover point, suggesting new options for efficient deployment.
Link: https://arxiv.org/abs/2506.08228
Authors: Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, Rui Wang, Vinutha Kallem, Sergio Casas, Rami Al-Rfou, Benjamin Sapp, Dragomir Anguelov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a dataset of 500 thousand hours of driving, we demonstrate that, similar to language modeling, model performance improves as a power-law function of the total compute budget, and we observe a strong correlation between model training loss and model evaluation metrics. Most interestingly, closed-loop metrics also improve with scaling, which has important implications for the suitability of open-loop metrics for model development and hill climbing. We also study the optimal scaling of the number of transformer parameters and the training data size for a training compute-optimal model. We find that as the training compute budget grows, optimal scaling requires increasing the model size 1.5x as fast as the dataset size. We also study inference-time compute scaling, where we observe that sampling and clustering the output of smaller models makes them competitive with larger models, up to a crossover point beyond which a larger model becomes more inference-compute efficient. Overall, our experimental results demonstrate that optimizing the training and inference-time scaling properties of motion forecasting and planning models is a key lever for improving their performance across a wide variety of driving scenarios. Finally, we briefly study the utility of training on general logged driving data of other agents to improve the performance of the ego-agent, an important research area to address the scarcity of robotics data for large-capacity model training.
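One way to read the 1.5x statement, assuming it refers to the exponents of a compute-optimal power-law allocation (an interpretation; the abstract does not spell this out):

```python
# If model size N ~ C^a and data D ~ C^b with a = 1.5 * b and a + b = 1,
# then a = 0.6 and b = 0.4.
a, b = 0.6, 0.4
growth = 8  # example: an 8x larger training compute budget
print(f"model x{growth**a:.2f}, data x{growth**b:.2f}")  # model x3.48, data x2.30
```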
[AI-76] Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles
Quick Read: This paper addresses the low precision and limited interpretability of applying large language models (LLMs) to complex software engineering tasks. The key to the solution is Repeton, a fully open-source framework for precise, automated code manipulation via a structured patch-and-test pipeline: it iteratively diagnoses issues, proposes code changes, and validates each patch through automated testing, improving both the accuracy and the transparency of code repair.
Link: https://arxiv.org/abs/2506.08173
Authors: Nguyen Phu Vinh, Anh Chung Hoang, Chris Ngo, Truong-Son Hy
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have shown strong capabilities in code generation and comprehension, yet their application to complex software engineering tasks often suffers from low precision and limited interpretability. We present Repeton, a fully open-source framework that leverages LLMs for precise and automated code manipulation in real-world Git repositories. Rather than generating holistic fixes, Repeton operates through a structured patch-and-test pipeline: it iteratively diagnoses issues, proposes code changes, and validates each patch through automated testing. This stepwise process is guided by lightweight heuristics and development tools, avoiding reliance on embedding-based retrieval systems. Evaluated on the SWE-bench Lite benchmark, our method shows good performance compared to RAG-based methods in both patch validity and interpretability. By decomposing software engineering tasks into modular, verifiable stages, Repeton provides a practical path toward scalable and transparent autonomous debugging.
[AI-77] Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models
Quick Read: This paper investigates the limited ability of large language models (LLMs) to handle complex symbolic reasoning, specifically worst-case symbolic constraints analysis of program executions. The key to the solution is symbolic reasoning-guided fine-tuning combined with SMT (Satisfiability Modulo Theories) constraint solving, supported by a purpose-built dataset of symbolic constraints. Experiments show the approach effectively strengthens symbolic reasoning, enabling the model to accurately recover the constraints that characterize an algorithm's worst-case behavior.
Link: https://arxiv.org/abs/2506.08171
Authors: Daniel Koh, Yannic Noller, Corina S. Pasareanu, Adrians Skapars, Youcheng Sun
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have been successfully applied to a variety of coding tasks, including code generation, completion, and repair. However, more complex symbolic reasoning tasks remain largely unexplored by LLMs. This paper investigates the capacity of LLMs to reason about worst-case executions in programs through symbolic constraints analysis, aiming to connect LLMs and symbolic reasoning approaches. Specifically, we define and address the problem of worst-case symbolic constraints analysis as a measure to assess the comprehension of LLMs. We evaluate the performance of existing LLMs on this novel task and further improve their capabilities through symbolic reasoning-guided fine-tuning, grounded in SMT (Satisfiability Modulo Theories) constraint solving and supported by a specially designed dataset of symbolic constraints. Experimental results show that our solver-aligned model, WARP-1.0-3B, consistently surpasses size-matched and even much larger baselines, demonstrating that a 3B LLM can recover the very constraints that pin down an algorithm’s worst-case behaviour through reinforcement learning methods. These findings suggest that LLMs are capable of engaging in deeper symbolic reasoning, supporting a closer integration between neural network-based learning and formal methods for rigorous program analysis.
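For a flavor of what the SMT side of worst-case analysis looks like (a toy, hand-written cost model; the paper trains an LLM to recover such constraints, and the variables below are hypothetical):

```python
import z3

n, p = z3.Int("n"), z3.Int("p")
# Toy cost model: a loop that degenerates to quadratic work when the pivot is 0.
cost = n + z3.If(p == 0, n * n, n)

opt = z3.Optimize()
opt.add(1 <= n, n <= 100, 0 <= p, p < n)
opt.maximize(cost)
if opt.check() == z3.sat:
    m = opt.model()
    print(m[n], m[p])  # worst case: n = 100, p = 0
```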
[AI-78] A Metrics-Oriented Architectural Model to Characterize Complexity on Machine Learning-Enabled Systems
Quick Read: This paper asks how the complexity of machine learning-enabled systems (MLES) can be managed effectively. The key to the solution is a metrics-based architectural model that characterizes MLES complexity and supports architectural decisions, providing a guideline for the inception and growth of such systems. The paper presents the first step toward this model: an extension of a reference architecture that can describe MLES in order to collect their metrics.
Link: https://arxiv.org/abs/2506.08153
Authors: Renato Cordeiro Ferreira (1,2,3,4) ((1) University of São Paulo, (2) Jheronimus Academy of Data Science, (3) Technical University of Eindhoven, (4) Tilburg University)
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures (2 diagrams, 1 table), to be published in CAIN 2025
Abstract:How can the complexity of ML-enabled systems be managed effectively? The goal of this research is to investigate how complexity affects ML-Enabled Systems (MLES). To address this question, this research aims to introduce a metrics-based architectural model to characterize the complexity of MLES. The goal is to support architectural decisions, providing a guideline for the inception and growth of these systems. This paper showcases the first step toward creating the metrics-based architectural model: an extension of a reference architecture that can describe MLES in order to collect their metrics.
[AI-79] Compiling Metric Temporal Answer Set Programming
Quick Read: This paper addresses the scalability problem of handling quantitative temporal constraints (such as durations and deadlines) in Metric Answer Set Programming, where fine-grained timing significantly exacerbates ASP's grounding bottleneck. The key to the solution is to leverage ASP extensions with difference constraints, a simplified form of linear constraints, to handle time-related aspects externally, effectively decoupling metric ASP from the granularity of time so that solving is unaffected by temporal precision.
Link: https://arxiv.org/abs/2506.08150
Authors: Arvid Becker, Pedro Cabalar, Martin Diéguez, Javier Romero, Susana Hahn, Torsten Schaub
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:We develop a computational approach to Metric Answer Set Programming (ASP) to allow for expressing quantitative temporal constraints, like durations and deadlines. A central challenge is to maintain scalability when dealing with fine-grained timing constraints, which can significantly exacerbate ASP’s grounding bottleneck. To address this issue, we leverage extensions of ASP with difference constraints, a simplified form of linear constraints, to handle time-related aspects externally. Our approach effectively decouples metric ASP from the granularity of time, resulting in a solution that is unaffected by time precision.
[AI-80] Ego-centric Learning of Communicative World Models for Autonomous Driving
Quick Read: This paper addresses partial observability and non-stationarity in multi-agent reinforcement learning (MARL) for complex high-dimensional environments. The key to the solution is CALL (Communicative World Model): each agent builds a world model whose generative latent representation enables lightweight information sharing, reducing communication overhead and improving scalability. Each agent enriches its own world model with the low-dimensional latent representations shared by agents of interest, and exploits the model's generalization capacity to improve prediction accuracy for better planning.
Link: https://arxiv.org/abs/2506.08149
Authors: Hang Wang, Dechen Gao, Junshan Zhang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:We study multi-agent reinforcement learning (MARL) for tasks in complex high-dimensional environments, such as autonomous driving. MARL is known to suffer from partial observability and non-stationarity issues. To tackle these challenges, information sharing is often employed, which however faces major hurdles in practice, including overwhelming communication overhead and scalability concerns. By making use of generative AI embodied in a world model together with its latent representation, we develop CALL (Communicative World Model) for MARL, where 1) each agent first learns its world model that encodes its state and intention into a low-dimensional latent representation with a smaller memory footprint, which can be shared with other agents of interest via lightweight communication; and 2) each agent carries out ego-centric learning while exploiting lightweight information sharing to enrich its world model, and then exploits its generalization capacity to improve prediction for better planning. We characterize the gain in prediction accuracy from information sharing and its impact on the performance gap. Extensive experiments are carried out on the challenging local trajectory planning tasks in the CARLA platform to demonstrate the performance gains of using CALL.
[AI-81] Nearness of Neighbors Attention for Regression in Supervised Finetuning
Quick Read: This paper addresses the observation that traditional algorithms such as k-nearest neighbors (k-NN) or support vector machines, when run on embeddings from a supervised fine-tuned (SFT) feature extractor, often outperform the SFT model itself. The key to the solution is the Nearness of Neighbors Attention (NONA) regression layer, which uses neural attention mechanics and a novel learned attention-masking scheme to provide a differentiable proxy for the k-NN regression algorithm, so that a traditional predictor can be incorporated directly as a neural network layer to improve performance further.
Link: https://arxiv.org/abs/2506.08139
Authors: Aviad Susman, Mayte Suárez-Fariñas, Joseph T Colonel
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:It is common in supervised machine learning to combine the feature extraction capabilities of neural networks with the predictive power of traditional algorithms, such as k-nearest neighbors (k-NN) or support vector machines. This procedure involves performing supervised fine-tuning (SFT) on a domain-appropriate feature extractor, followed by training a traditional predictor on the resulting SFT embeddings. When used in this manner, traditional predictors often deliver increased performance over the SFT model itself, despite the fine-tuned feature extractor yielding embeddings specifically optimized for prediction by the neural network’s final dense layer. This suggests that directly incorporating traditional algorithms into SFT as prediction layers may further improve performance. However, many traditional algorithms have not been implemented as neural network layers due to their non-differentiable nature and their unique optimization requirements. As a step towards solving this problem, we introduce the Nearness of Neighbors Attention (NONA) regression layer. NONA uses the mechanics of neural network attention and a novel learned attention-masking scheme to yield a differentiable proxy of the k-NN regression algorithm. Results on multiple unstructured datasets show improved performance over both dense layer prediction and k-NN on SFT embeddings for regression.
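The attention-as-soft-k-NN idea can be sketched as follows (the learned attention-masking scheme that makes NONA a closer k-NN proxy is omitted; this is only the differentiable backbone):

```python
import torch
import torch.nn.functional as F

def soft_knn_regression(queries, keys, key_labels, temperature=1.0):
    """Differentiable k-NN proxy: predictions are attention-weighted averages of
    reference labels, with nearness taken as negative Euclidean distance."""
    sim = -torch.cdist(queries, keys) / temperature  # (B, N) nearness scores
    weights = F.softmax(sim, dim=-1)                 # soft neighborhoods
    return weights @ key_labels                      # (B,) predictions

q = torch.randn(5, 32)        # SFT embeddings to predict for
refs = torch.randn(100, 32)   # reference (training) embeddings
y = torch.randn(100)          # their regression targets
pred = soft_knn_regression(q, refs, y)
```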
[AI-82] The AI Imperative: Scaling High-Quality Peer Review in Machine Learning
Quick Read: This paper addresses the scale crisis in machine learning peer review, where surging submission volumes strain review quality, consistency, and reviewer stamina. The key to the solution is to treat AI-assisted peer review as an urgent research and infrastructure priority: a comprehensive AI-augmented ecosystem that uses Large Language Models (LLMs) as collaborators with, rather than replacements for, human judgment, improving factual verification, reviewer guidance, author-side quality improvement, and Area Chair decision support.
Link: https://arxiv.org/abs/2506.08134
Authors: Qiyao Wei, Samuel Holt, Jing Yang, Markus Wulfmeier, Mihaela van der Schaar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages, 3 figures. Position paper
Abstract:Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.
[AI-83] SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents
Quick Read: This paper addresses the inability of Large Language Models (LLMs) to execute complex, long-horizon workflows that demand strict adherence to Standard Operating Procedures (SOPs), together with the lack of public benchmarks reflecting SOP complexity, structure, and domain specifics. The key to the solution is a synthetic data generation framework for realistic, industry-grade SOPs, used to build SOP-Bench, a benchmark of over 1,800 tasks across 10 industrial domains with APIs, tool interfaces, and human-validated test cases; evaluating two prominent agent architectures on SOP-Bench exposes substantial shortcomings in tool invocation and task execution.
Link: https://arxiv.org/abs/2506.08119
Authors: Subhrangshu Nandi, Arghya Datta, Nikhil Vichare, Indranil Bhattacharya, Huzefa Raja, Jing Xu, Shayan Ray, Giuseppe Carenini, Abhi Srivastava, Aaron Chan, Man Ho Woo, Amar Kandola, Brandon Theresa, Francesco Carbone
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Under review
Abstract:Large Language Models (LLMs) demonstrate impressive general-purpose reasoning and problem-solving abilities. However, they struggle with executing complex, long-horizon workflows that demand strict adherence to Standard Operating Procedures (SOPs), a critical requirement for real-world industrial automation. Despite this need, there is a lack of public benchmarks that reflect the complexity, structure, and domain-specific nuances of SOPs. To address this, we present three main contributions. First, we introduce a synthetic data generation framework to create realistic, industry-grade SOPs that rigorously test the planning, reasoning, and tool-use capabilities of LLM-based agents. Second, using this framework, we develop SOP-Bench, a benchmark of over 1,800 tasks across 10 industrial domains, each with APIs, tool interfaces, and human-validated test cases. Third, we evaluate two prominent agent architectures: Function-Calling and ReAct Agents, on SOP-Bench, observing average success rates of only 27% and 48%, respectively. Remarkably, when the tool registry is much larger than necessary, agents invoke incorrect tools nearly 100% of the time. These findings underscore a substantial gap between current agentic capabilities of LLMs and the demands of automating real-world SOPs. Performance varies significantly by task and domain, highlighting the need for domain-specific benchmarking and architectural choices before deployment. SOP-Bench is publicly available at this http URL. We also release the prompts underpinning the data generation framework to support new domain-specific SOP benchmarks. We invite the community to extend SOP-Bench with SOPs from their industrial domains.
[AI-84] Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting
Quick Read: This paper addresses the unclear trade-off between traditional methods and emerging time series foundation models (TSFMs) for electricity price forecasting (EPF). The key to the solution is a systematic benchmark of state-of-the-art pretrained models (Chronos-Bolt, Chronos-T5, TimesFM, Moirai, Time-MoE, and TimeGPT) against classical statistical and machine learning methods, generating one-day-ahead forecasts from 2024 day-ahead auction (DAA) electricity prices for Germany, France, the Netherlands, Austria, and Belgium. The study finds that while some TSFMs perform strongly, the MSTL model with biseasonal (daily and weekly) decomposition remains consistently competitive across countries and evaluation metrics.
Link: https://arxiv.org/abs/2506.08113
Authors: Timothée Hornek, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate electricity price forecasting (EPF) is crucial for effective decision-making in power trading on the spot market. While recent advances in generative artificial intelligence (GenAI) and pre-trained large language models (LLMs) have inspired the development of numerous time series foundation models (TSFMs) for time series forecasting, their effectiveness in EPF remains uncertain. To address this gap, we benchmark several state-of-the-art pretrained models–Chronos-Bolt, Chronos-T5, TimesFM, Moirai, Time-MoE, and TimeGPT–against established statistical and machine learning (ML) methods for EPF. Using 2024 day-ahead auction (DAA) electricity prices from Germany, France, the Netherlands, Austria, and Belgium, we generate daily forecasts with a one-day horizon. Chronos-Bolt and Time-MoE emerge as the strongest among the TSFMs, performing on par with traditional models. However, the biseasonal MSTL model, which captures daily and weekly seasonality, stands out for its consistent performance across countries and evaluation metrics, with no TSFM statistically outperforming it.
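For reference, the biseasonal MSTL baseline is straightforward to reproduce with statsmodels; the forecasting step around the decomposition (how the trend, seasonal, and residual components are extrapolated) is an assumption here:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import MSTL

# Hourly day-ahead prices: daily (24h) and weekly (168h) seasonality.
idx = pd.date_range("2024-01-01", periods=24 * 120, freq="h")
prices = pd.Series(np.random.randn(len(idx)).cumsum(), index=idx)

res = MSTL(prices, periods=(24, 168)).fit()
# res.trend, res.seasonal (one column per period), and res.resid can then be
# forecast separately (e.g., seasonal-naive plus a simple model on the residual)
# and recombined into the day-ahead price forecast.
```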
[AI-85] Cognitive Weave: Synthesizing Abstracted Knowledge with a Spatio-Temporal Resonance Graph
Quick Read: This paper addresses the memory-architecture limitations that hold back continuous learning, nuanced reasoning, and dynamic adaptation in large language model (LLM) agents: existing memory systems fall short in structural flexibility, temporal awareness, and the ability to distill higher-level insights from raw interaction data. The key to the solution is Cognitive Weave, built around a multi-layered spatio-temporal resonance graph (STRG) that manages information as semantically rich insight particles (IPs), dynamically enriched with resonance keys, signifiers, and situational imprints, and connected by typed relational strands into an evolving knowledge tapestry. Its core innovation is a cognitive refinement process that autonomously synthesizes insight aggregates (IAs), higher-level knowledge structures derived from clusters of related IPs, yielding notable gains in long-horizon planning, evolving question answering, and multi-session dialogue coherence.
Link: https://arxiv.org/abs/2506.08098
Authors: Akash Vishwakarma, Hojin Lee, Mohith Suresh, Priyam Shankar Sharma, Rahul Vishwakarma, Sparsh Gupta, Yuvraj Anupam Chauhan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The emergence of capable large language model (LLM) based agents necessitates memory architectures that transcend mere data storage, enabling continuous learning, nuanced reasoning, and dynamic adaptation. Current memory systems often grapple with fundamental limitations in structural flexibility, temporal awareness, and the ability to synthesize higher-level insights from raw interaction data. This paper introduces Cognitive Weave, a novel memory framework centered around a multi-layered spatio-temporal resonance graph (STRG). This graph manages information as semantically rich insight particles (IPs), which are dynamically enriched with resonance keys, signifiers, and situational imprints via a dedicated semantic oracle interface (SOI). These IPs are interconnected through typed relational strands, forming an evolving knowledge tapestry. A key component of Cognitive Weave is the cognitive refinement process, an autonomous mechanism that includes the synthesis of insight aggregates (IAs): condensed, higher-level knowledge structures derived from identified clusters of related IPs. We present comprehensive experimental results demonstrating Cognitive Weave’s marked enhancement over existing approaches in long-horizon planning tasks, evolving question-answering scenarios, and multi-session dialogue coherence. The system achieves a notable 34% average improvement in task completion rates and a 42% reduction in mean query latency when compared to state-of-the-art baselines. Furthermore, this paper explores the ethical considerations inherent in such advanced memory systems, discusses the implications for long-term memory in LLMs, and outlines promising future research trajectories.
[AI-86] Info-Coevolution: An Efficient Framework for Data Model Coevolution
Quick Read: This paper addresses how to build datasets and train efficiently as real-world data keeps growing, in particular deciding whether a new sample or batch needs annotation or learning at all. Conventional practice retains all available data, which is suboptimal for both data and training efficiency, while active learning reduces redundancy at the cost of pipeline complexity and bias. The key to the solution is Info-Coevolution, a framework in which models and data coevolve through online selective annotation without introducing bias, using task-specific (and open-source) models to selectively annotate and integrate online and web data, improving dataset quality efficiently.
Link: https://arxiv.org/abs/2506.08070
Authors: Ziheng Qin, Hailun Xu, Wei Chee Yew, Qi Jia, Yang Luo, Kanchan Sarkar, Danhui Guan, Kai Wang, Yang You
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: V1
Abstract:Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. A fundamental yet unsolved question is: given our current model and data, does a new data (sample/batch) need annotation/learning? Conventional approaches retain all available data, leading to non-optimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, while it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a novel framework that efficiently enables models and data to coevolve through online selective annotation with no bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32% without performance loss. It determines the saving ratio automatically, without manual tuning. It can further reduce the annotation ratio to 50% with semi-supervised learning. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code is available at this https URL.
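A minimal stand-in for the online annotation gate (the paper's actual criterion and saving-ratio computation are not reproduced; predictive entropy is used here purely as an illustrative uncertainty signal):

```python
import torch

def should_annotate(model, x, threshold=1.0):
    """Annotate a new sample only when the current model is uncertain about it."""
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy > threshold  # Boolean mask: True = send to annotation
```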
[AI-87] FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning
Quick Read: This paper addresses how to directly optimize nonlinear welfare objectives in offline multi-objective reinforcement learning (MORL). Traditional approaches rely on linear scalarization to handle conflicting objectives, which cannot capture fairness-oriented goals such as Nash social welfare or max-min fairness. The key to the solution is FairDICE, which leverages distribution correction estimation to account jointly for welfare maximization and distributional regularization, enabling stable, sample-efficient policy optimization without explicit preference weights or exhaustive weight search.
Link: https://arxiv.org/abs/2506.08062
Authors: Woosung Kim, Jinho Lee, Jongmin Lee, Byung-Jun Lee
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Multi-objective Reinforcement Learning
Abstract:Multi-objective reinforcement learning (MORL) aims to optimize policies in the presence of conflicting objectives, where linear scalarization is commonly used to reduce vector-valued returns into scalar signals. While effective for certain preferences, this approach cannot capture fairness-oriented goals such as Nash social welfare or max-min fairness, which require nonlinear and non-additive trade-offs. Although several online algorithms have been proposed for specific fairness objectives, a unified approach for optimizing nonlinear welfare criteria in the offline setting-where learning must proceed from a fixed dataset-remains unexplored. In this work, we present FairDICE, the first offline MORL framework that directly optimizes nonlinear welfare objectives. FairDICE leverages distribution correction estimation to jointly account for welfare maximization and distributional regularization, enabling stable and sample-efficient learning without requiring explicit preference weights or exhaustive weight search. Across multiple offline benchmarks, FairDICE demonstrates strong fairness-aware performance compared to existing baselines.
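A small numeric example of why fairness criteria are outside the reach of linear scalarization (the returns are hypothetical per-objective values of two policies):

```python
import numpy as np

unbalanced = np.array([2.0, 2.0, 8.0])
balanced = np.array([4.0, 4.0, 4.0])

for r in (unbalanced, balanced):
    linear = r.mean()                       # linear scalarization, equal weights
    nash = float(np.exp(np.log(r).mean()))  # Nash social welfare (geometric mean)
    maxmin = r.min()                        # max-min fairness
    print(f"linear={linear:.2f}  nash={nash:.2f}  maxmin={maxmin:.2f}")
# Both policies score 4.00 under linear scalarization, but the balanced one wins
# under Nash welfare (4.00 vs 3.17) and max-min (4.00 vs 2.00): no fixed weight
# vector can express these nonlinear preferences.
```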
[AI-88] Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
Quick Read: This paper addresses the high computational cost of supervised fine-tuning (SFT) for large language models by proving that the capabilities SFT confers can be approximated with inference-time techniques such as in-context learning (ICL), without modifying model parameters. The key is a theoretical analysis showing that, under idealized assumptions (unbounded compute and access to the fine-tuning dataset), a base transformer can match SFT behavior via ICL; extending to practical settings with finite context lengths and partial dataset access, the paper derives the dataset sizes required for tasks such as text generation and linear classification, providing theoretical grounding for resource-efficient LLM deployment.
Link: https://arxiv.org/abs/2506.08060
Authors: Asankhaya Sharma
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models have transformed natural language processing, yet supervised fine-tuning (SFT) remains computationally intensive. This paper formally proves that capabilities acquired through SFT can be approximated by a base transformer model using inference-time techniques, specifically in-context learning (ICL), without altering model parameters, under idealized assumptions including unbounded computational resources and access to the fine-tuning dataset. We extend these results to practical scenarios with finite context lengths and partial dataset access. For text generation tasks with fixed output length $l$, datasets of size $\mathrm{O}\left( \frac{m V}{\varepsilon^2} \log \frac{m}{\delta} \right)$ or, with bounded context, $\mathrm{O}\left( \frac{l \log V}{\varepsilon^2} \log \frac{1}{\delta} \right)$ suffice to approximate fine-tuned behavior across $m$ contexts within error $\varepsilon$, where $V$ is the vocabulary size and $\delta$ is the failure probability. For linear classification, datasets of size $\mathrm{O}\left( \frac{d}{\varepsilon} \right)$ or, with fixed context, $\mathrm{O}\left( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} \right)$ are sufficient, where $d$ is the input dimension. Grounded in the Turing completeness of transformers, these results provide a theoretical foundation for resource-efficient deployment of large language models, with practical techniques like retrieval-augmented generation bridging theory to real-world applications.
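Plugging representative numbers into the bounded-context bound gives a feel for the scale (the values of $l$, $V$, $\varepsilon$, and $\delta$ below are arbitrary, natural log is assumed, and the hidden constant is ignored):

```python
import math

l, V, eps, delta = 128, 50_000, 0.05, 0.01
n = (l * math.log(V) / eps**2) * math.log(1 / delta)
print(f"~{n:,.0f} demonstrations")  # ~2,551,000: millions, before the hidden constant
```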
[AI-89] STAMImputer: Spatio-Temporal Attention MoE for Traffic Data Imputation IJCAI2025
Quick Read: This paper addresses two key problems in traffic data imputation: existing time-to-space sequential methods fail to extract features effectively under block-wise missing data, and static graph structures limit model flexibility when nonstationary traffic data undergoes distribution shift. The key to the solution is STAMImputer, a spatio-temporal attention mixture-of-experts network: a Mixture of Experts (MoE) framework captures latent spatio-temporal features and their influence weights to impute block-wise missing values, while a novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism dynamically balances local and global correlations across the road network, producing dynamic graphs that capture real-time spatial correlations.
Link: https://arxiv.org/abs/2506.08054
Authors: Yiming Wang, Hao Peng, Senzhang Wang, Haohua Du, Chunyang Liu, Jia Wu, Guanlin Wu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, 3 tables. Extended version of paper accepted at IJCAI 2025
Abstract:Traffic data imputation is fundamentally important to support various applications in intelligent transportation systems such as traffic flow prediction. However, existing time-to-space sequential methods often fail to effectively extract features in block-wise missing data scenarios. Meanwhile, a static graph structure for spatial feature propagation significantly constrains the model’s flexibility in handling the distribution shift issue for nonstationary traffic data. To address these issues, this paper proposes a SpatioTemporal Attention Mixture of experts network named STAMImputer for traffic data imputation. Specifically, we introduce a Mixture of Experts (MoE) framework to capture latent spatio-temporal features and their influence weights, effectively imputing block-wise missing values. A novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism is designed to dynamically balance the local and global correlations across road networks. The sampled attention vectors are utilized to generate dynamic graphs that capture real-time spatial correlations. Extensive experiments are conducted on four traffic datasets for evaluation. The results show STAMImputer achieves significant performance improvement compared with existing SOTA approaches. Our codes are available at this https URL.
[AI-90] Evaluation of Machine Learning Models in Student Academic Performance Prediction
Quick Read: This paper investigates the use of machine learning methods to predict students' academic performance in a school setting. The key to the solution is a multi-layer perceptron classifier (MLPC) combined with a feature selection approach to boost performance: MLPC reached a maximum test accuracy of 86.46%, and under 10-fold cross-validation an average test accuracy of 79.58%, suggesting the potential of neural networks as data-efficient models. Explainable machine learning methods are used to demystify the black-box models and validate the feature selection approach.
Link: https://arxiv.org/abs/2506.08047
Authors: A.G.R. Sandeepa, Sanka Mohottala
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Paper accepted for IEEE ICARC Conference (2025). 6 pages, 5 figures
Abstract:This research investigates the use of machine learning methods to forecast students’ academic performance in a school setting. Students’ data with behavioral, academic, and demographic details were used in implementations with standard classical machine learning models including the multi-layer perceptron classifier (MLPC). MLPC obtained a maximum test-set accuracy of 86.46% across all implementations. Under 10-fold cross-validation, MLPC obtained an average test-set accuracy of 79.58%, while for the train set it was 99.65%. The MLP’s better performance over other machine learning models strongly suggests the potential use of neural networks as data-efficient models. The feature selection approach played a crucial role in improving performance, and multiple evaluation approaches were used to enable comparison with the existing literature. Explainable machine learning methods were utilized to demystify the black-box models and to validate the feature selection approach.
[AI-91] UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs
Quick Read: This paper positions Agentic UAVs as a new generation of autonomous aerial intelligence that overcomes the limited autonomy of conventional UAVs in complex real-world environments, integrating perception, decision-making, memory, and collaborative planning for adaptive operation. The key is the incorporation of agentic AI capabilities that enable goal-driven behavior, contextual reasoning, and interactive autonomy, complemented by hardware innovation, learning architectures, and human-AI interaction to address technical constraints, regulatory limitations, and data-model reliability challenges.
Link: https://arxiv.org/abs/2506.08045
Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 40 pages, 6 figures
Abstract:Agentic UAVs represent a new frontier in autonomous aerial intelligence, integrating perception, decision-making, memory, and collaborative planning to operate adaptively in complex, real-world environments. Driven by recent advances in Agentic AI, these systems surpass traditional UAVs by exhibiting goal-driven behavior, contextual reasoning, and interactive autonomy. We provide a comprehensive foundation for understanding the architectural components and enabling technologies that distinguish Agentic UAVs from traditional autonomous UAVs. Furthermore, a detailed comparative analysis highlights advancements in autonomy with AI agents, learning, and mission flexibility. This study explores seven high-impact application domains: precision agriculture, construction and mining, disaster response, environmental monitoring, infrastructure inspection, logistics and security, and wildlife conservation, illustrating the broad societal value of agentic aerial intelligence. We then identify key challenges in technical constraints, regulatory limitations, and data-model reliability, and we present emerging solutions across hardware innovation, learning architectures, and human-AI interaction. Finally, a future roadmap is proposed, outlining pathways toward self-evolving aerial ecosystems, system-level collaboration, and sustainable, equitable deployments. This survey establishes a foundational framework for the future development, deployment, and governance of agentic aerial systems (Agentic UAVs) across diverse societal and industrial domains.
[AI-92] The World of AI: A Novel Approach to AI Literacy for First-year Engineering Students
Quick Read: This paper addresses first-year engineering students' lack of foundational knowledge of artificial intelligence (AI) and its societal impact at the start of their studies. The key to the solution is an interdisciplinary course accessible to undergraduates from any engineering domain that teaches the basic workings of AI systems without requiring mathematics and conveys AI's far-reaching implications. The course, co-taught by engineering and humanities faculty, comprises three modules: a planetary module on AI's dual role as a catalyst for sustainability and a contributor to environmental challenges, a societal impact module on AI bias, privacy, and fairness, and a workplace module on AI-driven job displacement and the importance of adaptation. Its novelty lies in the interdisciplinary curriculum design and a pedagogy combining technical instruction with societal discourse.
Link: https://arxiv.org/abs/2506.08041
Authors: Siddharth Siddharth, Brainerd Prince, Amol Harsh, Shreyas Ramachandran
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at AIED 2025 in the late-breaking work track
Abstract:This work presents a novel course titled The World of AI designed for first-year undergraduate engineering students with little to no prior exposure to AI. The central problem addressed by this course is that engineering students often lack foundational knowledge of AI and its broader societal implications at the outset of their academic journeys. We believe the way to address this gap is to design and deliver an interdisciplinary course that can a) be accessed by first-year undergraduate engineering students across any domain, b) enable them to understand the basic workings of AI systems sans mathematics, and c) make them appreciate AI’s far-reaching implications on our lives. The course was divided into three modules co-delivered by faculty from both engineering and humanities. The planetary module explored AI’s dual role as both a catalyst for sustainability and a contributor to environmental challenges. The societal impact module focused on AI biases and concerns around privacy and fairness. Lastly, the workplace module highlighted AI-driven job displacement, emphasizing the importance of adaptation. The novelty of this course lies in its interdisciplinary curriculum design and pedagogical approach, which combines technical instruction with societal discourse. Results revealed that students’ comprehension of AI challenges improved across diverse metrics like (a) increased awareness of AI’s environmental impact, and (b) efficient corrective solutions for AI fairness. Furthermore, it also indicated the evolution in students’ perception of AI’s transformative impact on our lives.
[AI-93] Inverse Design in Distributed Circuits Using Single-Step Reinforcement Learning
Quick Read: This paper addresses inverse design in distributed circuits: generating near-optimal designs that meet a desired transfer function specification. Existing design exploration typically relies on artificial grids, differentiable evaluation procedures, and fixed template topologies, whereas real-world design practice often involves non-differentiable evaluation, varying topologies, and near-continuous placement spaces. The key to the solution is DCIDA, which learns a near-optimal design sampling policy for a target transfer function: all design factors are decided in a compound single-step action sampled from jointly trained conditional distributions, and an injective interdependent "map" transforms the raw sampled design "actions" into uniquely equivalent physical representations, enabling the framework to learn the conditional dependencies among joint design decisions.
Link: https://arxiv.org/abs/2506.08029
Authors: Jiayu Li, Masood Mortazavi, Ning Yan, Yihong Ma, Reza Zafarani
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: A briefer version of this paper was accepted as a Work-in-Progress (WIP) at the Design Automation Conference (DAC) 2024
Abstract:The goal of inverse design in distributed circuits is to generate near-optimal designs that meet a desirable transfer function specification. Existing design exploration methods use some combination of strategies involving artificial grids, differentiable evaluation procedures, and specific template topologies. However, real-world design practices often require non-differentiable evaluation procedures, varying topologies, and near-continuous placement spaces. In this paper, we propose DCIDA, a design exploration framework that learns a near-optimal design sampling policy for a target transfer function. DCIDA decides all design factors in a compound single-step action by sampling from a set of jointly-trained conditional distributions generated by the policy. Utilizing an injective interdependent "map", DCIDA transforms raw sampled design "actions" into uniquely equivalent physical representations, enabling the framework to learn the conditional dependencies among joint "raw" design decisions. Our experiments demonstrate DCIDA’s Transformer-based policy network achieves significant reductions in design error compared to state-of-the-art approaches, with significantly better fit in cases involving more complex transfer functions.
[AI-94] Recipes for Pre-training LLMs with MXFP8
Quick Read: This paper addresses numerical instability caused by precision scaling when pre-training LLMs with the Microscaling (MX) formats of NVIDIA Blackwell GPUs. It shows that the rounding mode suggested in the OCP specification can cause training divergence on multi-trillion-token datasets, whereas an improved rounding mode that computes scaling factors with round-to-infinity enables successful MXFP8 pre-training of an 8B-parameter model. The key to the solution is optimizing the rounding strategy to improve numerical stability.
Link: https://arxiv.org/abs/2506.08027
Authors: Asit Mishra, Dusan Stosic, Simon Layton
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Precision scaling - using fewer bits to represent model parameters and related tensors during pre-training - has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats in NVIDIA’s latest Blackwell GPUs represent a major leap in enabling this precision scaling aspect. These formats combine narrow floating-point data types with per-block scaling factors, offering a fine-grained approach to quantizing tensors. Although MX-formats offer the promise of improved numeric stability compared to other reduced-precision representations, in practice they must be used carefully in order to successfully converge an LLM on a multi-trillion token dataset. In this paper, we show that the rounding mode suggested in OCP specification can lead to divergence when pre-training an LLM. We show an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
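The gist of the rounding fix can be sketched for one 32-element MX block (illustrative only: the FP8 E4M3 maximum of 448 and the pure power-of-two scale are simplifications of the actual MXFP8 recipe):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable FP8 E4M3 magnitude

def mx_block_scale(block, round_to_infinity=True):
    """Power-of-two scale for a 32-element block. Rounding the exponent toward
    infinity (ceil) keeps the block's amax representable after scaling, whereas
    truncating (floor) can clip it -- the instability the paper attributes to
    the OCP-suggested rounding."""
    amax = block.abs().amax().clamp_min(1e-12)
    exp = torch.log2(amax / FP8_E4M3_MAX)
    exp = torch.ceil(exp) if round_to_infinity else torch.floor(exp)
    return 2.0 ** exp

block = torch.randn(32)
scaled = block / mx_block_scale(block)
assert scaled.abs().max() <= FP8_E4M3_MAX  # safe to cast to FP8 afterwards
```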
[AI-95] TIP-Search: Time-Predictable Inference Scheduling for Market Prediction under Uncertain Load
Quick Read: This paper addresses the tension between strict latency requirements and predictive accuracy in real-time market prediction under uncertain workloads. The key to the solution is TIP-Search, a framework that dynamically selects a deep learning model from a heterogeneous pool to maximize predictive accuracy while satisfying each task's deadline constraint: model latency and generalization performance are profiled offline, and at run time the system performs task-aware selection without relying on explicit input-domain labels.
Link: https://arxiv.org/abs/2506.08026
Authors: Xibai Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:This paper proposes TIP-Search, a time-predictable inference scheduling framework for real-time market prediction under uncertain workloads. Motivated by the strict latency demands in high-frequency financial systems, TIP-Search dynamically selects a deep learning model from a heterogeneous pool, aiming to maximize predictive accuracy while satisfying per-task deadline constraints. Our approach profiles latency and generalization performance offline, then performs online task-aware selection without relying on explicit input domain labels. We evaluate TIP-Search on three real-world limit order book datasets (FI-2010, Binance BTC/USDT, LOBSTER AAPL) and demonstrate that it outperforms static baselines with up to 8.5% improvement in accuracy and 100% deadline satisfaction. Our results highlight the effectiveness of TIP-Search in robust low-latency financial inference under uncertainty.
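The deadline-aware selection step reduces to a simple rule over offline profiles; the model names and numbers below are hypothetical:

```python
def tip_select(models, deadline_ms):
    """Pick the most accurate model whose profiled latency meets the deadline;
    fall back to the fastest model when none fits."""
    feasible = [m for m in models if m["latency_ms"] <= deadline_ms]
    if feasible:
        return max(feasible, key=lambda m: m["accuracy"])
    return min(models, key=lambda m: m["latency_ms"])

pool = [
    {"name": "deep-lob", "accuracy": 0.74, "latency_ms": 6.0},
    {"name": "trans-lob", "accuracy": 0.78, "latency_ms": 14.0},
    {"name": "small-mlp", "accuracy": 0.69, "latency_ms": 1.2},
]
print(tip_select(pool, deadline_ms=8.0)["name"])  # -> deep-lob
```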
[AI-96] KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache
Quick Read: This paper addresses the heavy memory demands of the Key-Value (KV) cache during LLM inference, which severely limit deployment on resource-constrained platforms. The key to the solution is KVmix, a mixed-precision quantization method that uses gradient-based importance analysis to evaluate how each Key and Value projection matrix affects the model loss, enabling layer-wise bit-width allocation: important layers dynamically retain higher precision while less influential layers are aggressively quantized, yielding a tunable accuracy-efficiency balance. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, reducing memory usage while preserving high-quality sequence generation.
Link: https://arxiv.org/abs/2506.08018
Authors: Fei Li, Song Liu, Weiguo Wu, Shiqiang Nie, Jinyu Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures, 4 tables
Abstract:The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment in resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by KV Cache. However, existing methods either rely on static one-size-fits-all precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, achieving high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to optimize computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with an extremely low quantization configuration (Key 2.19 bits, Value 2.38 bits), while delivering a remarkable 4.9x memory compression and a 5.3x speedup in inference throughput.
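The gradient-based importance analysis might look roughly like the first-order (|w * dL/dw|) proxy below; the `k_proj`/`v_proj` parameter names assume a Llama-style Hugging Face model, and the scoring rule is an assumption, not the paper's exact formula:

```python
def kv_layer_importance(model, calib_batch, loss_fn):
    """Score each K/V projection by how much its weights affect the loss,
    using a first-order Taylor estimate from one backward pass."""
    model.zero_grad()
    loss = loss_fn(model(**calib_batch))
    loss.backward()
    scores = {}
    for name, p in model.named_parameters():
        if ("k_proj" in name or "v_proj" in name) and p.grad is not None:
            scores[name] = (p.detach() * p.grad).abs().sum().item()
    return scores  # higher score -> allocate more KV-cache bits to that layer
```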
[AI-97] QUITE: A Query Rewrite System Beyond Rules with LLM Agents
Quick Read: This paper addresses the limitations of traditional rule-based SQL query rewriting: limited coverage, poor generalization, and the inability to express certain rewrite techniques as fixed rules. The key to the solution is to harness the capabilities of Large Language Models (LLMs) via QUITE (Query Rewrite), a training-free, feedback-aware system that rewrites SQL queries beyond traditional rules, using a multi-agent framework, a rewrite middleware, and hint injection to substantially improve query performance and broaden the query patterns and rewrite strategies it can handle.
Link: https://arxiv.org/abs/2506.07675
Authors: Yuyang Song, Hanxu Yan, Jiale Lao, Yibo Wang, Yufei Li, Yuanchun Zhou, Jianguo Wang, Mingjie Tang
Affiliations: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:Query rewrite transforms SQL queries into semantically equivalent forms that run more efficiently. Existing approaches mainly rely on predefined rewrite rules, but they handle a limited subset of queries and can cause performance regressions. This limitation stems from three challenges of rule-based query rewrite: (1) it is hard to discover and verify new rules, (2) fixed rewrite rules do not generalize to new query patterns, and (3) some rewrite techniques cannot be expressed as fixed rules. Motivated by the fact that human experts exhibit significantly better rewrite ability but suffer from scalability, and Large Language Models (LLMs) have demonstrated nearly human-level semantic and reasoning abilities, we propose a new approach of using LLMs to rewrite SQL queries beyond rules. Due to the hallucination problems in LLMs, directly applying LLMs often leads to nonequivalent and suboptimal queries. To address this issue, we propose QUITE (query rewrite), a training-free and feedback-aware system based on LLM agents that rewrites SQL queries into semantically equivalent forms with significantly better performance, covering a broader range of query patterns and rewrite strategies compared to rule-based methods. Firstly, we design a multi-agent framework controlled by a finite state machine (FSM) to equip LLMs with the ability to use external tools and enhance the rewrite process with real-time database feedback. Secondly, we develop a rewrite middleware to enhance the ability of LLMs to generate optimized query equivalents. Finally, we employ a novel hint injection technique to improve execution plans for rewritten queries. Extensive experiments show that QUITE reduces query execution time by up to 35.8% over state-of-the-art approaches and produces 24.1% more rewrites than prior methods, covering query cases that earlier systems did not handle.
zh
[AI-98] Quantum Adiabatic Generation of Human-Like Passwords
Quick Read: This paper investigates how quantum computing can be used to generate passwords with human-like characteristics, for testing the security of authentication systems. The key to the solution is the use of adiabatic quantum computers: different encodings of token strings are studied, and novel approaches based on the Quadratic Unconstrained Binary Optimization (QUBO) and Unit-Disk Maximum Independent Set (UD-MIS) problems are proposed, so that the token distribution can be estimated from data, a quantum state prepared adiabatically, and passwords finally generated through measurement.
Link: https://arxiv.org/abs/2506.08917
Authors: Sascha Mücke,Raoul Heese,Thore Gerlach,David Biesner,Loong Kuan Lee,Nico Piatkowski
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Notes: 9 pages, 4 figures
Click to view abstract
Abstract:Generative Artificial Intelligence (GenAI) for Natural Language Processing (NLP) is the predominant AI technology to date. An important question for Quantum Computing (QC) is whether it has the potential to reduce the vast resource requirements for training and operating GenAI models. While large-scale generative NLP tasks are currently out of reach for practical quantum computers, the generation of short semantic structures such as passwords is not. Generating passwords that mimic real user behavior has many applications, for example to test an authentication system against realistic threat models. Classical password generation via deep learning has recently been investigated, with significant progress in the ability of such models to generate novel, realistic password candidates. In the present work we investigate the utility of adiabatic quantum computers for this task. More precisely, we study different encodings of token strings and propose novel approaches based on the Quadratic Unconstrained Binary Optimization (QUBO) and the Unit-Disk Maximum Independent Set (UD-MIS) problems. Our approach allows us to estimate the token distribution from data and adiabatically prepare a quantum state from which we eventually sample the generated passwords via measurements. Our results show that relatively small samples of 128 passwords, generated on the QuEra Aquila 256-qubit neutral atom quantum computer, contain human-like passwords such as “Tunas200992” or “teedem28iglove”.
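One plausible way to picture a QUBO encoding for this task (the paper's exact encodings differ) is to let a binary variable x[t, c] mean "character c at position t", penalize violations of the one-hot constraint at each position, and couple adjacent positions by bigram negative log-likelihoods estimated from data. The toy corpus, password length, and penalty weight below are assumptions:

```python
import numpy as np
from collections import Counter

passwords = ["tuna2009", "iglove28", "teedem99"]       # toy training data
alphabet = sorted(set("".join(passwords)))
A, T = len(alphabet), 8                                # alphabet size, password length
idx = {ch: i for i, ch in enumerate(alphabet)}

bigrams = Counter((p[i], p[i + 1]) for p in passwords for i in range(len(p) - 1))
total = sum(bigrams.values())
logp = np.full((A, A), -5.0)                           # smoothed log-probabilities
for (a, b), n in bigrams.items():
    logp[idx[a], idx[b]] = np.log(n / total)

n_vars = T * A                                         # one binary variable per (t, c)
Q = np.zeros((n_vars, n_vars))
lam = 10.0                                             # one-hot penalty weight
v = lambda t, c: t * A + c
for t in range(T):
    for c in range(A):
        Q[v(t, c), v(t, c)] -= lam                     # from expanding lam * (sum_c x - 1)^2
        for c2 in range(A):
            if c2 != c:
                Q[v(t, c), v(t, c2)] += lam            # pairwise one-hot penalty
            if t + 1 < T:
                Q[v(t, c), v(t + 1, c2)] -= logp[c, c2]  # NLL coupling: unlikely bigrams cost more

def energy(x: np.ndarray) -> float:
    # QUBO objective x^T Q x; on hardware, Q would be embedded on the annealer
    # and low-energy bit strings decoded back into likely passwords.
    return float(x @ Q @ x)
```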
zh
[AI-99] Solving excited states for long-range interacting trapped ions with neural networks
Quick Read: This paper tackles the difficulty of computing excited states in strongly interacting quantum many-body systems, a problem made extremely challenging by the exponential growth of the Hilbert-space dimension with system size. The proposed solution is the neural-network-based neural quantum excited-state (NQES) algorithm, whose key feature is that it can efficiently and accurately output multiple low-lying excited states at once, requires no explicit orthogonalization, and applies to high-dimensional systems. Applications to a variety of models validate the algorithm's effectiveness and scalability for computing excited states and their related observables in many-body systems.
Link: https://arxiv.org/abs/2506.08594
Authors: Yixuan Ma,Chang Liu,Weikang Li,Shun-Yao Zhang,L.-M. Duan,Yukai Wu,Dong-Ling Deng
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:The computation of excited states in strongly interacting quantum many-body systems is of fundamental importance. Yet, it is notoriously challenging due to the exponential scaling of the Hilbert space dimension with the system size. Here, we introduce a neural network-based algorithm that can simultaneously output multiple low-lying excited states of a quantum many-body spin system in an accurate and efficient fashion. This algorithm, dubbed the neural quantum excited-state (NQES) algorithm, requires no explicit orthogonalization of the states and is generally applicable to higher dimensions. We demonstrate, through concrete examples including the Haldane-Shastry model with all-to-all interactions, that the NQES algorithm is capable of efficiently computing multiple excited states and their related observable expectations. In addition, we apply the NQES algorithm to two classes of long-range interacting trapped-ion systems in a two-dimensional Wigner crystal. For non-decaying all-to-all interactions with alternating signs, our computed low-lying excited states bear spatial correlation patterns similar to those of the ground states, which closely match recent experimental observations that the quasi-adiabatically prepared state accurately reproduces analytical ground-state correlations. For a system of up to 300 ions with power-law decaying antiferromagnetic interactions, we successfully uncover its gap scaling and correlation features. Our results establish a scalable and efficient algorithm for computing excited states of interacting quantum many-body systems, which holds potential applications ranging from benchmarking quantum devices to photoisomerization.
zh
[AI-100] Flow-Lenia: Emergent evolutionary dynamics in mass conservative continuous cellular automata
Quick Read: This paper asks how, within artificial life research, to create artificial systems that spontaneously generate properties of biological systems such as self-replication, self-organization, evolution, and open-endedness. The key to the solution is Flow-Lenia, a mass-conservative extension of Lenia: the update-rule parameters can be optimized to generate spatially localized patterns (SLPs) with complex behaviors, and multispecies simulations become possible by embedding the model parameters within the system's own dynamics, promoting the emergence of complex, life-like behaviors.
Link: https://arxiv.org/abs/2506.08569
Authors: Erwan Plantec,Gautier Hamon,Mayalen Etcheverry,Bert Wang-Chak Chan,Pierre-Yves Oudeyer,Clément Moulin-Frier
Affiliations: Unknown
Categories: Cellular Automata and Lattice Gases (nlin.CG); Artificial Intelligence (cs.AI)
Notes: This manuscript has been accepted for publication in the Artificial Life journal (this https URL)
Click to view abstract
Abstract:Central to the artificial life endeavour is the creation of artificial systems spontaneously generating properties found in the living world such as autopoiesis, self-replication, evolution and open-endedness. While numerous models and paradigms have been proposed, cellular automata (CA) have taken a very important place in the field, notably because they enable the study of phenomena like self-reproduction and autopoiesis. Continuous CA like Lenia have been shown to produce life-like patterns reminiscent, from an aesthetic and ontological point of view, of biological organisms we call creatures. In this paper we propose Flow-Lenia, a mass conservative extension of Lenia. We present experiments demonstrating its effectiveness in generating spatially localized patterns (SLPs) with complex behaviors and show that the update rule parameters can be optimized to generate complex creatures showing behaviors of interest. Furthermore, we show that Flow-Lenia allows us to embed the parameters of the model, which define the properties of the emerging patterns, within its own dynamics, thus allowing for multispecies simulations. By using the evolutionary activity framework as well as other metrics, we shed light on the emergent evolutionary dynamics taking place in this system.
zh
[AI-101] Domain Switching on the Pareto Front: Multi-Objective Deep Kernel Learning in Automated Piezoresponse Force Microscopy
Quick Read: This paper addresses the problem of systematically exploring the relationship between polarization-switching behavior in ferroelectric materials and complex local microstructural features, a dependence that traditional manual or grid-based spectroscopic measurements cannot handle. The key to the solution is a multi-objective kernel-learning workflow that infers the microstructural rules governing switching behavior directly from high-resolution imaging data. Combined with automated piezoresponse force microscopy (PFM) experiments, it efficiently identifies the key relationships between domain-wall configurations and local switching kinetics, and post-experiment analysis maps abstract reward functions onto physically interpretable descriptors, enabling both high-throughput active learning and mechanistic insight into the microstructural control of switching phenomena.
Link: https://arxiv.org/abs/2506.08073
Authors: Yu Liu,Utkarsh Pratiush,Kamyar Barakati,Hiroshi Funakubo,Ching-Che Lin,Jaegyu Kim,Lane W. Martin,Sergei V. Kalinin
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Ferroelectric polarization switching underpins the functional performance of a wide range of materials and devices, yet its dependence on complex local microstructural features renders systematic exploration by manual or grid-based spectroscopic measurements impractical. Here, we introduce a multi-objective kernel-learning workflow that infers the microstructural rules governing switching behavior directly from high-resolution imaging data. Applied to automated piezoresponse force microscopy (PFM) experiments, our framework efficiently identifies the key relationships between domain-wall configurations and local switching kinetics, revealing how specific wall geometries and defect distributions modulate polarization reversal. Post-experiment analysis projects abstract reward functions, such as switching ease and domain symmetry, onto physically interpretable descriptors including domain configuration and proximity to boundaries. This enables not only high-throughput active learning, but also mechanistic insight into the microstructural control of switching phenomena. While demonstrated for ferroelectric domain switching, our approach provides a powerful, generalizable tool for navigating complex, non-differentiable design spaces, from structure-property correlations in molecular discovery to combinatorial optimization across diverse imaging modalities.
zh
[AI-102] WWAggr: A Window Wasserstein-based Aggregation for Ensemble Change Point Detection
Quick Read: This paper targets change point detection (CPD) in high-dimensional data streams, which is challenging when data patterns are complex and violate common assumptions. Detectors based on standalone deep neural networks have yet to reach the desired quality; ensembling offers a more robust alternative, but standard prediction-aggregation techniques (such as averaging) perform poorly and ignore the peculiarities of the problem. The key to the solution is WWAggr, a task-specific, Wasserstein-distance-based ensemble aggregation method that works well with various ensembles of deep CPD models and effectively resolves the long-standing problem of decision-threshold selection in change point detection.
Link: https://arxiv.org/abs/2506.08066
Authors: Alexander Stepikin,Evgenia Romanenkova,Alexey Zaytsev
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Change Point Detection (CPD) aims to identify moments of abrupt distribution shifts in data streams. Real-world high-dimensional CPD remains challenging due to data pattern complexity and violation of common assumptions. The current state-of-the-art detectors, which rely on standalone deep neural networks, have yet to achieve perfect quality. Concurrently, ensembling provides more robust solutions, boosting performance. In this paper, we investigate ensembles of deep change point detectors and observe that standard prediction aggregation techniques, e.g., averaging, are suboptimal and fail to account for the problem's peculiarities. Alternatively, we introduce WWAggr – a novel task-specific method of ensemble aggregation based on the Wasserstein distance. Our procedure is versatile, working effectively with various ensembles of deep CPD models. Moreover, unlike existing solutions, it effectively resolves the long-standing problem of decision threshold selection for CPD.
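A rough sketch of Wasserstein-based aggregation over a sliding window, assuming each detector emits a per-timestep change score; the window split and sizes are illustrative, not the paper's exact procedure:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_aggregate(scores: np.ndarray, window: int = 50) -> np.ndarray:
    """scores: (n_models, T) matrix of per-detector change scores."""
    T = scores.shape[1]
    agg = np.zeros(T)
    half = window // 2
    for t in range(window, T):
        # Pool every ensemble member's scores in each half-window into one
        # empirical distribution, then compare "before" vs "after".
        left = scores[:, t - window : t - half].ravel()
        right = scores[:, t - half : t].ravel()
        agg[t] = wasserstein_distance(left, right)
    return agg

rng = np.random.default_rng(0)
ens = rng.normal(0.0, 1.0, size=(5, 300))
ens[:, 150:] += 2.0                                   # inject a shift at t = 150
peak = int(np.argmax(wasserstein_aggregate(ens)))
print(peak)                                           # peaks shortly after t = 150
```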
zh
[AI-103] CaliciBoost: Performance-Driven Evaluation of Molecular Representations for Caco-2 Permeability Prediction
Quick Read: This paper aims to improve the accuracy and efficiency of computational predictions of oral absorption in early drug development, specifically the prediction of Caco-2 permeability. The key to the solution is a systematic evaluation of eight types of molecular feature representations (including 2D/3D descriptors, structural fingerprints, and deep-learning-based embeddings) combined with automated machine learning (AutoML), validated on two datasets of differing scale and diversity (the TDC benchmark and curated OCHEM data). The study finds that PaDEL, Mordred, and RDKit descriptors are particularly effective for Caco-2 prediction, that the AutoML-based CaliciBoost model achieves the best MAE, and that adding 3D descriptors markedly improves predictive accuracy.
Link: https://arxiv.org/abs/2506.08059
Authors: Huong Van Le,Weibin Ren,Junhong Kim,Yukyung Yun,Young Bin Park,Young Jun Kim,Bok Kyung Han,Inho Choi,Jong IL Park,Hwi-Yeol Yun,Jae-Mun Choi
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 49 pages, 11 figures
Click to view abstract
Abstract:Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated the impact of eight molecular feature representation types including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings combined with automated machine learning techniques to predict Caco-2 permeability. Using two datasets of differing scale and diversity (TDC benchmark and curated OCHEM data), we assessed model performance across representations and identified PaDEL, Mordred, and RDKit descriptors as particularly effective for Caco-2 prediction. Notably, the AutoML-based model CaliciBoost achieved the best MAE performance. Furthermore, for both PaDEL and Mordred representations, the incorporation of 3D descriptors resulted in a 15.73% reduction in MAE compared to using 2D features alone, as confirmed by feature importance analysis. These findings highlight the effectiveness of AutoML approaches in ADMET modeling and offer practical guidance for feature selection in data-limited prediction tasks.
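A small sketch of the descriptor-plus-learner pipeline evaluated above, using a handful of RDKit 2D descriptors and a plain gradient-boosting regressor as a stand-in for the AutoML-built CaliciBoost; the SMILES strings and permeability labels are toy values:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import GradientBoostingRegressor

def featurize(smiles: str) -> list[float]:
    # A few classical 2D descriptors; full pipelines use hundreds of features.
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol),
            Descriptors.NumHAcceptors(mol), Descriptors.NumRotatableBonds(mol)]

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
y = np.array([-4.2, -4.5, -5.1, -4.0])   # toy log Papp labels, not real data

X = np.array([featurize(s) for s in smiles])
model = GradientBoostingRegressor(random_state=0).fit(X, y)
print(model.predict(X[:1]))              # predicted permeability for the first molecule
```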
zh
[AI-104] Physics-Informed Teleconnection-Aware Transformer for Global Subseasonal-to-Seasonal Forecasting
Quick Read: This paper addresses the challenge of subseasonal-to-seasonal (S2S) forecasting - accurately predicting climate conditions weeks to months ahead - which is limited by the chaotic dynamics of the atmosphere and complex multi-scale interactions. Existing methods typically fail to explicitly model the physical processes and teleconnections that matter at S2S timescales. The key to the solution is TelePiT, a deep learning architecture that combines multi-scale physics with teleconnection awareness through three core components: a spherical harmonic embedding, a multi-scale physics-informed neural ODE, and a teleconnection-aware Transformer, together yielding substantial gains in global S2S forecast accuracy.
Link: https://arxiv.org/abs/2506.08049
Authors: Tengfei Lyu,Weijia Zhang,Hao Liu
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Subseasonal-to-seasonal (S2S) forecasting, which predicts climate conditions from several weeks to months in advance, presents significant challenges due to the chaotic dynamics of atmospheric systems and complex interactions across multiple scales. Current approaches often fail to explicitly model underlying physical processes and teleconnections that are crucial at S2S timescales. We introduce TelePiT, a novel deep learning architecture that enhances global S2S forecasting through integrated multi-scale physics and teleconnection awareness. Our approach consists of three key components: (1) Spherical Harmonic Embedding, which accurately encodes global atmospheric variables onto spherical geometry; (2) Multi-Scale Physics-Informed Neural ODE, which explicitly captures atmospheric physical processes across multiple learnable frequency bands; (3) Teleconnection-Aware Transformer, which models critical global climate interactions through tactfully injecting teleconnection patterns into the self-attention. Extensive experiments demonstrate that TelePiT significantly outperforms state-of-the-art data-driven baselines and operational numerical weather prediction systems, with remarkable improvements for atmospheric variables including a 57.7% reduction in RMSE for 2-meter temperature compared to previous best models.
zh
[AI-105] ChemGraph: An Agentic Framework for Computational Chemistry Workflows
Quick Read: This paper aims to simplify atomistic simulation workflows in computational chemistry and materials science, which are complex and depend on expert knowledge and manual effort, thereby improving simulation efficiency and accessibility. The key to the solution is the ChemGraph framework, which couples generative AI with state-of-the-art simulation tools: graph neural network-based foundation models provide efficient yet accurate calculations, while large language models (LLMs) handle natural language understanding, task planning, and scientific reasoning, resulting in an intuitive, interactive system for automating computational workflows.
Link: https://arxiv.org/abs/2506.06363
Authors: Thang D. Pham,Aditya Tanikanti,Murat Keçeli
Affiliations: Unknown
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Notes:
Click to view abstract
Abstract:Atomistic simulations are essential tools in chemistry and materials science, accelerating the discovery of novel catalysts, energy storage materials, and pharmaceuticals. However, running these simulations remains challenging due to the wide range of computational methods, diverse software ecosystems, and the need for expert knowledge and manual effort for the setup, execution, and validation stages. In this work, we present ChemGraph, an agentic framework powered by artificial intelligence and state-of-the-art simulation tools to streamline and automate computational chemistry and materials science workflows. ChemGraph leverages graph neural network-based foundation models for accurate yet computationally efficient calculations and large language models (LLMs) for natural language understanding, task planning, and scientific reasoning to provide an intuitive and interactive interface. Users can perform tasks such as molecular structure generation, single-point energy, geometry optimization, vibrational analysis, and thermochemistry calculations with methods ranging from tight-binding and machine learning interatomic potentials to density functional theory or wave function theory-based methods. We evaluate ChemGraph across 13 benchmark tasks and demonstrate that smaller LLMs (GPT-4o-mini, Claude-3.5-haiku, Qwen2.5-14B) perform well on simple workflows, while more complex tasks benefit from using larger models like GPT-4o. Importantly, we show that decomposing complex tasks into smaller subtasks through a multi-agent framework enables smaller LLM models to match or exceed GPT-4o’s performance in specific scenarios.
zh
[AI-106] Large Language Models for EEG: A Comprehensive Survey and Taxonomy
Quick Read: This paper examines how to combine large language models (LLMs) with electroencephalography (EEG) research to advance neural decoding, brain-computer interfaces (BCIs), and affective computing. The key to the solution is leveraging Transformer-based architectures, adapted via fine-tuning, few-shot, and zero-shot learning, so that EEG-based models can perform complex tasks such as natural language generation, semantic interpretation, and diagnostic assistance. Through a systematic review and a structured taxonomy, the paper provides a foundational resource for future research that bridges natural language processing and neural signal analysis.
Link: https://arxiv.org/abs/2506.06353
Authors: Naseem Babu,Jimson Mathew,A. P. Vinod
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:The growing convergence between Large Language Models (LLMs) and electroencephalography (EEG) research is enabling new directions in neural decoding, brain-computer interfaces (BCIs), and affective computing. This survey offers a systematic review and structured taxonomy of recent advancements that utilize LLMs for EEG-based analysis and applications. We organize the literature into four domains: (1) LLM-inspired foundation models for EEG representation learning, (2) EEG-to-language decoding, (3) cross-modal generation including image and 3D object synthesis, and (4) clinical applications and dataset management tools. The survey highlights how transformer-based architectures adapted through fine-tuning, few-shot, and zero-shot learning have enabled EEG-based models to perform complex tasks such as natural language generation, semantic interpretation, and diagnostic assistance. By offering a structured overview of modeling strategies, system designs, and application areas, this work serves as a foundational resource for future work to bridge natural language processing and neural signal analysis through language models.
zh
Machine Learning
[LG-0] Understanding Task Vectors in In-Context Learning: Emergence, Functionality and Limitations
Link: https://arxiv.org/abs/2506.09048
Authors: Yuxin Dong,Jiachen Jiang,Zhihui Zhu,Xia Ning
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Task vectors offer a compelling mechanism for accelerating inference in in-context learning (ICL) by distilling task-specific information into a single, reusable representation. Despite their empirical success, the underlying principles governing their emergence and functionality remain unclear. This work proposes the Linear Combination Conjecture, positing that task vectors act as single in-context demonstrations formed through linear combinations of the original ones. We provide both theoretical and empirical support for this conjecture. First, we show that task vectors naturally emerge in linear transformers trained on triplet-formatted prompts through loss landscape analysis. Next, we predict the failure of task vectors on representing high-rank mappings and confirm this on practical LLMs. Our findings are further validated through saliency analyses and parameter visualization, suggesting an enhancement of task vectors by injecting multiple ones into few-shot prompts. Together, our results advance the understanding of task vectors and shed light on the mechanisms underlying ICL in transformer-based models.
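A simple way to picture a test of the Linear Combination Conjecture is a least-squares fit of the task vector onto the span of the demonstration representations; the embeddings below are random placeholders for hidden states extracted from a real transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
demos = rng.normal(size=(8, 256))            # 8 demonstration representations
coeffs = rng.normal(size=8)
task_vec = coeffs @ demos                    # synthetic "task vector" in their span

# Least-squares fit of the task vector as a linear combination of the demos.
w, *_ = np.linalg.lstsq(demos.T, task_vec, rcond=None)
recon = w @ demos
rel_err = np.linalg.norm(task_vec - recon) / np.linalg.norm(task_vec)
print(f"relative reconstruction error: {rel_err:.2e}")  # near 0 supports the conjecture
```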
[LG-1] The Decoupled Risk Landscape in Performative Prediction
Link: https://arxiv.org/abs/2506.09044
Authors: Javier Sanguino,Thomas Kehrenberg,Jose A. Lozano,Novi Quadrianto
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Performative Prediction addresses scenarios where deploying a model induces a distribution shift in the input data, such as individuals modifying their features and reapplying for a bank loan after rejection. The literature has mostly taken a theoretical perspective, giving mathematical guarantees for convergence (either to the stable or the optimal point). We believe that visualization of the loss landscape can complement these theoretical advances with practical insights. Therefore, (1) we introduce a simple decoupled risk visualization method inspired by the two-step process that performative prediction entails. Our approach visualizes the risk landscape with respect to two parameter vectors: model parameters and data parameters. We use this method to propose new properties of the points of interest, to examine how existing algorithms traverse the risk landscape, and to assess how they perform under more realistic conditions, including strategic classification with non-linear models. (2) Building on this decoupled risk visualization, we introduce a novel setting - extended Performative Prediction - which captures scenarios where the distribution reacts to a model different from the decision-making one, reflecting the reality that agents often lack full access to the deployed model.
[LG-2] SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning
Link: https://arxiv.org/abs/2506.09016
Authors: Ruiqi Zhang,Daman Arora,Song Mei,Andrea Zanette
Categories: Machine Learning (cs.LG)
Notes: pre-print
Click to view abstract
Abstract:Training large language models with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning abilities, yet remains computationally expensive due to inefficient uniform prompt sampling. We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), an adaptive online RL curriculum that selectively chooses training examples of intermediate difficulty to maximize learning efficiency. Theoretically, we establish that intermediate-difficulty prompts improve the gradient estimator’s signal-to-noise ratio, accelerating convergence. Empirically, our efficient implementation leads to 2x to 6x faster training without degrading accuracy, requires no manual tuning, and integrates seamlessly into standard RL algorithms.
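The selection rule can be sketched as follows, assuming a verifier that scores rollouts as pass/fail; the rollout stub and thresholds are illustrative, not SPEED's exact estimator:

```python
import numpy as np

def pass_rate(prompt, policy, n_rollouts: int = 8) -> float:
    # Fraction of sampled answers the verifier accepts (stubbed here).
    return np.mean([policy(prompt) for _ in range(n_rollouts)])

def select_batch(prompts, policy, low: float = 0.2, high: float = 0.8):
    rates = {p: pass_rate(p, policy) for p in prompts}
    # With a Bernoulli pass/fail reward at rate p, the learning signal scales
    # with p * (1 - p): it vanishes for trivially easy or impossible prompts,
    # so intermediate-difficulty prompts give the best signal-to-noise ratio.
    return [p for p, r in rates.items() if low <= r <= high]

rng = np.random.default_rng(0)
toy_policy = lambda prompt: rng.random() < prompt     # prompt value encodes difficulty
batch = select_batch(np.linspace(0, 1, 11), toy_policy)
print(batch)                                          # keeps mid-difficulty prompts
```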
[LG-3] Effective Data Pruning through Score Extrapolation
Link: https://arxiv.org/abs/2506.09010
Authors: Sebastian Schmidt,Prasanga Dhungel,Christoffer Löffler,Björn Nieth,Stephan Günnemann,Leo Schwinn
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Training advanced machine learning models demands massive datasets, resulting in prohibitive computational costs. To address this challenge, data pruning techniques identify and remove redundant training samples while preserving model performance. Yet, existing pruning techniques predominantly require a full initial training pass to identify removable samples, negating any efficiency benefits for single training runs. To overcome this limitation, we introduce a novel importance score extrapolation framework that requires training on only a small subset of data. We present two initial approaches in this framework - k-nearest neighbors and graph neural networks - to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset. We demonstrate the effectiveness of our approach for 2 state-of-the-art pruning methods (Dynamic Uncertainty and TDDS), 4 different datasets (CIFAR-10, CIFAR-100, Places-365, and ImageNet), and 3 training paradigms (supervised, unsupervised, and adversarial). Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks.
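The k-nearest-neighbors variant of score extrapolation admits a compact sketch: score a small subset exactly, then regress the scores of all remaining samples from their neighbors in an embedding space. Embeddings and the "true" importance scores below are synthetic stand-ins:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 64))                 # embeddings of the full dataset
true_score = np.linalg.norm(emb, axis=1)            # pretend "importance" score

subset = rng.choice(len(emb), size=500, replace=False)   # only 5% scored exactly
knn = KNeighborsRegressor(n_neighbors=10).fit(emb[subset], true_score[subset])
pred = knn.predict(emb)                             # extrapolate to everything

keep = np.argsort(pred)[int(0.3 * len(emb)):]       # prune the lowest-scoring 30%
corr = np.corrcoef(pred, true_score)[0, 1]
print(f"score correlation: {corr:.3f}, kept {len(keep)} samples")
```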
[LG-4] Branched Schrödinger Bridge Matching
Link: https://arxiv.org/abs/2506.09007
Authors: Sophia Tang,Yinuo Zhang,Alexander Tong,Pranam Chatterjee
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger Bridge Matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture branched or divergent evolution from a common origin to multiple distinct outcomes. To address this, we introduce Branched Schrödinger Bridge Matching (BranchSBM), a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.
[LG-5] On Finetuning Tabular Foundation Models
Link: https://arxiv.org/abs/2506.08982
Authors: Ivan Rubachev,Akim Kotelnikov,Nikolay Kartashev
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Foundation models are an emerging research direction in tabular deep learning. Notably, TabPFNv2 recently claimed superior performance over traditional GBDT-based methods on small-scale datasets using an in-context learning paradigm, which does not adapt model parameters to target datasets. However, the optimal finetuning approach for adapting tabular foundational models, and how this adaptation reshapes their internal mechanisms, remains underexplored. While prior works studied finetuning for earlier foundational models, inconsistent findings and TabPFNv2’s unique architecture necessitate fresh investigation. To address these questions, we first systematically evaluate various finetuning strategies on diverse datasets. Our findings establish full finetuning as the most practical solution for TabPFNv2 in terms of time-efficiency and effectiveness. We then investigate how finetuning alters TabPFNv2’s inner mechanisms, drawing an analogy to retrieval-augmented models. We reveal that the success of finetuning stems from the fact that after gradient-based adaptation, the dot products of the query-representations of test objects and the key-representations of in-context training objects more accurately reflect their target similarity. This improved similarity allows finetuned TabPFNv2 to better approximate target dependency by appropriately weighting relevant in-context samples, improving the retrieval-based prediction logic. From the practical perspective, we managed to finetune TabPFNv2 on datasets with up to 50K objects, observing performance improvements on almost all tasks. More precisely, on academic datasets with I.I.D. splits, finetuning allows TabPFNv2 to achieve state-of-the-art results, while on datasets with gradual temporal shifts and rich feature sets, TabPFNv2 is less stable and prior methods remain better.
[LG-6] KARMA: A Multilevel Decomposition Hybrid Mamba Framework for Multivariate Long-Term Time Series Forecasting
Link: https://arxiv.org/abs/2506.08939
Authors: Hang Ye,Gaoxiang Duan,Haoran Zeng,Yangxin Zhu,Lingxue Meng,Xiaoying Zheng,Yongxin Zhu
Categories: Machine Learning (cs.LG)
Notes: 10 pages, 3 figures, published in WASA2025
Click to view abstract
Abstract:Efficient multivariate long-term time series forecasting is a key requirement for a variety of practical applications, and time series data exhibit complex, interleaved temporal dynamics that call for decomposition-based modeling. Traditional time series decomposition methods are rigid and rely on fixed rules, which is insufficient for mining the latent information in a series and for adapting to the dynamic characteristics of complex series. On the other hand, Transformer-based models for time series forecasting struggle to effectively model long sequences and intricate dynamic relationships due to their high computational complexity. To overcome these limitations, we introduce KARMA, with an Adaptive Time Channel Decomposition module (ATCD) to dynamically extract trend and seasonal components. It further integrates a Hybrid Frequency-Time Decomposition module (HFTD) to further decompose the series into frequency-domain and time-domain components. These components are coupled with a multi-scale Mamba-based KarmaBlock to efficiently process global and local information in a coordinated manner. Experiments on eight real-world datasets from diverse domains demonstrate that KARMA significantly outperforms mainstream baseline methods in both predictive accuracy and computational efficiency. Code and full results are available at this repository: this https URL
[LG-7] BioLangFusion: Multimodal Fusion of DNA, mRNA and Protein Language Models ICML2025
Link: https://arxiv.org/abs/2506.08936
Authors: Amina Mollaysa,Artem Moskale,Pushpak Pati,Tommaso Mansi,Mangal Prakash,Rui Liao
Categories: Machine Learning (cs.LG)
Notes: Proceedings of ICML 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
Click to view abstract
Abstract:We present BioLangFusion, a simple approach for integrating pre-trained DNA, mRNA, and protein language models into unified molecular representations. Motivated by the central dogma of molecular biology (information flow from gene to transcript to protein), we align per-modality embeddings at the biologically meaningful codon level (three nucleotides encoding one amino acid) to ensure direct cross-modal correspondence. BioLangFusion studies three standard fusion techniques: (i) codon-level embedding concatenation, (ii) entropy-regularized attention pooling inspired by multiple-instance learning, and (iii) cross-modal multi-head attention – each technique providing a different inductive bias for combining modality-specific signals. These methods require no additional pre-training or modification of the base models, allowing straightforward integration with existing sequence-based foundation models. Across five molecular property prediction tasks, BioLangFusion outperforms strong unimodal baselines, showing that even simple fusion of pre-trained models can capture complementary multi-omic information with minimal overhead.
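The simplest of the three fusion strategies, codon-level concatenation, is easy to sketch at the shape level: pool the per-nucleotide DNA and mRNA embeddings over codons so all modalities share the protein's length, then concatenate. The embedding dimensions and random tensors below are placeholders for real model outputs:

```python
import numpy as np

L = 100                                    # protein length in codons
dna = np.random.randn(3 * L, 512)          # per-nucleotide DNA embeddings
mrna = np.random.randn(3 * L, 640)         # per-nucleotide mRNA embeddings
prot = np.random.randn(L, 1280)            # per-residue protein embeddings

# Average each consecutive nucleotide triplet into one codon-level vector.
dna_codon = dna.reshape(L, 3, -1).mean(axis=1)    # (L, 512)
mrna_codon = mrna.reshape(L, 3, -1).mean(axis=1)  # (L, 640)

fused = np.concatenate([dna_codon, mrna_codon, prot], axis=1)  # (L, 2432)
print(fused.shape)
```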
[LG-8] Local MDI: Local Feature Importances for Tree-Based Models
Link: https://arxiv.org/abs/2506.08928
Authors: Zhongyuan Liang,Zachary T. Rewolinski,Abhineet Agarwal,Tiffany M. Tang,Bin Yu
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Notes:
Click to view abstract
Abstract:Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local (i.e. sample-specific) feature importance (LFI) methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model’s internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a feature importance method which exploits an equivalence between decision trees and linear models on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework to the sample specific setting. LMDI+ outperforms existing baselines LIME and TreeSHAP in identifying instance-specific signal features, averaging a 10% improvement in downstream task performance across twelve real-world benchmark datasets. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across multiple random forest fits. Finally, LMDI+ enables local interpretability use cases, including the identification of closer counterfactuals and the discovery of homogeneous subgroups.
[LG-9] Enhancing generalizability of model discovery across parameter space with multi-experiment equation learning (ME-EQL)
Link: https://arxiv.org/abs/2506.08916
Authors: Maria-Veronica Ciocanel,John T. Nardini,Kevin B. Flores,Erica M. Rutter,Suzanne S. Sindi,Alexandria Volkening
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Quantitative Methods (q-bio.QM)
Notes: 31 pages, 10 figures
Click to view abstract
Abstract:Agent-based modeling (ABM) is a powerful tool for understanding self-organizing biological systems, but it is computationally intensive and often not analytically tractable. Equation learning (EQL) methods can derive continuum models from ABM data, but they typically require extensive simulations for each parameter set, raising concerns about generalizability. In this work, we extend EQL to Multi-experiment equation learning (ME-EQL) by introducing two methods: one-at-a-time ME-EQL (OAT ME-EQL), which learns individual models for each parameter set and connects them via interpolation, and embedded structure ME-EQL (ES ME-EQL), which builds a unified model library across parameters. We demonstrate these methods using a birth–death mean-field model and an on-lattice agent-based model of birth, death, and migration with spatial structure. Our results show that both methods significantly reduce the relative error in recovering parameters from agent-based simulations, with OAT ME-EQL offering better generalizability across parameter space. Our findings highlight the potential of equation learning from multiple experiments to enhance the generalizability and interpretability of learned models for complex biological systems.
[LG-10] Implementing Keyword Spotting on the MCXN947 Microcontroller with Integrated NPU
Link: https://arxiv.org/abs/2506.08911
Authors: Petar Jakuš,Hrvoje Džapo
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
Notes: 4 pages
Click to view abstract
Abstract:This paper presents a keyword spotting (KWS) system implemented on the NXP MCXN947 microcontroller with an integrated Neural Processing Unit (NPU), enabling real-time voice interaction on resource-constrained devices. The system combines MFCC feature extraction with a CNN classifier, optimized using Quantization Aware Training to reduce model size with minimal accuracy drop. Experimental results demonstrate a 59x speedup in inference time when leveraging the NPU compared to CPU-only execution, achieving 97.06% accuracy with a model size of 30.58 KB, demonstrating the feasibility of efficient, low-power voice interfaces on embedded platforms.
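A minimal PyTorch sketch of the MFCC-plus-CNN pipeline described above; the MCU deployment, NPU offload, and quantization-aware training steps are not shown, and the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
import torchaudio

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16_000, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)

class KWSNet(nn.Module):
    def __init__(self, n_keywords: int = 10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_keywords)

    def forward(self, wave):                  # wave: (batch, samples)
        feats = mfcc(wave).unsqueeze(1)       # (batch, 1, n_mfcc, frames)
        return self.fc(self.conv(feats).flatten(1))

model = KWSNet()
logits = model(torch.randn(2, 16_000))        # two 1-second clips at 16 kHz
print(logits.shape)                           # (2, 10)
```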
[LG-11] InfoDPCCA: Information-Theoretic Dynamic Probabilistic Canonical Correlation Analysis UAI-25
Link: https://arxiv.org/abs/2506.08884
Authors: Shiqin Tang,Shujian Yu
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Notes: accepted by UAI-25; code is available at this https URL
Click to view abstract
Abstract:Extracting meaningful latent representations from high-dimensional sequential data is a crucial challenge in machine learning, with applications spanning natural science and engineering. We introduce InfoDPCCA, a dynamic probabilistic Canonical Correlation Analysis (CCA) framework designed to model two interdependent sequences of observations. InfoDPCCA leverages a novel information-theoretic objective to extract a shared latent representation that captures the mutual structure between the data streams and balances representation compression and predictive sufficiency while also learning separate latent components that encode information specific to each sequence. Unlike prior dynamic CCA models, such as DPCCA, our approach explicitly enforces the shared latent space to encode only the mutual information between the sequences, improving interpretability and robustness. We further introduce a two-step training scheme to bridge the gap between information-theoretic representation learning and generative modeling, along with a residual connection mechanism to enhance training stability. Through experiments on synthetic and medical fMRI data, we demonstrate that InfoDPCCA excels as a tool for representation learning. Code of InfoDPCCA is available at this https URL.
[LG-12] Filling in the Blanks: Applying Data Imputation in incomplete Water Metering Data
Link: https://arxiv.org/abs/2506.08882
Authors: Dimitrios Amaxilatis,Themistoklis Sarantakos,Ioannis Chatzigiannakis,Georgios Mylonas
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:In this work, we explore the application of recent data imputation techniques to enhance the monitoring and management of water distribution networks using smart water meters, based on data from a real-world IoT water grid monitoring deployment. Despite the detailed data produced by such meters, data gaps due to technical issues can significantly impact operational decisions and efficiency. Comparing various imputation methods, such as k-Nearest Neighbors, MissForest, Transformers, and Recurrent Neural Networks, our results indicate that effective data imputation can substantially enhance the quality of the insights derived from water consumption data; we study its effect on the accuracy and reliability of water metering data to support applications such as leak detection and predictive maintenance scheduling.
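For the k-Nearest Neighbors baseline among the methods compared, a quick sketch with scikit-learn's KNNImputer on a synthetic meters-by-hours matrix looks like this (the data and the 20% missingness are assumptions):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
consumption = rng.gamma(shape=2.0, scale=5.0, size=(50, 24 * 7))  # 50 meters, 1 week
mask = rng.random(consumption.shape) < 0.2          # 20% of readings lost
observed = np.where(mask, np.nan, consumption)

imputer = KNNImputer(n_neighbors=5)                 # neighbors = similar meters
filled = imputer.fit_transform(observed)

mae = np.abs(filled[mask] - consumption[mask]).mean()
print(f"imputation MAE on held-out readings: {mae:.2f}")
```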
[LG-13] Adapting to Heterophilic Graph Data with Structure-Guided Neighbor Discovery
Link: https://arxiv.org/abs/2506.08871
Authors: Victor M. Tenorio,Madeline Navarro,Samuel Rey,Santiago Segarra,Antonio G. Marques
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes:
Click to view abstract
Abstract:Graph Neural Networks (GNNs) often struggle with heterophilic data, where connected nodes may have dissimilar labels, as they typically assume homophily and rely on local message passing. To address this, we propose creating alternative graph structures by linking nodes with similar structural attributes (e.g., role-based or global), thereby fostering higher label homophily on these new graphs. We theoretically prove that GNN performance can be improved by utilizing graphs with fewer false positive edges (connections between nodes of different classes) and that considering multiple graph views increases the likelihood of finding such beneficial structures. Building on these insights, we introduce Structure-Guided GNN (SG-GNN), an architecture that processes the original graph alongside the newly created structural graphs, adaptively learning to weigh their contributions. Extensive experiments on various benchmark datasets, particularly those with heterophilic characteristics, demonstrate that our SG-GNN achieves state-of-the-art or highly competitive performance, highlighting the efficacy of exploiting structural information to guide GNNs.
[LG-14] Agile Reinforcement Learning for Real-Time Task Scheduling in Edge Computing
Link: https://arxiv.org/abs/2506.08850
Authors: Amin Avan,Akramul Azim,Qusay Mahmoud
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Soft real-time applications are becoming increasingly complex, posing significant challenges for scheduling offloaded tasks in edge computing environments while meeting task timing constraints. Moreover, the exponential growth of the search space, presence of multiple objectives and parameters, and highly dynamic nature of edge computing environments further exacerbate the complexity of task scheduling. As a result, schedulers based on heuristic and metaheuristic algorithms frequently encounter difficulties in generating optimal or near-optimal task schedules due to their constrained ability to adapt to the dynamic conditions and complex environmental characteristics of edge computing. Accordingly, reinforcement learning algorithms have been incorporated into schedulers to address the complexity and dynamic conditions inherent in task scheduling in edge computing. However, a significant limitation of reinforcement learning algorithms is the prolonged learning time required to adapt to new environments and to address medium- and large-scale problems. This challenge arises from the extensive global action space and frequent random exploration of irrelevant actions. Therefore, this study proposes Agile Reinforcement learning (aRL), in which the RL-agent performs informed exploration and executes only relevant actions. Consequently, the predictability of the RL-agent is enhanced, leading to rapid adaptation and convergence, which positions aRL as a suitable candidate for scheduling the tasks of soft real-time applications in edge computing. The experiments demonstrate that the combination of informed exploration and action-masking methods enables aRL to achieve a higher hit-ratio and converge faster than the baseline approaches.
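The action-masking half of the idea reduces to sampling and maximizing only over a per-state set of relevant actions; the relevance mask and Q-table below are simplified stand-ins, not aRL itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 20, 50
Q = np.zeros((n_states, n_actions))
# Hypothetical relevance mask, e.g. "only nodes whose queue can meet the deadline".
mask = rng.random((n_states, n_actions)) < 0.2
mask[:, 0] = True                          # ensure at least one valid action per state

def epsilon_greedy_masked(state: int, eps: float = 0.1) -> int:
    valid = np.flatnonzero(mask[state])
    if rng.random() < eps:                 # informed exploration: valid actions only
        return int(rng.choice(valid))
    q = np.where(mask[state], Q[state], -np.inf)   # masked greedy step
    return int(np.argmax(q))

a = epsilon_greedy_masked(state=3)
print(a, mask[3, a])                       # the chosen action is always a relevant one
```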
[LG-15] IMAGIC-500: IMputation benchmark on A Generative Imaginary Country (500k samples)
Link: https://arxiv.org/abs/2506.08844
Authors: Siyi Sun,David Antony Selby,Yunchuan Huang,Sebastian Vollmer,Seth Flaxman,Anisoara Calinescu
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Notes:
Click to view abstract
Abstract:Missing data imputation in tabular datasets remains a pivotal challenge in data science and machine learning, particularly within socioeconomic research. However, real-world socioeconomic datasets are typically subject to strict data protection protocols, which often prohibit public sharing, even for synthetic derivatives. This severely limits the reproducibility and accessibility of benchmark studies in such settings. Further, there are very few publicly available synthetic datasets. Thus, there is limited availability of benchmarks for systematic evaluation of imputation methods on socioeconomic datasets, whether real or synthetic. In this study, we utilize the World Bank’s publicly available synthetic dataset, Synthetic Data for an Imaginary Country, which closely mimics a real World Bank household survey while being fully public, enabling broad access for methodological research. With this as a starting point, we derived the IMAGIC-500 dataset: we select a subset of 500k individuals across approximately 100k households with 19 socioeconomic features, designed to reflect the hierarchical structure of real-world household surveys. This paper introduces a comprehensive missing data imputation benchmark on IMAGIC-500 under various missing mechanisms (MCAR, MAR, MNAR) and missingness ratios (10%, 20%, 30%, 40%, 50%). Our evaluation considers the imputation accuracy for continuous and categorical variables, computational efficiency, and impact on downstream predictive tasks, such as estimating educational attainment at the individual level. The results highlight the strengths and weaknesses of statistical, traditional machine learning, and deep learning imputation techniques, including recent diffusion-based methods. The IMAGIC-500 dataset and benchmark aim to facilitate the development of robust imputation algorithms and foster reproducible social science research.
[LG-16] Design Patterns for Securing LLM Agents against Prompt Injections
Link: https://arxiv.org/abs/2506.08837
Authors: Luca Beurer-Kellner,Beat Buesser,Ana-Maria Creţu,Edoardo Debenedetti,Daniel Dobos,Daniel Fabian,Marc Fischer,David Froelicher,Kathrin Grosse,Daniel Naeff,Ezinwanne Ozoani,Andrew Paverd,Florian Tramèr,Václav Volhejn
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:As AI agents powered by Large Language Models (LLMs) become increasingly versatile and capable of addressing a broad spectrum of tasks, ensuring their security has become a critical challenge. Among the most pressing threats are prompt injection attacks, which exploit the agent’s reliance on natural language inputs – an especially dangerous threat when agents are granted tool access or handle sensitive information. In this work, we propose a set of principled design patterns for building AI agents with provable resistance to prompt injection. We systematically analyze these patterns, discuss their trade-offs in terms of utility and security, and illustrate their real-world applicability through a series of case studies.
[LG-17] On the Stability of the Jacobian Matrix in Deep Neural Networks
Link: https://arxiv.org/abs/2506.08764
Authors: Benjamin Dadoun,Soufiane Hayou,Hanan Salam,Mohamed El Amine Seddik,Pierre Youssef
Categories: Machine Learning (cs.LG)
Notes: 16 pages, 26 figures
Click to view abstract
Abstract:Deep neural networks are known to suffer from exploding or vanishing gradients as depth increases, a phenomenon closely tied to the spectral behavior of the input-output Jacobian. Prior work has identified critical initialization schemes that ensure Jacobian stability, but these analyses are typically restricted to fully connected networks with i.i.d. weights. In this work, we go significantly beyond these limitations: we establish a general stability theorem for deep neural networks that accommodates sparsity (such as that introduced by pruning) and non-i.i.d., weakly correlated weights (e.g. induced by training). Our results rely on recent advances in random matrix theory, and provide rigorous guarantees for spectral stability in a much broader class of network models. This extends the theoretical foundation for initialization schemes in modern neural networks with structured and dependent randomness.
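The quantity at stake, the spectral behavior of the input-output Jacobian, can be checked numerically; the sketch below builds a randomly pruned MLP and measures its largest Jacobian singular value at a random input (depth, width, sparsity level, and the rescaling are arbitrary choices, not the paper's setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
width, depth, sparsity = 128, 20, 0.5

layers = []
for _ in range(depth):
    lin = nn.Linear(width, width, bias=False)
    with torch.no_grad():                    # prune half the weights at random
        lin.weight *= (torch.rand_like(lin.weight) > sparsity).float()
        lin.weight /= (1 - sparsity) ** 0.5  # rescale to compensate for pruning
    layers += [lin, nn.Tanh()]
net = nn.Sequential(*layers)

x = torch.randn(width)
J = torch.autograd.functional.jacobian(net, x)   # (width, width) input-output Jacobian
sigma_max = torch.linalg.svdvals(J)[0]
print(f"largest Jacobian singular value: {sigma_max:.3f}")  # O(1) indicates stability
```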
[LG-18] Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports
Link: https://arxiv.org/abs/2506.08740
Authors: Sidhika Balachandar,Shuvom Sadhuka,Bonnie Berger,Emma Pierson,Nikhil Garg
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, such as predicting infrastructure problems. In this setting, government officials wish to know in which neighborhoods incidents like potholes or rodent issues occur. The true state of incidents (e.g., street conditions) for each neighborhood is observed via government inspection ratings. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced reports, which are more densely observed but may be biased due to heterogeneous reporting behavior. First, for such settings, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect, standardize, and make publicly available a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. Finally, we show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or models that use only rating data, especially when rating data is sparse and reports are predictive of ratings. We also quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.
[LG-19] Stop Misusing t-SNE and UMAP for Visual Analytics
Link: https://arxiv.org/abs/2506.08725
Authors: Hyeon Jeon,Jeongin Park,Sungbok Shin,Jinwook Seo
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes: 9 pages
Click to view abstract
Abstract:Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect true distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. In this paper, we bring this issue to the surface and comprehensively investigate why such misuse occurs and how to prevent it. We conduct a literature review of 114 papers to verify the prevalence of the misuse and to analyze the reasoning behind it. We then execute an interview study to uncover practitioners’ implicit motivations for using these techniques – rationales often undisclosed in the literature. Our findings indicate that misuse of t-SNE and UMAP primarily stems from limited discourse on their appropriate use in visual analytics. We conclude by proposing future directions and concrete action items to promote more reasonable use of dimensionality reduction (DR).
[LG-20] Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling
Link: https://arxiv.org/abs/2506.08681
Authors: Phuc Minh Nguyen,Ngoc-Hieu Nguyen,Duy H. M. Nguyen,Anji Liu,An Mai,Binh T. Nguyen,Daniel Sonntag,Khoa D. Doan
Categories: Machine Learning (cs.LG)
Notes: First version
Click to view abstract
Abstract:Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs. This approach, called IS-DAAs, multiplies the DAA objective by an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high-variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem. Our implementations are provided publicly at this link.
[LG-21] Towards Fair Representation: Clustering and Consensus COLT
Link: https://arxiv.org/abs/2506.08673
Authors: Diptarka Chakraborty,Kushagra Chatterjee,Debarati Das,Tien Long Nguyen,Romina Nobahari
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Notes: The paper has been accepted at the Conference on Learning Theory (COLT) 2025
Click to view abstract
Abstract:Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS’17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness – an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear time constant factor approximation algorithms for more general scenarios with different proportions of two group sizes. We complement our approximation result by showing that the problem is NP-hard for two unequal-sized groups. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.
[LG-22] sparseGeoHOPCA: A Geometric Solution to Sparse Higher-Order PCA Without Covariance Estimation
Link: https://arxiv.org/abs/2506.08670
Authors: Renjie Xu,Chong Wu,Maolin Che,Zhuoheng Ran,Yimin Wei,Hong Yan
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes:
Click to view abstract
Abstract:We propose sparseGeoHOPCA, a novel framework for sparse higher-order principal component analysis (SHOPCA) that introduces a geometric perspective to high-dimensional tensor decomposition. By unfolding the input tensor along each mode and reformulating the resulting subproblems as structured binary linear optimization problems, our method transforms the original nonconvex sparse objective into a tractable geometric form. This eliminates the need for explicit covariance estimation and iterative deflation, enabling significant gains in both computational efficiency and interpretability, particularly in high-dimensional and unbalanced data scenarios. We theoretically establish the equivalence between the geometric subproblems and the original SHOPCA formulation, and derive worst-case approximation error bounds based on classical PCA residuals, providing data-dependent performance guarantees. The proposed algorithm achieves a total computational complexity of O\left(\sum_{n=1}^{N} \left(k_n^3 + J_n k_n^2\right)\right) , which scales linearly with tensor size. Extensive experiments demonstrate that sparseGeoHOPCA accurately recovers sparse supports in synthetic settings, preserves classification performance under 10\times compression, and achieves high-quality image reconstruction on ImageNet, highlighting its robustness and versatility.
[LG-23] When Simple Model Just Works: Is Network Traffic Classification in Crisis?
Link: https://arxiv.org/abs/2506.08655
Authors: Kamil Jerabek,Jan Luxemburk,Richard Plny,Josef Koumar,Jaroslav Pesek,Karel Hynek
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Notes:
Click to view abstract
Abstract:Machine learning has been applied to network traffic classification (TC) for over two decades. While early efforts used shallow models, the latter 2010s saw a shift toward complex neural networks, often reporting near-perfect accuracy. However, it was recently revealed that a simple k-NN baseline using packet-sequence metadata (sizes, times, and directions) can be on par with or even outperform more complex methods. In this paper, we investigate this phenomenon further, evaluate this baseline across 12 datasets and 15 TC tasks, and examine why it performs so well. Our analysis shows that most datasets contain over 50% redundant samples (identical packet sequences), which frequently appear in both training and test sets due to common splitting practices. This redundancy can lead to overestimated model performance and reduce the theoretical maximum accuracy when identical flows have conflicting labels. Given its distinct characteristics, we further argue that standard machine learning practices adapted from domains like NLP or computer vision may be ill-suited for TC. Finally, we propose new directions for task formulation and evaluation to address these challenges and help realign the field.
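The redundancy audit is straightforward to reproduce in spirit: key each flow by its packet-metadata sequence and count identical keys that cross the train/test split. Field names and the toy flows below are illustrative:

```python
import pandas as pd

flows = pd.DataFrame({
    "sizes": [(1500, 60, 1500), (60, 60), (1500, 60, 1500), (60, 60), (40,)],
    "dirs":  [(1, -1, 1), (1, -1), (1, -1, 1), (1, -1), (1,)],
    "label": ["video", "dns", "video", "dns", "scan"],
    "split": ["train", "train", "test", "test", "test"],
})
flows["key"] = list(zip(flows["sizes"], flows["dirs"]))   # hashable flow signature

train_keys = set(flows.loc[flows.split == "train", "key"])
test = flows[flows.split == "test"]
leaked = test["key"].isin(train_keys)
print(f"{leaked.mean():.0%} of test flows duplicate a training flow")  # 67% here
```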
[LG-24] Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach
Link: https://arxiv.org/abs/2506.08645
Authors: Youqi Wu,Jingwei Zhang,Farzan Farnia
Categories: Machine Learning (cs.LG)
Notes:
Click to view abstract
Abstract:Cross-modal embeddings, such as CLIP, BLIP and their variants, have achieved promising results in aligning representations across modalities. However, these embeddings could underperform compared to state-of-the-art single-modality embeddings on modality-specific tasks. On the other hand, single-modality embeddings excel in their domains but lack cross-modal alignment capabilities. In this work, we focus on the problem of unifying cross-modality and single-modality embeddings to achieve the performance of modality-expert embedding within individual modalities while preserving cross-modal alignment. To this end, we propose RP-KrossFuse, a method that leverages a random projection-based Kronecker product to integrate cross-modal embeddings with single-modality embeddings. RP-KrossFuse aims to fuse the sample-pairwise similarity scores of the fused embeddings and operates efficiently in a specified kernel space and supports scalable implementations via random Fourier features for shift-invariant kernels such as the Gaussian kernel. We demonstrate the effectiveness of RP-KrossFuse through several numerical experiments, combining CLIP embeddings with uni-modal image and text embeddings. Our numerical results indicate that RP-KrossFuse achieves competitive modality-specific performance while retaining cross-modal alignment, bridging the gap between cross-modal and single-modality embeddings.
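The core identity behind the construction is that an inner product of Kronecker products factorizes into a product of inner products, so a Kronecker-fused embedding multiplies the cross-modal and uni-modal similarity scores. The sketch below checks this numerically with random Fourier features for a Gaussian kernel; dimensions and embeddings are placeholders, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, dim=256, gamma=0.5, seed=1):
    """Random Fourier features approximating a Gaussian kernel."""
    r = np.random.default_rng(seed)
    W = r.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], dim))
    b = r.uniform(0, 2 * np.pi, dim)
    return np.sqrt(2 / dim) * np.cos(X @ W + b)

clip_emb = rng.normal(size=(4, 64))          # stand-in cross-modal embeddings
uni_emb = rng.normal(size=(4, 32))           # stand-in uni-modal embeddings

phi, psi = rff(clip_emb, seed=1), rff(uni_emb, seed=2)
fused = np.einsum("ni,nj->nij", phi, psi).reshape(4, -1)   # row-wise Kronecker product

lhs = fused @ fused.T                        # similarities of the fused embeddings
rhs = (phi @ phi.T) * (psi @ psi.T)          # product of the per-space similarities
print(np.allclose(lhs, rhs))                 # True: fusion multiplies the two kernels
```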
[LG-25] Semi-gradient DICE for Offline Constrained Reinforcement Learning
Link: https://arxiv.org/abs/2506.08644
Authors: Woosung Kim,JunHo Seo,Jongmin Lee,Byung-Jun Lee
Categories: Machine Learning (cs.LG)
Notes: Constrained Offline Reinforcement Learning
Click to view abstract
Abstract:Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation (OPE) and policy optimization. DICE-based offline constrained RL particularly benefits from the flexibility of DICE, as it simultaneously maximizes return while estimating costs in offline settings. However, we have observed that recent approaches designed to enhance the offline RL performance of the DICE framework inadvertently undermine its ability to perform OPE, making them unsuitable for constrained RL scenarios. In this paper, we identify the root cause of this limitation: their reliance on a semi-gradient optimization, which solves a fundamentally different optimization problem and results in failures in cost estimation. Building on these insights, we propose a novel method to enable OPE and constrained RL through semi-gradient DICE. Our method ensures accurate cost estimation and achieves state-of-the-art performance on the offline constrained RL benchmark, DSRL.
[LG-26] Sample Efficient Demonstration Selection for In-Context Learning ICML2025
链接: https://arxiv.org/abs/2506.08607
作者: Kiran Purohit,V Venktesh,Sourangshu Bhattacharya,Avishek Anand
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2025 , 24 pages
点击查看摘要
Abstract:The in-context learning paradigm with LLMs has been instrumental in advancing a wide range of natural language processing tasks. The selection of few-shot examples (exemplars / demonstration samples) is essential for constructing effective prompts under context-length budget constraints. In this paper, we formulate the exemplar selection task as a top-m best arms identification problem. A key challenge in this setup is the exponentially large number of arms that need to be evaluated to identify the m-best arms. We propose CASE (Challenger Arm Sampling for Exemplar selection), a novel sample-efficient selective exploration strategy that maintains a shortlist of “challenger” arms, which are current candidates for the top-m arms. In each iteration, only one of the arms from this shortlist or the current top-m set is pulled, thereby reducing sample complexity and, consequently, the number of LLM evaluations. Furthermore, we model the scores of exemplar subsets (arms) using a parameterized linear scoring function, leading to a stochastic linear bandits setting. CASE achieves remarkable efficiency gains of up to 7x speedup in runtime while requiring 7x fewer LLM calls (87% reduction) without sacrificing performance compared to state-of-the-art exemplar selection methods. We release our code and data at this https URL
[LG-27] CALT: A Library for Computer Algebra with Transformer
链接: https://arxiv.org/abs/2506.08600
作者: Hiroshi Kera,Shun Arakawa,Yuta Sato
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC); Commutative Algebra (math.AC)
*备注: ISSAC 2025 Short Communications
点击查看摘要
Abstract:Recent advances in artificial intelligence have demonstrated the learnability of symbolic computation through end-to-end deep learning. Given a sufficient number of examples of symbolic expressions before and after the target computation, Transformer models - highly effective learners of sequence-to-sequence functions - can be trained to emulate the computation. This development opens up several intriguing challenges and new research directions, which require active contributions from the symbolic computation community. In this work, we introduce Computer Algebra with Transformer (CALT), a user-friendly Python library designed to help non-experts in deep learning train models for symbolic computation tasks.
[LG-28] SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models
链接: https://arxiv.org/abs/2506.08574
作者: Alvise Dei Rossi,Matteo Metaldi,Michal Bechny,Irina Filchenko,Julia van der Meer,Markus H. Schmidt,Claudio L.A. Bassetti,Athina Tzovara,Francesca D. Faraci,Luigi Fiorillo
类目: Machine Learning (cs.LG)
*备注: 41 pages, 4 Figures, 7 Tables
点击查看摘要
Abstract:Despite advances in deep learning for automatic sleep staging, clinical adoption remains limited due to challenges in fair model evaluation, generalization across diverse datasets, model bias, and variability in human annotations. We present SLEEPYLAND, an open-source sleep staging evaluation framework designed to address these barriers. It includes more than 220,000 hours of in-domain (ID) sleep recordings and more than 84,000 hours of out-of-domain (OOD) sleep recordings, spanning a broad range of ages, sleep-wake disorders, and hardware setups. We release pre-trained models based on high-performing SoA architectures and evaluate them under standardized conditions across single- and multi-channel EEG/EOG configurations. We introduce SOMNUS, an ensemble combining models across architectures and channel setups via soft voting. SOMNUS achieves robust performance across twenty-four different datasets, with macro-F1 scores between 68.7% and 87.2%, outperforming individual models in 94.9% of cases. Notably, SOMNUS surpasses previous SoA methods, even in cases where the compared models were trained ID while SOMNUS treated the same data as OOD. Using a subset of the BSWR (N = 6,633), we quantify model biases linked to age, gender, AHI, and PLMI, showing that while ensembling improves robustness, no model architecture consistently minimizes bias in performance and clinical marker estimation. In evaluations on OOD multi-annotated datasets (DOD-H, DOD-O), SOMNUS exceeds the best human scorer, i.e., MF1 85.2% vs 80.8% on DOD-H, and 80.2% vs 75.9% on DOD-O, reproducing the scorer consensus better than any individual expert (k = 0.89/0.85 and ACS = 0.95/0.94 for healthy/OSA cohorts). Finally, we introduce ensemble disagreement metrics (entropy- and inter-model-divergence-based) that predict regions of scorer disagreement with ROC AUCs up to 0.828, offering a data-driven proxy for human uncertainty.
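The soft-voting step behind SOMNUS reduces to averaging per-epoch class probabilities across member models. A minimal sketch follows, with `models` and `predict_proba` as assumed interfaces rather than the SLEEPYLAND API:

```python
# Soft voting across sleep-staging models: each member emits per-epoch
# class probabilities over the sleep stages; the ensemble averages them
# and takes the argmax per epoch.
import numpy as np

def soft_vote(models, recording):
    # each call is assumed to return an array of shape (n_epochs, n_stages)
    probs = np.stack([m.predict_proba(recording) for m in models])
    return probs.mean(axis=0).argmax(axis=1)  # consensus stage per epoch
```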
[LG-29] DeepForm: Reasoning Large Language Model for Communication System Formulation
链接: https://arxiv.org/abs/2506.08551
作者: Panlong Wu,Ting Wang,Yifei Zhong,Haoqi Zhang,Zitong Wang,Fangxin Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Communication system formulation is critical for advancing 6G and future wireless technologies, yet it remains a complex, expertise-intensive task. While Large Language Models (LLMs) offer potential, existing general-purpose models often lack the specialized domain knowledge, nuanced reasoning capabilities, and access to high-quality, domain-specific training data required to adapt a general LLM into one specialized for communication system formulation. To bridge this gap, we introduce DeepForm, the first reasoning LLM specialized for automated communication system formulation. We also propose the first large-scale, open-source dataset meticulously curated for this domain, the Communication System Formulation Reasoning Corpus (CSFRC). Our framework employs a two-stage training strategy: first, Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) data to distill domain knowledge; second, a novel rule-based Reinforcement Learning (RL) algorithm, C-ReMax based on ReMax, to cultivate advanced modeling capabilities and elicit sophisticated reasoning patterns such as self-correction and verification. Extensive experiments demonstrate that our model achieves state-of-the-art performance, significantly outperforming larger proprietary LLMs in diverse scenarios. We will release related resources to foster further research in this area after the paper is accepted.
[LG-30] Structured Variational D-Decomposition for Accurate and Stable Low-Rank Approximation
链接: https://arxiv.org/abs/2506.08535
作者: Ronald Katende
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce the D-decomposition, a non-orthogonal matrix factorization of the form A \approx P D Q, where P \in \mathbb{R}^{n \times k}, D \in \mathbb{R}^{k \times k}, and Q \in \mathbb{R}^{k \times n}. The decomposition is defined variationally by minimizing a regularized Frobenius loss, allowing control over rank, sparsity, and conditioning. Unlike algebraic factorizations such as LU or SVD, it is computed by alternating minimization. We establish existence and perturbation stability of the solution and show that each update has complexity \mathcal{O}(n^2 k). Benchmarks against truncated SVD, CUR, and nonnegative matrix factorization show improved reconstruction accuracy on MovieLens, MNIST, Olivetti Faces, and gene expression matrices, particularly under sparsity and noise.
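A bare-bones sketch of the alternating minimization, keeping only the Frobenius and ridge terms of the variational objective (the paper's full loss also shapes sparsity and conditioning, which is omitted here):

```python
# Alternating ridge-regularized least squares for A ~ P D Q.
# Each block update is the closed-form minimizer with the other two
# factors fixed; the D update solves the normal equations of vec(D)
# via a Kronecker product (fine for small k).
import numpy as np

def d_decomposition(A, k, lam=1e-2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    P = rng.normal(size=(n, k))
    D = np.eye(k)
    Q = rng.normal(size=(k, n))
    I = lam * np.eye(k)
    for _ in range(iters):
        # fix D, Q: min_P ||A - P (DQ)||_F^2 + lam ||P||_F^2
        M = D @ Q
        P = A @ M.T @ np.linalg.inv(M @ M.T + I)
        # fix P, Q: normal equations for row-major vec(D)
        G = np.kron(P.T @ P, Q @ Q.T) + lam * np.eye(k * k)
        rhs = (P.T @ A @ Q.T).reshape(-1)
        D = np.linalg.solve(G, rhs).reshape(k, k)
        # fix P, D: min_Q ||A - (PD) Q||_F^2 + lam ||Q||_F^2
        N_ = P @ D
        Q = np.linalg.inv(N_.T @ N_ + I) @ N_.T @ A
    return P, D, Q
```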
[LG-31] PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production
链接: https://arxiv.org/abs/2506.08528
作者: Yu Guan,Zhiyu Yin,Haoyu Chen,Sheng Cheng,Chaojie Yang,Tianyin Xu,Yang Zhang,Hanyu Zhao,Yong Li,Dennis Cai,Ennan Zhai
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注:
点击查看摘要
Abstract:Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present PerfTracker, the first online troubleshooting system utilizing fine-grained profiling, to diagnose performance issues of large-scale model training in production. PerfTracker can diagnose performance issues rooted in both hardware (e.g., GPUs and their interconnects) and software (e.g., Python functions and GPU operations). It scales to LMT on modern GPU clusters. PerfTracker effectively summarizes runtime behavior patterns of fine-grained LMT functions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. PerfTracker has been deployed as a production service for large-scale GPU clusters of O(10,000) GPUs (product homepage this https URL). It has been used to diagnose a variety of difficult performance issues.
[LG-32] Leveraging chaos in the training of artificial neural networks
链接: https://arxiv.org/abs/2506.08523
作者: Pedro Jiménez-González,Miguel C. Soriano,Lucas Lacasa
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
点击查看摘要
Abstract:Traditional algorithms to optimize artificial neural networks when confronted with a supervised learning task are usually exploitation-type relaxational dynamics such as gradient descent (GD). Here, we explore the dynamics of the neural network trajectory along training for unconventionally large learning rates. We show that for a range of learning rate values, the GD optimization shifts away from a purely exploitative algorithm into a regime of exploration-exploitation balance: the neural network is still capable of learning, but the trajectory shows sensitive dependence on initial conditions, as characterized by a positive maximum Lyapunov exponent of the network. Interestingly, the characteristic training time required to reach an acceptable accuracy on the test set reaches a minimum precisely in this learning rate region, further suggesting that one can accelerate the training of artificial neural networks by operating at the onset of chaos. Our results, initially illustrated for the MNIST classification task, qualitatively hold for a range of supervised learning tasks, learning architectures and other hyperparameters, and showcase the emergent, constructive role of transient chaotic dynamics in the training of artificial neural networks.
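The diagnostic in the abstract, a positive maximum Lyapunov exponent along the training trajectory, can be estimated with the classic two-trajectory method. The sketch below is illustrative only: the toy 1D nonconvex loss and all hyperparameters are assumptions, not the paper's setup.

```python
# Estimate the maximum Lyapunov exponent of a GD trajectory by running
# two copies from infinitesimally separated initial weights, measuring
# the per-step growth of their separation, and renormalizing (Benettin
# style). A positive estimate indicates sensitive dependence.
import numpy as np

def grad(w):
    # gradient of a toy nonconvex 1D loss: w^2/2 + cos(5w)/5
    return w - np.sin(5 * w)

def lyapunov_estimate(w0, lr, steps=200, eps=1e-8):
    wa, wb = w0, w0 + eps
    log_growth = 0.0
    for _ in range(steps):
        wa -= lr * grad(wa)
        wb -= lr * grad(wb)
        d = max(abs(wb - wa), 1e-300)
        log_growth += np.log(d / eps)
        wb = wa + eps  # renormalize the separation
    return log_growth / steps  # > 0 suggests chaotic training dynamics

for lr in (0.05, 0.5, 1.5):
    print(lr, lyapunov_estimate(w0=1.0, lr=lr))
```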
[LG-33] NeurIPS 2024 ML4CFD Competition: Results and Retrospective Analysis
链接: https://arxiv.org/abs/2506.08516
作者: Mouadh Yagoubi,David Danan,Milad Leyli-Abadi,Ahmed Mazari,Jean-Patrick Brunet,Abbas Kabalan,Fabien Casenave,Yuxin Ma,Giovanni Catalani,Jean Fesquet,Jacob Helwig,Xuan Zhang,Haiyang Yu,Xavier Bertrand,Frederic Tost,Michael Baurheim,Joseph Morlier,Shuiwang Ji
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The integration of machine learning (ML) into the physical sciences is reshaping computational paradigms, offering the potential to accelerate demanding simulations such as computational fluid dynamics (CFD). Yet, persistent challenges in accuracy, generalization, and physical consistency hinder the practical deployment of ML models in scientific domains. To address these limitations and systematically benchmark progress, we organized the ML4CFD competition, centered on surrogate modeling for aerodynamic simulations over two-dimensional airfoils. The competition attracted over 240 teams, who were provided with a curated dataset generated via OpenFOAM and evaluated through a multi-criteria framework encompassing predictive accuracy, physical fidelity, computational efficiency, and out-of-distribution generalization. This retrospective analysis reviews the competition outcomes, highlighting several approaches that outperformed baselines under our global evaluation score. Notably, the top entry exceeded the performance of the original OpenFOAM solver on aggregate metrics, illustrating the promise of ML-based surrogates to outperform traditional solvers under tailored criteria. Drawing from these results, we analyze the key design principles of top submissions, assess the robustness of our evaluation framework, and offer guidance for future scientific ML challenges.
[LG-34] DiffGradCAM: A Universal Class Activation Map Resistant to Adversarial Training
链接: https://arxiv.org/abs/2506.08514
作者: Jacob Piland,Chris Sweet,Adam Czakja
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Class Activation Mapping (CAM) and its gradient-based variants (e.g., GradCAM) have become standard tools for explaining Convolutional Neural Network (CNN) predictions. However, these approaches typically focus on individual logits, while for neural networks using softmax, the class membership probability estimates depend only on the differences between logits, not on their absolute values. This disconnect leaves standard CAMs vulnerable to adversarial manipulation, such as passive fooling, where a model is trained to produce misleading CAMs without affecting decision performance. We introduce Salience-Hoax Activation Maps (SHAMs), an entropy-aware form of passive fooling that serves as a benchmark for CAM robustness under adversarial conditions. To address the passive fooling vulnerability, we then propose DiffGradCAM, a novel, lightweight, and contrastive approach to class activation mapping that is not susceptible to passive fooling, yet matches the output of standard CAM methods such as GradCAM in the non-adversarial case. Together, SHAM and DiffGradCAM establish a new framework for probing and improving the robustness of saliency-based explanations. We validate both contributions across multi-class tasks with few and many classes.
[LG-35] Thermodynamically Consistent Latent Dynamics Identification for Parametric Systems
链接: https://arxiv.org/abs/2506.08475
作者: Xiaolong He,Yeonjong Shin,Anthony Gruber,Sohyeon Jung,Kookjin Lee,Youngsoo Choi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:We propose an efficient thermodynamics-informed latent space dynamics identification (tLaSDI) framework for the reduced-order modeling of parametric nonlinear dynamical systems. This framework integrates autoencoders for dimensionality reduction with newly developed parametric GENERIC formalism-informed neural networks (pGFINNs), which enable efficient learning of parametric latent dynamics while preserving key thermodynamic principles such as free energy conservation and entropy generation across the parameter space. To further enhance model performance, a physics-informed active learning strategy is incorporated, leveraging a greedy, residual-based error indicator to adaptively sample informative training data, outperforming uniform sampling at equivalent computational cost. Numerical experiments on the Burgers’ equation and the 1D/1V Vlasov-Poisson equation demonstrate that the proposed method achieves up to 3,528x speed-up with 1-3% relative errors, and significant reduction in training (50-90%) and inference (57-61%) cost. Moreover, the learned latent space dynamics reveal the underlying thermodynamic behavior of the system, offering valuable insights into the physical-space dynamics.
[LG-36] AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
链接: https://arxiv.org/abs/2506.08473
作者: Shuo Yang,Qihui Zhang,Yuyang Liu,Yue Huang,Xiaojun Jia,Kunpeng Ning,Jiayu Yao,Jigang Wang,Hailiang Dai,Yibing Song,Li Yuan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction – defined by the weight difference between aligned and unaligned models – we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to this alignment are strongly linked to harmful direction perturbations, rapidly degrading safety and framing the parameter space as a narrow safety basin. Based on this insight, we propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, ensuring that fine-tuning is constrained within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at this https URL
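A simplified sketch of the anchoring idea: take the weight difference between an aligned and an unaligned checkpoint as the alignment direction, and penalize the component of the fine-tuning update orthogonal to it. The penalty form and names below are assumptions distilled from the abstract, not the exact AsFT objective.

```python
# Anchored fine-tuning sketch: the regularizer measures the energy of
# the weight update that leaves the "narrow safety basin" spanned by
# the alignment direction. `base_params` are the pre-fine-tuning
# weights; `align_dir` is a unit vector over flattened parameters
# (aligned minus unaligned checkpoint), both assumed precomputed.
import torch

def orthogonal_penalty(model, base_params, align_dir):
    delta = torch.cat([
        (p - p0).reshape(-1) for p, p0 in zip(model.parameters(), base_params)
    ])
    parallel = (delta @ align_dir) * align_dir
    return ((delta - parallel) ** 2).sum()  # update energy outside the basin

def training_step(model, base_params, align_dir, task_loss, lam=0.1):
    loss = task_loss + lam * orthogonal_penalty(model, base_params, align_dir)
    loss.backward()
    return loss
```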
[LG-37] MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature
链接: https://arxiv.org/abs/2506.08464
作者: Hyunseok Seung,Jaewoo Lee,Hyunsuk Ko
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. However, it comes at the expense of high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations on their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage. Code is available at this https URL.
[LG-38] Learning to Lead: Incentivizing Strategic Agents in the Dark
链接: https://arxiv.org/abs/2506.08438
作者: Yuchen Wu,Xinyi Zhong,Zhuoran Yang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注: 81 pages, 7 figures
点击查看摘要
Abstract:We study an online learning version of the generalized principal-agent model, where a principal interacts repeatedly with a strategic agent possessing private types, private rewards, and taking unobservable actions. The agent is non-myopic, optimizing a discounted sum of future rewards and may strategically misreport types to manipulate the principal’s learning. The principal, observing only her own realized rewards and the agent’s reported types, aims to learn an optimal coordination mechanism that minimizes strategic regret. We develop the first provably sample-efficient algorithm for this challenging setting. Our approach features a novel pipeline that combines (i) a delaying mechanism to incentivize approximately myopic agent behavior, (ii) an innovative reward angle estimation framework that uses sector tests and a matching procedure to recover type-dependent reward functions, and (iii) a pessimistic-optimistic LinUCB algorithm that enables the principal to explore efficiently while respecting the agent’s incentive constraints. We establish a near-optimal \tilde{O}(\sqrt{T}) regret bound for learning the principal’s optimal policy, where \tilde{O}(\cdot) omits logarithmic factors. Our results open up new avenues for designing robust online learning algorithms for a wide range of game-theoretic settings involving private types and strategic agents.
[LG-39] Online Learning-guided Learning Rate Adaptation via Gradient Alignment
链接: https://arxiv.org/abs/2506.08419
作者: Ruichen Jiang,Ali Kavis,Aryan Mokhtari
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 24 pages, 5 figures
点击查看摘要
Abstract:The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate, often requiring an extensive grid search over base learning rates, schedules, and other hyperparameters. In this paper, we propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which dynamically adjusts the learning rate by tracking the alignment between consecutive gradients and using a local curvature estimate. Guided by the convergence analysis, we formulate the problem of selecting the learning rate as a one-dimensional online learning problem. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule that tends to increase when consecutive gradients are aligned and decrease otherwise. We establish a data-adaptive convergence rate for normalized SGD equipped with GALA in the smooth, nonconvex setting. Empirically, common optimizers such as SGD and Adam, when augmented with GALA, demonstrate robust performance across a wide range of initial learning rates and perform competitively without the need for tuning.
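As a deliberately simplified stand-in for GALA's online-learning-derived update, the sketch below grows the learning rate when consecutive gradients align and shrinks it otherwise; the multiplicative rule and `beta` are illustrative assumptions, not the paper's FTRL-based scheme.

```python
# SGD with a gradient-alignment learning-rate heuristic: the cosine
# between consecutive gradients drives a multiplicative update of lr.
import numpy as np

def sgd_with_alignment_lr(grad_fn, w, lr=0.1, beta=0.2, steps=100):
    g_prev = grad_fn(w)
    for _ in range(steps):
        w = w - lr * g_prev
        g = grad_fn(w)
        cos = g @ g_prev / (np.linalg.norm(g) * np.linalg.norm(g_prev) + 1e-12)
        lr *= np.exp(beta * cos)  # aligned -> increase, opposed -> decrease
        g_prev = g
    return w, lr

# usage on a toy quadratic: grad_fn = lambda w: 2 * w
w, lr = sgd_with_alignment_lr(lambda w: 2 * w, w=np.array([5.0, -3.0]))
print(w, lr)
```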
[LG-40] Improved Scaling Laws in Linear Regression via Data Reuse
链接: https://arxiv.org/abs/2506.08415
作者: Licong Lin,Jingfeng Wu,Peter L. Bartlett
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling can be unsustainable when running out of new data. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on M-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on N data with sketched features. Assuming that the data covariance has a power-law spectrum of degree a, and that the true parameter follows a prior with an aligned power-law spectrum of degree b-a (with a > b > 1), we show that multi-pass SGD achieves a test error of \Theta(M^{1-b} + L^{(1-b)/a}), where L \lesssim N^{a/b} is the number of iterations. In the same setting, one-pass SGD only attains a test error of \Theta(M^{1-b} + N^{(1-b)/a}) (see e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing L > N) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.
[LG-41] Learning to Hear Broken Motors: Signature-Guided Data Augmentation for Induction-Motor Diagnostics
链接: https://arxiv.org/abs/2506.08412
作者: Saraa Ali,Aleksandr Khizhik,Stepan Svirin,Artem Ryzhikov,Denis Derkach
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The application of machine learning (ML) algorithms in the intelligent diagnosis of three-phase motors has the potential to significantly enhance diagnostic performance and accuracy. Traditional methods largely rely on signature analysis, which, despite being a standard practice, can benefit from the integration of advanced ML techniques. In our study, we innovate by combining ML algorithms with a novel unsupervised anomaly generation methodology that takes into account the physics model of the motor. We propose Signature-Guided Data Augmentation (SGDA), an unsupervised framework that synthesizes physically plausible faults directly in the frequency domain of healthy current signals. Guided by Motor Current Signature Analysis, SGDA creates diverse and realistic anomalies without resorting to computationally intensive simulations. This hybrid approach leverages the strengths of both supervised ML and unsupervised signature analysis, achieving superior diagnostic accuracy and reliability together with broad industrial applicability. The findings highlight the potential of our approach to contribute significantly to the field of motor diagnostics, offering a robust and efficient solution for real-world applications.
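A schematic version of the frequency-domain augmentation: take a healthy stator-current window, inject sidebands at fault-characteristic frequencies around the supply frequency (as in Motor Current Signature Analysis for broken rotor bars), and transform back. The sideband amplitudes and randomization scheme here are placeholders, not the paper's exact recipe.

```python
# Frequency-domain fault injection sketch. Broken-rotor-bar faults show
# up in MCSA as sidebands at (1 +/- 2*k*slip) * f_supply; we add
# randomized complex peaks at those bins and inverse-transform.
import numpy as np

def inject_fault(signal, fs, f_supply=50.0, slip=0.03, amp=0.05, seed=0):
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    for k in (1, 2):
        for f in ((1 - 2 * k * slip) * f_supply, (1 + 2 * k * slip) * f_supply):
            idx = np.argmin(np.abs(freqs - f))
            scale = amp * (0.5 + rng.random())  # randomized fault severity
            phase = np.exp(1j * rng.uniform(0, 2 * np.pi))
            spec[idx] += scale * np.abs(spec).max() * phase
    return np.fft.irfft(spec, n=len(signal))
```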
[LG-42] FUSE: Measure-Theoretic Compact Fuzzy Set Representation for Taxonomy Expansion
链接: https://arxiv.org/abs/2506.08409
作者: Fred Xu,Song Jiang,Zijie Huang,Xiao Luo,Shichang Zhang,Adrian Chen,Yizhou Sun
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Taxonomy Expansion, which models complex concepts and their relations, can be formulated as a set representation learning task. Fuzzy sets, a generalization of classical sets, incorporate uncertainty and measure the information within a semantic concept, making them suitable for concept modeling. Existing works usually model sets as vectors or geometric objects such as boxes, which are not closed under set operations. In this work, we propose a sound and efficient formulation of set representation learning based on its volume approximation as a fuzzy set. The resulting embedding framework, Fuzzy Set Embedding (FUSE), satisfies all set operations and compactly approximates the underlying fuzzy set, hence preserving information while being efficient to learn and relying on a minimal neural architecture. We empirically demonstrate the power of FUSE on the task of taxonomy expansion, where FUSE achieves remarkable improvements of up to 23% compared with existing baselines. Our work marks the first attempt to understand and efficiently compute the embeddings of fuzzy sets.
[LG-43] Network Threat Detection: Addressing Class Imbalanced Data with Deep Forest
链接: https://arxiv.org/abs/2506.08383
作者: Jiaqi Chen,Rongbin Ye
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:With the rapid expansion of Internet of Things (IoT) networks, detecting malicious traffic in real-time has become a critical cybersecurity challenge. This research addresses the detection challenges by presenting a comprehensive empirical analysis of machine learning techniques for malware detection using the IoT-23 dataset provided by the Stratosphere Laboratory. We address the significant class imbalance within the dataset through three resampling strategies. We implement and compare a few machine learning techniques. Our findings demonstrate that the combination of appropriate imbalance treatment techniques with ensemble methods, particularly gcForest, achieves better detection performance compared to traditional approaches. This work contributes significantly to the development of more intelligent and efficient automated threat detection systems for IoT environments, helping to secure critical infrastructure against sophisticated cyber attacks while optimizing computational resource usage.
[LG-44] AlphaFold Database Debiasing for Robust Inverse Folding
链接: https://arxiv.org/abs/2506.08365
作者: Cheng Tan,Zhenxiao Cao,Zhangyang Gao,Siyuan Li,Yufei Huang,Stan Z. Li
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: Under review
点击查看摘要
Abstract:The AlphaFold Protein Structure Database (AFDB) offers unparalleled structural coverage at near-experimental accuracy, positioning it as a valuable resource for data-driven protein design. However, its direct use in training deep models that are sensitive to fine-grained atomic geometry, such as inverse folding, exposes a critical limitation. Comparative analysis of structural feature distributions reveals that AFDB structures exhibit distinct statistical regularities, reflecting a systematic geometric bias that deviates from the conformational diversity found in experimentally determined structures from the Protein Data Bank (PDB). While AFDB structures are cleaner and more idealized, PDB structures capture the intrinsic variability and physical realism essential for generalization in downstream tasks. To address this discrepancy, we introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native-like conformations from intentionally corrupted backbone geometries. By training the model to recover plausible structural states, DeSAE implicitly captures a more robust and natural structural manifold. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance across multiple benchmarks. This work highlights the critical impact of subtle systematic biases in predicted structures and presents a principled framework for debiasing, significantly boosting the performance of structure-based learning tasks like inverse folding.
[LG-45] NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation
链接: https://arxiv.org/abs/2506.08360
作者: Hyunseok Seung,Jaewoo Lee,Hyunsuk Ko
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Adaptive gradient methods are computationally efficient and converge quickly, but they often suffer from poor generalization. In contrast, second-order methods enhance convergence and generalization but typically incur high computational and memory costs. In this work, we introduce NysAct, a scalable first-order gradient preconditioning method that strikes a balance between state-of-the-art first-order and second-order optimization methods. NysAct leverages an eigenvalue-shifted Nystrom method to approximate the activation covariance matrix, which is used as a preconditioning matrix, significantly reducing time and memory complexities with minimal impact on test accuracy. Our experiments show that NysAct not only achieves improved test accuracy compared to both first-order and second-order methods but also demands considerably less computational resources than existing second-order methods. Code is available at this https URL.
[LG-46] Differentially Private Relational Learning with Entity-level Privacy Guarantees
链接: https://arxiv.org/abs/2506.08347
作者: Yinan Huang,Haoteng Ying,Eli Chien,Rongzhe Wei,Pan Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach. Our code is available at this https URL.
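The frequency-adaptive clipping idea can be sketched as follows: entities that occur in many relations get a smaller per-occurrence clipping threshold so their total sensitivity stays bounded. The 1/sqrt(count) schedule below is an illustrative choice, not necessarily the paper's exact rule.

```python
# Entity-frequency-adaptive gradient clipping sketch (DP-SGD style).
# `per_example_grads` is a list of flattened per-example gradients and
# `entity_ids` a LongTensor mapping each example to an entity.
import torch

def adaptive_clip(per_example_grads, entity_ids, base_clip=1.0):
    counts = torch.bincount(entity_ids).float()
    clipped = []
    for g, e in zip(per_example_grads, entity_ids):
        c = base_clip / counts[e].sqrt()      # rarer entity -> larger budget
        norm = g.norm() + 1e-12
        clipped.append(g * (c / norm).clamp(max=1.0))
    # calibrated noise would be added to this sum before the update
    return torch.stack(clipped).sum(dim=0)
```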
[LG-47] Dynamical System Optimization
链接: https://arxiv.org/abs/2506.08340
作者: Emo Todorov
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We develop an optimization framework centered around a core idea: once a (parametric) policy is specified, control authority is transferred to the policy, resulting in an autonomous dynamical system. Thus we should be able to optimize policy parameters without further reference to controls or actions, and without directly using the machinery of approximate Dynamic Programming and Reinforcement Learning. Here we derive simpler algorithms at the autonomous system level, and show that they compute the same quantities as policy gradients and Hessians, natural gradients, proximal methods. Analogs to approximate policy iteration and off-policy learning are also available. Since policy parameters and other system parameters are treated uniformly, the same algorithms apply to behavioral cloning, mechanism design, system identification, learning of state estimators. Tuning of generative AI models is not only possible, but is conceptually closer to the present framework than to Reinforcement Learning.
[LG-48] A Simple Analysis of Discretization Error in Diffusion Models
链接: https://arxiv.org/abs/2506.08337
作者: Juhyeok Choi,Chenglin Fan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Diffusion models, formulated as discretizations of stochastic differential equations (SDEs), achieve state-of-the-art generative performance. However, existing analyses of their discretization error often rely on complex probabilistic tools. In this work, we present a simplified theoretical framework for analyzing the Euler–Maruyama discretization of variance-preserving SDEs (VP-SDEs) in Denoising Diffusion Probabilistic Models (DDPMs), where T denotes the number of denoising steps in the diffusion process. Our approach leverages Grönwall’s inequality to derive a convergence rate of \mathcal{O}(1/T^{1/2}) under Lipschitz assumptions, significantly streamlining prior proofs. Furthermore, we demonstrate that the Gaussian noise in the discretization can be replaced by a discrete random variable (e.g., Rademacher or uniform noise) without sacrificing convergence guarantees, an insight with practical implications for efficient sampling. Experiments validate our theory, showing that (1) the error scales as predicted, (2) discrete noise achieves comparable sample quality to Gaussian noise, and (3) incorrect noise scaling degrades performance. By unifying simplified analysis and discrete noise substitution, our work bridges theoretical rigor with practical efficiency in diffusion-based generative modeling.
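The noise-substitution claim is easy to check numerically on a one-dimensional VP-SDE forward process: run Euler–Maruyama with Gaussian noise and with variance-matched Rademacher noise and compare terminal moments. This is a toy forward-process experiment, not a trained DDPM.

```python
# Euler-Maruyama for dx = -0.5*beta*x dt + sqrt(beta) dW on t in [0, 1],
# comparing Gaussian vs Rademacher (+/-1, unit variance) driving noise.
import numpy as np

rng = np.random.default_rng(0)

def em_paths(x0, T=1000, beta=2.0, noise="gauss", n_paths=10_000):
    dt = 1.0 / T
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(T):
        if noise == "gauss":
            xi = rng.normal(size=x.shape)
        else:  # Rademacher: +/-1 with equal probability
            xi = rng.integers(0, 2, size=x.shape) * 2.0 - 1.0
        x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * xi
    return x

for noise in ("gauss", "rademacher"):
    x = em_paths(2.0, noise=noise)
    # analytic terminal moments: mean = 2*exp(-1) ~ 0.74, var = 1-exp(-2) ~ 0.86
    print(noise, x.mean().round(3), x.var().round(3))
```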
[LG-49] Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion
链接: https://arxiv.org/abs/2506.08316
作者: Alan N. Amin,Nate Gruver,Andrew Gordon Wilson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code available at: this https URL
点击查看摘要
Abstract:Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models - schedule-conditioned discrete diffusion (SCUD) - generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking.
[LG-50] Private Evolution Converges
链接: https://arxiv.org/abs/2506.08312
作者: Tomás González,Giulia Fanti,Aaditya Ramdas
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Probability (math.PR); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:Private Evolution (PE) is a promising training-free method for differentially private (DP) synthetic data generation. While it achieves strong performance in some domains (e.g., images and text), its behavior in others (e.g., tabular data) is less consistent. To date, the only theoretical analysis of the convergence of PE depends on unrealistic assumptions about both the algorithm’s behavior and the structure of the sensitive dataset. In this work, we develop a new theoretical framework to explain PE’s practical behavior and identify sufficient conditions for its convergence. For d-dimensional sensitive datasets with n data points from a bounded domain, we prove that PE produces an (\epsilon, \delta)-DP synthetic dataset with expected 1-Wasserstein distance of order \tilde{O}(d(n\epsilon)^{-1/d}) from the original, establishing worst-case convergence of the algorithm as n \to \infty. Our analysis extends to general Banach spaces as well. We also connect PE to the Private Signed Measure Mechanism, a method for DP synthetic data generation that has thus far not seen much practical adoption. We demonstrate the practical relevance of our theoretical findings in simulations.
[LG-51] H2GFM: Towards unifying Homogeneity and Heterogeneity on Text-Attributed Graphs
链接: https://arxiv.org/abs/2506.08298
作者: Trung-Kien Nguyen,Heng Ping,Shixuan Li,Peiyu Zhang,Nikos Kanakaris,Nicholas Kotov,Paul Bogdan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The growing interest in and applications of graph learning across diverse domains have propelled the development of a unified model that generalizes well across different graphs and tasks, known as the Graph Foundation Model (GFM). Existing research has leveraged text-attributed graphs (TAGs) to tackle the heterogeneity in node features among graphs. However, it primarily focuses on homogeneous TAGs (HoTAGs), leaving heterogeneous TAGs (HeTAGs), where multiple types of nodes/edges reside, underexplored. To enhance the capabilities and applications of GFMs, we introduce H^2GFM, a novel framework designed to generalize across both HoTAGs and HeTAGs. Our model projects diverse meta-relations among graphs into a unified textual space, and employs a context encoding to capture spatial and higher-order semantic relationships. To achieve robust node representations, we propose a novel context-adaptive graph transformer (CGT), effectively capturing information from both context neighbors and their relationships. Furthermore, we employ a mixture of CGT experts to capture the heterogeneity in structural patterns among graph types. Comprehensive experiments on a wide range of HoTAGs and HeTAGs as well as learning scenarios demonstrate the effectiveness of our model.
[LG-52] LEANN: A Low-Storage Vector Index
链接: https://arxiv.org/abs/2506.08276
作者: Yichuan Wang,Shu Liu,Zhifei Li,Yongji Wu,Ziming Mao,Yilong Zhao,Xiao Yan,Zhiying Xu,Yang Zhou,Ion Stoica,Sewon Min,Matei Zaharia,Joseph E. Gonzalez
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Embedding-based search is widely used in applications such as recommendation and retrieval-augmented generation (RAG). Recently, there is a growing demand to support these capabilities over personal data stored locally on devices. However, maintaining the necessary data structure associated with the embedding-based search is often infeasible due to its high storage overhead. For example, indexing 100 GB of raw data requires 150 to 700 GB of storage, making local deployment impractical. Reducing this overhead while maintaining search quality and latency becomes a critical challenge. In this paper, we present LEANN, a storage-efficient approximate nearest neighbor (ANN) search index optimized for resource-constrained personal devices. LEANN combines a compact graph-based structure with an efficient on-the-fly recomputation strategy to enable fast and accurate retrieval with minimal storage overhead. Our evaluation shows that LEANN reduces index size to under 5% of the original raw data, achieving up to 50 times smaller storage than standard indexes, while maintaining 90% top-3 recall in under 2 seconds on real-world question answering benchmarks.
[LG-53] The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks
链接: https://arxiv.org/abs/2506.08274
作者: João Manoel Herrera Pinheiro,Suzana Vilas Boas de Oliveira,Thiago Henrique Segreto Silva,Pedro Antonio Rabelo Saraiva,Enzo Ferreira de Souza,Leonardo André Ambrosio,Marcelo Becker
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 27 pages
点击查看摘要
Abstract:This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques, including several less common transformations, across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks. We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and R^2) and computational costs (training time, inference time, and memory usage). Key findings reveal that while ensemble methods (such as Random Forest and gradient boosting models like XGBoost, CatBoost and LightGBM) demonstrate robust performance largely independent of scaling, other widely used models such as Logistic Regression, SVMs, TabNet, and MLPs show significant performance variations that depend heavily on the chosen scaler. This extensive empirical analysis, with all source code, experimental results, and model parameters made publicly available to ensure complete transparency and reproducibility, offers crucial model-specific guidance to practitioners on the optimal selection of feature scaling techniques.
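A compact version of the kind of grid the paper runs, one model under several scalers on one dataset, using standard scikit-learn components (extend the scaler and model lists to approximate the full study):

```python
# Compare the same classifier under different feature scalers.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler, QuantileTransformer, RobustScaler, StandardScaler,
)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scalers = {
    "none": None,
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
    "quantile": QuantileTransformer(n_quantiles=100, random_state=0),
}
for name, scaler in scalers.items():
    steps = [scaler] if scaler is not None else []
    clf = make_pipeline(*steps, LogisticRegression(max_iter=5000))
    print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))
```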
[LG-54] Universal Differential Equations for Scientific Machine Learning of Node-Wise Battery Dynamics in Smart Grids
链接: https://arxiv.org/abs/2506.08272
作者: Tarushri N. S.
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Universal Differential Equations (UDEs), which blend neural networks with physical differential equations, have emerged as a powerful framework for scientific machine learning (SciML), enabling data-efficient, interpretable, and physically consistent modeling. In the context of smart grid systems, modeling node-wise battery dynamics remains a challenge due to the stochasticity of solar input and variability in household load profiles. Traditional approaches often struggle with generalization and fail to capture unmodeled residual dynamics. This work proposes a UDE-based approach to learn node-specific battery evolution by embedding a neural residual into a physically inspired battery ODE. Synthetic yet realistic solar generation and load demand data are used to simulate battery dynamics over time. The neural component learns to model unobserved or stochastic corrections arising from heterogeneity in node demand and environmental conditions. Comprehensive experiments reveal that the trained UDE aligns closely with ground truth battery trajectories, exhibits smooth convergence behavior, and maintains stability in long-term forecasts. These findings affirm the viability of UDE-based SciML approaches for battery modeling in decentralized energy networks and suggest broader implications for real-time control and optimization in renewable-integrated smart grids.
[LG-55] SWAT-NN: Simultaneous Weights and Architecture Training for Neural Networks in a Latent Space
链接: https://arxiv.org/abs/2506.08270
作者: Zitong Huang,Mansooreh Montazerin,Ajitesh Srivastava
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Designing neural networks typically relies on manual trial and error or a neural architecture search (NAS) followed by weight training. The former is time-consuming and labor-intensive, while the latter often discretizes architecture search and weight optimization. In this paper, we propose a fundamentally different approach that simultaneously optimizes both the architecture and the weights of a neural network. Our framework first trains a universal multi-scale autoencoder that embeds both architectural and parametric information into a continuous latent space, where functionally similar neural networks are mapped closer together. Given a dataset, we then randomly initialize a point in the embedding space and update it via gradient descent to obtain the optimal neural network, jointly optimizing its structure and weights. The optimization process incorporates sparsity and compactness penalties to promote efficient models. Experiments on synthetic regression tasks demonstrate that our method effectively discovers sparse and compact neural networks with strong performance.
[LG-56] Learning-Based Multiuser Scheduling in MIMO-OFDM Systems with Hybrid Beamforming
链接: https://arxiv.org/abs/2506.08263
作者: Pouya Agheli,Tugce Kobal,François Durand,Matthew Andrews
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: To appear in the proceedings of the European Conference on Networks and Communications (EuCNC) 6G Summit, 2025
点击查看摘要
Abstract:We investigate the multiuser scheduling problem in multiple-input multiple-output (MIMO) systems using orthogonal frequency division multiplexing (OFDM) and hybrid beamforming in which a base station (BS) communicates with multiple users over millimeter wave (mmWave) channels in the downlink. Improved scheduling is critical for enhancing spectral efficiency and the long-term performance of the system from the perspective of proportional fairness (PF) metric in hybrid beamforming systems due to its limited multiplexing gain. Our objective is to maximize PF by properly designing the analog and digital precoders within the hybrid beamforming and selecting the users subject to the number of radio frequency (RF) chains. Leveraging the characteristics of mmWave channels, we apply a two-timescale protocol. On a long timescale, we assign an analog beam to each user. Scheduling the users and designing the digital precoder are done accordingly on a short timescale. To conduct scheduling, we propose combinatorial solutions, such as greedy and sorting algorithms, followed by a machine learning (ML) approach. Our numerical results highlight the trade-off between the performance and complexity of the proposed approaches. Consequently, we show that the choice of approach depends on the specific criteria within a given scenario.
[LG-57] Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic
链接: https://arxiv.org/abs/2506.08243
作者: Zhenjiang Mao,Artem Bisliouk,Rohith Reddy Nama,Ivan Ruchkin
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown impressive performance in mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains like education, where users may lack the expertise to assess reasoning steps. To address this, we propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute robustness scores that serve as structured, interpretable confidence estimates. Our approach also introduces a set of uncertainty reshaping strategies to enforce smoothness, monotonicity, and causal consistency across the reasoning trajectory. Experiments show that our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
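A small sketch of the STL-flavored scoring: treat the stepwise confidences as a discrete-time signal and compute the min-robustness of an illustrative property such as "confidence never drops by more than eps between consecutive steps"; the property and eps are assumptions, not the paper's full STL specification.

```python
# Min-robustness of G(conf[t+1] - conf[t] >= -eps) over a confidence
# trajectory; a positive score means the property is satisfied, and its
# magnitude says by how much.
import numpy as np

def monotonicity_robustness(conf, eps=0.05):
    conf = np.asarray(conf, dtype=float)
    margins = np.diff(conf) + eps   # per-step satisfaction margins
    return margins.min()            # STL "globally" = min over time

print(monotonicity_robustness([0.55, 0.61, 0.60, 0.72]))  # 0.04 -> satisfied
```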
[LG-58] Dealing with the Evil Twins: Improving Random Augmentation by Addressing Catastrophic Forgetting of Diverse Augmentations
链接: https://arxiv.org/abs/2506.08240
作者: Dongkyu Cho,Rumi Chunara
类目: Machine Learning (cs.LG)
*备注: 12 pages, 6 figures
点击查看摘要
Abstract:Data augmentation is a promising tool for enhancing out-of-distribution generalization, where the key is to produce diverse, challenging variations of the source domain via costly targeted augmentations that maximize its generalization effect. Conversely, random augmentation is inexpensive but is deemed suboptimal due to its limited effect. In this paper, we revisit random augmentation and explore methods to address its shortcomings. We show that the stochastic nature of random augmentation can produce a set of colliding augmentations that distorts the learned features, similar to catastrophic forgetting. We propose a simple solution that improves the generalization effect of random augmentation by addressing forgetting, which displays strong generalization performance across various single source domain generalization (sDG) benchmarks.
[LG-59] Mondrian: Transformer Operators via Domain Decomposition
链接: https://arxiv.org/abs/2506.08226
作者: Arthur Feeney,Kuei-Hsiang Huang,Aparna Chandramowlishwaran
类目: Machine Learning (cs.LG)
*备注: 26 pages, 7 figures
点击查看摘要
Abstract:Operator learning enables data-driven modeling of partial differential equations (PDEs) by learning mappings between function spaces. However, scaling transformer-based operator models to high-resolution, multiscale domains remains a challenge due to the quadratic cost of attention and its coupling to discretization. We introduce Mondrian, transformer operators that decompose a domain into non-overlapping subdomains and apply attention over sequences of subdomain-restricted functions. Leveraging principles from domain decomposition, Mondrian decouples attention from discretization. Within each subdomain, it replaces standard layers with expressive neural operators, and attention across subdomains is computed via softmax-based inner products over functions. The formulation naturally extends to hierarchical windowed and neighborhood attention, supporting both local and global interactions. Mondrian achieves strong performance on Allen-Cahn and Navier-Stokes PDEs, demonstrating resolution scaling without retraining. These results highlight the promise of domain-decomposed attention for scalable and general-purpose neural operators.
[LG-60] What makes an Ensemble (Un) Interpretable? ICML2025
链接: https://arxiv.org/abs/2506.08216
作者: Shahaf Bassan,Guy Amir,Meirav Zehavi,Guy Katz
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*备注: To appear in ICML 2025
点击查看摘要
Abstract:Ensemble models are widely recognized in the ML community for their limited interpretability. For instance, while a single decision tree is considered interpretable, ensembles of trees (e.g., boosted trees) are often treated as black-boxes. Despite this folklore recognition, there remains a lack of rigorous mathematical understanding of what particularly makes an ensemble (un)-interpretable, including how fundamental factors like the (1) number, (2) size, and (3) type of base models influence its interpretability. In this work, we seek to bridge this gap by applying concepts from computational complexity theory to study the challenges of generating explanations for various ensemble configurations. Our analysis uncovers nuanced complexity patterns influenced by various factors. For example, we demonstrate that under standard complexity assumptions like P \neq NP, interpreting ensembles remains intractable even when base models are of constant size. Surprisingly, the complexity changes drastically with the number of base models: small ensembles of decision trees are efficiently interpretable, whereas interpreting ensembles with even a constant number of linear models remains intractable. We believe that our findings provide a more robust foundation for understanding the interpretability of ensembles, emphasizing the benefits of examining it through a computational complexity lens.
[LG-61] A Machine Learning Approach to Generate Residual Stress Distributions using Sparse Characterization Data in Friction-Stir Processed Parts
链接: https://arxiv.org/abs/2506.08205
作者: Shadab Anwar Shaikh,Kranthi Balusu,Ayoub Soulami
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
点击查看摘要
Abstract:Residual stresses, which remain within a component after processing, can deteriorate performance. Accurately determining their full-field distributions is essential for optimizing the structural integrity and longevity. However, the experimental effort required for full-field characterization is impractical. Given these challenges, this work proposes a machine learning (ML) based Residual Stress Generator (RSG) to infer full-field stresses from limited measurements. An extensive dataset was initially constructed by performing numerous process simulations with a diverse parameter set. A ML model based on U-Net architecture was then trained to learn the underlying structure through systematic hyperparameter tuning. Then, the model’s ability to generate simulated stresses was evaluated, and it was ultimately tested on actual characterization data to validate its effectiveness. The model’s prediction of simulated stresses shows that it achieved excellent predictive accuracy and exhibited a significant degree of generalization, indicating that it successfully learnt the latent structure of residual stress distribution. The RSG’s performance in predicting experimentally characterized data highlights the feasibility of the proposed approach in providing a comprehensive understanding of residual stress distributions from limited measurements, thereby significantly reducing experimental efforts.
[LG-62] Correlated Noise Mechanisms for Differentially Private Learning
链接: https://arxiv.org/abs/2506.08201
作者: Krishna Pillutla,Jalaj Upadhyay,Christopher A. Choquette-Choo,Krishnamurthy Dvijotham,Arun Ganesh,Monika Henzinger,Jonathan Katz,Ryan McKenna,H. Brendan McMahan,Keith Rush,Thomas Steinke,Abhradeep Thakurta
类目: Machine Learning (cs.LG)
*备注: 212 pages
点击查看摘要
Abstract:This monograph explores the design and analysis of correlated noise mechanisms for differential privacy (DP), focusing on their application to private training of AI and machine learning models via the core primitive of estimation of weighted prefix sums. While typical DP mechanisms inject independent noise into each step of a stochastic gradient (SGD) learning algorithm in order to protect the privacy of the training data, a growing body of recent research demonstrates that introducing (anti-)correlations in the noise can significantly improve privacy-utility trade-offs by carefully canceling out some of the noise added on earlier steps in subsequent steps. Such correlated noise mechanisms, known variously as matrix mechanisms, factorization mechanisms, and DP-Follow-the-Regularized-Leader (DP-FTRL) when applied to learning algorithms, have also been influential in practice, with industrial deployment at a global scale.
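The core primitive is easy to demonstrate numerically. Given a factorization S = AB of the prefix-sum matrix, the mechanism releases A(Bx + z); the sketch below compares the two extreme factorizations, B = I (independent noise per prefix sum) and B = S (input perturbation), and omits the per-user sensitivity calibration of the noise for brevity. Optimized factorizations used in practice sit between these extremes.

```python
# Noisy prefix-sum release via matrix factorization: output A(Bx + z)
# with A B = S, where S is the lower-triangular all-ones workload
# matrix. The error covariance is sigma^2 * A A^T, so the choice of
# factorization shapes how noise accumulates over steps.
import numpy as np

rng = np.random.default_rng(0)
T, sigma = 256, 1.0
S = np.tril(np.ones((T, T)))          # prefix-sum (workload) matrix
x = rng.normal(size=T)                # the gradient-like stream

def mechanism_error(B):
    A = S @ np.linalg.inv(B)          # ensure A @ B == S
    z = sigma * rng.normal(size=T)    # sensitivity calibration omitted
    noisy = A @ (B @ x + z)
    return np.abs(noisy - S @ x).mean()

print("B = I:", mechanism_error(np.eye(T)))  # error grows like sqrt(t)
print("B = S:", mechanism_error(S.copy()))   # constant per-step error
```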
[LG-63] Interpreting Agent Behaviors in Reinforcement-Learning-Based Cyber-Battle Simulation Platforms
链接: https://arxiv.org/abs/2506.08192
作者: Jared Claypoole,Steven Cheung,Ashish Gehani,Vinod Yegneswaran,Ahmad Ridley
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We analyze two open source deep reinforcement learning agents submitted to the CAGE Challenge 2 cyber defense challenge, where each competitor submitted an agent to defend a simulated network against each of several provided rules-based attack agents. We demonstrate that one can gain interpretability of agent successes and failures by simplifying the complex state and action spaces and by tracking important events, shedding light on the fine-grained behavior of both the defense and attack agents in each experimental scenario. By analyzing important events within an evaluation episode, we identify patterns in infiltration and clearing events that tell us how well the attacker and defender played their respective roles; for example, defenders were generally able to clear infiltrations within one or two timesteps of a host being exploited. By examining transitions in the environment’s state caused by the various possible actions, we determine which actions tended to be effective and which did not, showing that certain important actions are between 40% and 99% ineffective. We examine how decoy services affect exploit success, concluding for instance that decoys block up to 94% of exploits that would directly grant privileged access to a host. Finally, we discuss the realism of the challenge and ways that the CAGE Challenge 4 has addressed some of our concerns.
[LG-64] FedGA-Tree: Federated Decision Tree using Genetic Algorithm
链接: https://arxiv.org/abs/2506.08176
作者: Anh V Nguyen,Diego Klabjan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, with rising concerns over data privacy, Federated Learning has gained prominence, as it enables collaborative training without the aggregation of raw data from participating clients. However, much of the current focus has been on parametric gradient-based models, while nonparametric counterparts such as decision trees are relatively understudied. Existing methods for adapting decision trees to Federated Learning generally combine a greedy tree-building algorithm with differential privacy to produce a global model for all clients. These methods are limited to classification trees and categorical data due to the constraints of differential privacy. In this paper, we explore an alternative approach that utilizes a Genetic Algorithm to facilitate the construction of personalized decision trees and accommodate categorical and numerical data, thus allowing for both classification and regression trees. Comprehensive experiments demonstrate that our method surpasses decision trees trained solely on local data and a benchmark algorithm.
[LG-65] Federated Learning on Stochastic Neural Networks
链接: https://arxiv.org/abs/2506.08169
作者: Jingqiao Tang(1),Ryan Bausback(1),Feng Bao(1),Richard Archibald(2) ((1) Department of Mathematics at Florida State University, Tallahassee, Florida, USA, (2) Division of Computer Science and Mathematics, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA)
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 25 pages, 19 figures, Submitted to Journal of Machine Learning for Modeling and Computing
点击查看摘要
Abstract:Federated learning is a machine learning paradigm that leverages edge computing on client devices to optimize models while maintaining user privacy by ensuring that local data remains on the device. However, since all data is collected by clients, federated learning is susceptible to latent noise in local datasets. Factors such as limited measurement capabilities or human errors may introduce inaccuracies in client data. To address this challenge, we propose the use of a stochastic neural network as the local model within the federated learning framework. Stochastic neural networks not only facilitate the estimation of the true underlying states of the data but also enable the quantification of latent noise. We refer to our federated learning approach, which incorporates stochastic neural networks as local models, as Federated stochastic neural networks. We will present numerical experiments demonstrating the performance and effectiveness of our method, particularly in handling non-independent and identically distributed data.
[LG-66] BLUR: A Bi-Level Optimization Approach for LLM Unlearning
链接: https://arxiv.org/abs/2506.08164
作者: Hadi Reisizadeh,Jinghan Jia,Zhiqi Bu,Bhanukiran Vinzamuri,Anil Ramakrishna,Kai-Wei Chang,Volkan Cevher,Sijia Liu,Mingyi Hong
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there is growing interest in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain loss, but it often leads to performance degradation due to the inherent trade-off between forget and retain losses. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which unlearns certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model’s utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (BLUR), which not only possesses strong theoretical guarantees but, more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that BLUR consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics. Codes are available at this https URL.
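The hierarchical formulation can be approximated with a simple alternating scheme: several lower-level steps on the forget loss, then an upper-level step on the retain loss. The PyTorch sketch below is a schematic of this prioritization only; BLUR’s actual update rule and its guarantees are given in the paper, and the loss callables here are placeholders.

```python
import torch

def bilevel_unlearn_round(model, forget_batch, retain_batch,
                          forget_loss, retain_loss,
                          inner_steps=5, lr_inner=1e-4, lr_outer=1e-5):
    # Lower level: the forget objective takes priority.
    inner_opt = torch.optim.SGD(model.parameters(), lr=lr_inner)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        forget_loss(model, forget_batch).backward()
        inner_opt.step()
    # Upper level: one utility-preserving step on retain data.
    outer_opt = torch.optim.SGD(model.parameters(), lr=lr_outer)
    outer_opt.zero_grad()
    retain_loss(model, retain_batch).backward()
    outer_opt.step()
    return model
```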
[LG-67] Fully data-driven inverse hyperelasticity with hyper-network neural ODE fields
链接: https://arxiv.org/abs/2506.08146
作者: Vahidullah Taç,Amirhossein Amiri-Hezaveh,Manuel K. Rausch,Grace N. Bechtel,Francisco Sahli Costabal,Adrian Buganza Tepole
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a new framework for identifying mechanical properties of heterogeneous materials without a closed-form constitutive equation. Given a full-field measurement of the displacement field, for instance as obtained from digital image correlation (DIC), a continuous approximation of the strain field is obtained by training a neural network that incorporates Fourier features to effectively capture sharp gradients in the data. A physics-based data-driven method built upon neural ordinary differential equations (NODEs) is employed to discover constitutive equations. The NODE framework can represent arbitrary materials while satisfying constraints in the theory of constitutive equations by default. To account for heterogeneity, a hyper-network is defined, where the input is the material coordinate system, and the output is the NODE-based constitutive equation. The parameters of the hyper-network are optimized by minimizing a multi-objective loss function that includes penalty terms for violations of the strong form of the equilibrium equations of elasticity and the associated Neumann boundary conditions. We showcase the framework with several numerical examples, including heterogeneity arising from variations in material parameters, spatial transitions from isotropy to anisotropy, material identification in the presence of noise, and, ultimately, application to experimental data. As the numerical results suggest, the proposed approach is robust and general in identifying the mechanical properties of heterogeneous materials with very few assumptions, making it a suitable alternative to classical inverse methods.
[LG-68] Accelerating Spectral Clustering under Fairness Constraints ICML2025
链接: https://arxiv.org/abs/2506.08143
作者: Francesco Tonin,Alex Lambert,Johan A. K. Suykens,Volkan Cevher
类目: Machine Learning (cs.LG)
*备注: ICML 2025
点击查看摘要
Abstract:Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications.
[LG-69] Lite-RVFL: A Lightweight Random Vector Functional-Link Neural Network for Learning Under Concept Drift
链接: https://arxiv.org/abs/2506.08063
作者: Songqiao Hu,Zeyi Liu,Xiao He
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages, 4 figures, accepted by the 2025 CAA Symposium on Fault Detection, Supervision and Safety for Technical Processes (SAFEPROCESS 2025)
点击查看摘要
Abstract:The change in data distribution over time, also known as concept drift, poses a significant challenge to the reliability of online learning methods. Existing methods typically require model retraining or drift detection, both of which demand high computational costs and are often unsuitable for real-time applications. To address these limitations, a lightweight, fast and efficient random vector functional-link network termed Lite-RVFL is proposed, capable of adapting to concept drift without drift detection and retraining. Lite-RVFL introduces a novel objective function that assigns exponentially increasing weights to new samples, thereby emphasizing recent data and enabling timely adaptation. Theoretical analysis confirms the feasibility of this objective function for drift adaptation, and an efficient incremental update rule is derived. Experimental results on a real-world safety assessment task validate Lite-RVFL’s efficiency, its effectiveness in adapting to drift, and its potential to capture temporal patterns. The source code is available at this https URL.
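Because RVFL output weights have a closed-form ridge solution, the exponential-weighting idea reduces to a weighted least-squares solve. A minimal numpy sketch, with the activation and the exact weighting scheme as assumptions (the paper derives an incremental update rather than refitting from scratch):

```python
import numpy as np

rng = np.random.default_rng(0)

def lite_rvfl_fit(X, y, n_hidden=100, gamma=1.05, reg=1e-3):
    # Random, fixed hidden layer plus a direct input-output link.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    Phi = np.hstack([X, np.tanh(X @ W + b)])
    # Exponentially increasing sample weights emphasize recent samples.
    w = gamma ** np.arange(len(X))
    G = Phi.T @ (Phi * w[:, None]) + reg * np.eye(Phi.shape[1])
    beta = np.linalg.solve(G, Phi.T @ (w * y))   # weighted ridge solve
    return W, b, beta

def lite_rvfl_predict(X, W, b, beta):
    return np.hstack([X, np.tanh(X @ W + b)]) @ beta
```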
[LG-70] ST-GraphNet: A Spatio-Temporal Graph Neural Network for Understanding and Predicting Automated Vehicle Crash Severity
链接: https://arxiv.org/abs/2506.08051
作者: Mahmuda Sultana Mimi,Md Monzurul Islam,Anannya Ghosh Tusti,Shriyank Somvanshi,Subasish Das
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Understanding the spatial and temporal dynamics of automated vehicle (AV) crash severity is critical for advancing urban mobility safety and infrastructure planning. In this work, we introduce ST-GraphNet, a spatio-temporal graph neural network framework designed to model and predict AV crash severity by using both fine-grained and region-aggregated spatial graphs. Using a balanced dataset of 2,352 real-world AV-related crash reports from Texas (2024), including geospatial coordinates, crash timestamps, SAE automation levels, and narrative descriptions, we construct two complementary graph representations: (1) a fine-grained graph with individual crash events as nodes, where edges are defined via spatio-temporal proximity; and (2) a coarse-grained graph where crashes are aggregated into Hexagonal Hierarchical Spatial Indexing (H3)-based spatial cells, connected through hexagonal adjacency. Each node in the graph is enriched with multimodal data, including semantic, spatial, and temporal attributes such as textual embeddings from crash narratives using a pretrained Sentence-BERT model. We evaluate various graph neural network (GNN) architectures, such as Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Dynamic Spatio-Temporal GCN (DSTGCN), to classify crash severity and predict high-risk regions. Our proposed ST-GraphNet, which utilizes a DSTGCN backbone on the coarse-grained H3 graph, achieves a test accuracy of 97.74%, substantially outperforming the best fine-grained model (64.7% test accuracy). These findings highlight the effectiveness of spatial aggregation, dynamic message passing, and multi-modal feature integration in capturing the complex spatio-temporal patterns underlying AV crash severity.
[LG-71] Feasibility Study of CNNs and MLPs for Radiation Heat Transfer in 2-D Furnaces with Spectrally Participative Gases
链接: https://arxiv.org/abs/2506.08033
作者: Axel TahmasebiMoradi,Vincent Ren,Benjamin Le-Creurer,Chetra Mang
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Aiming to reduce the computational cost of numerical simulations, a convolutional neural network (CNN) and a multi-layer perceptron (MLP) are introduced to build a surrogate model to approximate radiative heat transfer solutions in a 2-D walled domain with participative gases. The originality of this work lies in the adaptation of the inputs of the problem (gas and wall properties) to fit the CNN architecture, more commonly used for image processing. Two precision datasets have been created with the classical solver, ICARUS2D, which uses the discrete transfer radiation method with the statistical narrow bands model. The performance of the CNN architecture is compared to a more classical MLP architecture in terms of speed and accuracy. Thanks to Optuna, all results are obtained using networks with optimized hyperparameters. The results show a significant speedup with industrially acceptable relative errors compared to the classical solver for both architectures. Additionally, the CNN outperforms the MLP in terms of precision and is more robust and stable to changes in hyperparameters. A performance analysis on dataset size has also been carried out to gain a deeper understanding of the model behavior.
[LG-72] FlowBERT: Prompt-tuned BERT for variable flow field prediction
链接: https://arxiv.org/abs/2506.08021
作者: Weihao Zou,Weibing Feng,Pin Wu
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
点击查看摘要
Abstract:This study proposes a universal flow field prediction framework based on knowledge transfer from large language model (LLM), addressing the high computational costs of traditional computational fluid dynamics (CFD) methods and the limited cross-condition transfer capability of existing deep learning models. The framework innovatively integrates Proper Orthogonal Decomposition (POD) dimensionality reduction with fine-tuning strategies for pretrained LLM, where POD facilitates compressed representation of flow field features while the fine-tuned model learns to encode system dynamics in state space. To enhance the model’s adaptability to flow field data, we specifically designed fluid dynamics-oriented text templates that improve predictive performance through enriched contextual semantic information. Experimental results demonstrate that our framework outperforms conventional Transformer models in few-shot learning scenarios while exhibiting exceptional generalization across various inflow conditions and airfoil geometries. Ablation studies reveal the contributions of key components in the FlowBERT architecture. Compared to traditional Navier-Stokes equation solvers requiring hours of computation, our approach reduces prediction time to seconds while maintaining over 90% accuracy. The developed knowledge transfer paradigm establishes a new direction for rapid fluid dynamics prediction, with potential applications extending to aerodynamic optimization, flow control, and other engineering domains.
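The POD compression step is standard and SVD-based; a minimal numpy sketch is below (how the modal coefficients are fed to the fine-tuned LLM, and the text templates themselves, are outside this snippet):

```python
import numpy as np

def pod_reduce(snapshots, n_modes):
    """snapshots: (n_snapshots, n_points) matrix of flattened flow fields.
    Returns the mean field, spatial POD modes, and modal coefficients."""
    mean = snapshots.mean(axis=0)
    _, _, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
    modes = Vt[:n_modes]                    # dominant spatial structures
    coeffs = (snapshots - mean) @ modes.T   # low-dimensional representation
    return mean, modes, coeffs

def pod_reconstruct(mean, modes, coeffs):
    return mean + coeffs @ modes            # back to full flow fields
```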
[LG-73] GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity
链接: https://arxiv.org/abs/2210.16402
作者: Artavazd Maranjyan,Mher Safaryan,Peter Richtárik
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing clients to perform multiple local gradient-type training steps before communication. In a recent breakthrough, Mishchenko et al. (2022) proved that local training, when properly executed, leads to provable communication acceleration, and this holds in the strongly convex regime without relying on any data similarity assumptions. However, their ProxSkip method requires all clients to take the same number of local training steps in each communication round. We propose a redesign of the ProxSkip method, allowing clients with “less important” data to get away with fewer local training steps without impacting the overall communication complexity of the method. In particular, we prove that our modified method, GradSkip, converges linearly under the same assumptions and has the same accelerated communication complexity, while the number of local gradient steps can be reduced relative to a local condition number. We further generalize our method by extending the randomness of probabilistic alternations to arbitrary unbiased compression operators and by considering a generic proximable regularizer. This generalization, which we call GradSkip+, recovers several related methods in the literature as special cases. Finally, we present an empirical study on carefully designed toy problems that confirms our theoretical claims.
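The core departure from ProxSkip is that clients may take different numbers of local steps. The numpy sketch below shows only this heterogeneous-local-steps skeleton; GradSkip’s probabilistic alternations and control-variate corrections, which the theory depends on, are omitted.

```python
import numpy as np

def communication_round(x_server, client_grads, local_steps, lr=0.1):
    """One round: client i runs local_steps[i] local gradient steps from
    the server model, then the server averages the client iterates."""
    iterates = []
    for grad_fn, k in zip(client_grads, local_steps):
        x = x_server.copy()
        for _ in range(k):          # "less important" clients use smaller k
            x -= lr * grad_fn(x)
        iterates.append(x)
    return np.mean(iterates, axis=0)
```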
[LG-74] Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction ICML2025
链接: https://arxiv.org/abs/2506.08954
作者: Ruben Weitzman,Peter Mørch Groth,Lood Van Niekerk,Aoi Otani,Yarin Gal,Debora Marks,Pascal Notin
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: Accepted at ICML 2025
点击查看摘要
Abstract:Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertion-deletion patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time, offering a scalable alternative to alignment-centric approaches.
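The two-orders-of-magnitude speedup comes from swapping MSA search for vector search over sequence embeddings. A toy exact-cosine version (production systems would use learned Protriever embeddings and an approximate nearest-neighbor index):

```python
import numpy as np

def retrieve_homologs(query_emb, db_embs, k=16):
    # Exact cosine-similarity search over a database of embeddings.
    q = query_emb / np.linalg.norm(query_emb)
    D = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return np.argsort(-(D @ q))[:k]   # indices of the k nearest homologs
```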
[LG-75] Real-Time Cascade Mitigation in Power Systems Using Influence Graph Improved by Reinforcement Learning
链接: https://arxiv.org/abs/2506.08893
作者: Kai Zhou,Youbiao He,Chong Zhong,Yifu Wu
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
点击查看摘要
Abstract:Despite high reliability, modern power systems with growing renewable penetration face an increasing risk of cascading outages. Real-time cascade mitigation requires fast, complex operational decisions under uncertainty. In this work, we extend the influence graph into a Markov decision process (MDP) model for real-time mitigation of cascading outages in power transmission systems, accounting for uncertainties in generation, load, and initial contingencies. The MDP includes a do-nothing action to allow for conservative decision-making and is solved using reinforcement learning. We present a policy gradient learning algorithm initialized with a policy corresponding to the unmitigated case and designed to handle invalid actions. The proposed learning method converges faster than the conventional algorithm. Through careful reward design, we learn a policy that takes conservative actions without deteriorating system conditions. The model is validated on the IEEE 14-bus and IEEE 118-bus systems. The results show that proactive line disconnections can effectively reduce cascading risk, and certain lines consistently emerge as critical in mitigating cascade propagation.
[LG-76] syren-baryon: Analytic emulators for the impact of baryons on the matter power spectrum
链接: https://arxiv.org/abs/2506.08783
作者: Lukas Kammerer,Deaglan J. Bartlett,Gabriel Kronberger,Harry Desmond,Pedro G. Ferreira
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 14 pages, 6 figures. Submitted to A&A
点击查看摘要
Abstract:Baryonic physics has a considerable impact on the distribution of matter in our Universe on scales probed by current and future cosmological surveys, acting as a key systematic in such analyses. We seek simple symbolic parametrisations for the impact of baryonic physics on the matter power spectrum for a range of physically motivated models, as a function of wavenumber, redshift, cosmology, and parameters controlling the baryonic feedback. We use symbolic regression to construct analytic approximations for the ratio of the matter power spectrum in the presence of baryons to that without such effects. We obtain separate functions of each of four distinct sub-grid prescriptions of baryonic physics from the CAMELS suite of hydrodynamical simulations (Astrid, IllustrisTNG, SIMBA and Swift-EAGLE) as well as for a baryonification algorithm. We also provide functions which describe the uncertainty on these predictions, due to both the stochastic nature of baryonic physics and the errors on our fits. The error on our approximations to the hydrodynamical simulations is comparable to the sample variance estimated through varying initial conditions, and our baryonification expression has a root mean squared error of better than one percent, although this increases on small scales. These errors are comparable to those of previous numerical emulators for these models. Our expressions are enforced to have the physically correct behaviour on large scales and at high redshift. Due to their analytic form, we are able to directly interpret the impact of varying cosmology and feedback parameters, and we can identify parameters which have little to no effect. Each function is based on a different implementation of baryonic physics, and can therefore be used to discriminate between these models when applied to real data. We provide publicly available code for all symbolic approximations found.
[LG-77] Superposed Parameterised Quantum Circuits
链接: https://arxiv.org/abs/2506.08749
作者: Viktoria Patapovich,Mo Kordzanganeh,Alexey Melnikov
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 20 pages, 6 figures, 3 tables
点击查看摘要
Abstract:Quantum machine learning has shown promise for high-dimensional data analysis, yet many existing approaches rely on linear unitary operations and shared trainable parameters across outputs. These constraints limit expressivity and scalability relative to the multi-layered, non-linear architectures of classical deep networks. We introduce superposed parameterised quantum circuits to overcome these limitations. By combining flip-flop quantum random-access memory with repeat-until-success protocols, a superposed parameterised quantum circuit embeds an exponential number of parameterised sub-models in a single circuit and induces polynomial activation functions through amplitude transformations and post-selection. We provide an analytic description of the architecture, showing how multiple parameter sets are trained in parallel while non-linear amplitude transformations broaden representational power beyond conventional quantum kernels. Numerical experiments underscore these advantages: on a 1D step-function regression a two-qubit superposed parameterised quantum circuit cuts the mean-squared error by three orders of magnitude versus a parameter-matched variational baseline; on a star-shaped two-dimensional classification task, introducing a quadratic activation lifts accuracy to 81.4% and reduces run-to-run variance three-fold. These results position superposed parameterised quantum circuits as a hardware-efficient route toward deeper, more versatile parameterised quantum circuits capable of learning complex decision boundaries.
[LG-78] Flexible and Efficient Drift Detection without Labels
链接: https://arxiv.org/abs/2506.08734
作者: Nelvin Tan,Yu-Ching Shih,Dong Yang,Amol Salunkhe
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:Machine learning models are increasingly used to automate decisions in almost every domain, and maintaining the performance of these models is crucial for delivering high-quality machine-learning-enabled services. Detecting concept drift early is thus of the highest importance. Much research on concept drift has focused on the supervised case, which assumes the true labels are available immediately after making predictions. Controlling for false positives while periodically monitoring the performance of predictive models used to make inferences from extremely large datasets, where the true labels are not instantly available, becomes extremely challenging. We propose a flexible and efficient concept drift detection algorithm that uses classical statistical process control in a label-less setting to accurately detect concept drifts. We show empirically that, under computational constraints, our approach has better statistical power than previously known methods. Furthermore, we introduce a new drift detection framework to model the scenario of detecting drift (without labels) given prior detections, and show how our drift detection algorithm can be incorporated effectively into this framework. We demonstrate promising performance via numerical simulations.
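Classical statistical process control on a label-free statistic (for example, the model's confidence scores) can be as simple as a one-sided CUSUM chart. The sketch below uses textbook allowance and threshold defaults, which are not necessarily the paper's choices:

```python
import numpy as np

def cusum_alarms(scores, ref_mean, ref_std, k=0.5, h=5.0):
    """One-sided CUSUM on a monitored statistic; k is the allowance and
    h the decision threshold, both in standard-deviation units."""
    s, alarms = 0.0, []
    for t, x in enumerate(scores):
        s = max(0.0, s + (x - ref_mean) / ref_std - k)
        if s > h:
            alarms.append(t)   # potential drift detected at time t
            s = 0.0            # restart monitoring after the alarm
    return alarms
```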
[LG-79] A Privacy-Preserving Federated Learning Framework for Generalizable CBCT to Synthetic CT Translation in Head and Neck
链接: https://arxiv.org/abs/2506.08654
作者: Ciro Benito Raggio,Paolo Zaffino,Maria Francesca Spadea
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Cone-beam computed tomography (CBCT) has become a widely adopted modality for image-guided radiotherapy (IGRT). However, CBCT suffers from increased noise, limited soft-tissue contrast, and artifacts, resulting in unreliable Hounsfield unit values and hindering direct dose calculation. Synthetic CT (sCT) generation from CBCT addresses these issues, especially using deep learning (DL) methods. Existing approaches are limited by institutional heterogeneity, scanner-dependent variations, and data privacy regulations that prevent multi-center data sharing. To overcome these challenges, we propose a cross-silo horizontal federated learning (FL) approach for CBCT-to-sCT synthesis in the head and neck region, extending our FedSynthCT framework. A conditional generative adversarial network was collaboratively trained on data from three European medical centers in the public SynthRAD2025 challenge dataset. The federated model demonstrated effective generalization across centers, with mean absolute error (MAE) ranging from 64.38\pm13.63 to 85.90\pm7.10 HU, structural similarity index (SSIM) from 0.882\pm0.022 to 0.922\pm0.039 , and peak signal-to-noise ratio (PSNR) from 32.86\pm0.94 to 34.91\pm1.04 dB. Notably, on an external validation dataset of 60 patients, comparable performance was achieved (MAE: 75.22\pm11.81 HU, SSIM: 0.904\pm0.034 , PSNR: 33.52\pm2.06 dB) without additional training, confirming robust generalization despite protocol, scanner differences and registration errors. These findings demonstrate the technical feasibility of FL for CBCT-to-sCT synthesis while preserving data privacy and offer a collaborative solution for developing generalizable models across institutions without centralized data sharing or site-specific fine-tuning.
[LG-80] Generalizing while preserving monotonicity in comparison-based preference learning models
链接: https://arxiv.org/abs/2506.08616
作者: Julien Fageot,Peva Blanchard,Gilles Bareilles,Lê-Nguyên Hoang
类目: Statistics Theory (math.ST); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:If you tell a learning model that you prefer an alternative a over another alternative b , then you probably expect the model to be monotone, that is, the valuation of a increases, and that of b decreases. Yet, perhaps surprisingly, many widely deployed comparison-based preference learning models, including large language models, fail to have this guarantee. Until now, the only comparison-based preference learning algorithms that were proved to be monotone are the Generalized Bradley-Terry models. Yet, these models are unable to generalize to uncompared data. In this paper, we advance the understanding of the set of models with generalization ability that are monotone. Namely, we propose a new class of Linear Generalized Bradley-Terry models with Diffusion Priors, and identify sufficient conditions on alternatives’ embeddings that guarantee monotonicity. Our experiments show that this monotonicity is far from being a general guarantee, and that our new class of generalizing models improves accuracy, especially when the dataset is limited.
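The monotonicity property is easy to see in the plain Bradley-Terry SGD update, where observing "a preferred over b" always raises the winner's score and lowers the loser's; the paper's contribution is preserving this while generalizing to uncompared alternatives. A one-step numpy illustration:

```python
import numpy as np

def bradley_terry_step(theta, i, j, lr=0.1):
    """One SGD step on log sigma(theta_i - theta_j) for the observation
    'alternative i is preferred over alternative j'."""
    p = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))  # P(i beats j)
    theta[i] += lr * (1.0 - p)   # winner's valuation increases
    theta[j] -= lr * (1.0 - p)   # loser's valuation decreases
    return theta
```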
[LG-81] Optimization over Sparse Support-Preserving Sets: Two-Step Projection with Global Optimality Guarantees ICML2025
链接: https://arxiv.org/abs/2506.08558
作者: William de Vazelhes,Xiao-Tong Yuan,Bin Gu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Accepted for publication at ICML 2025
点击查看摘要
Abstract:In sparse optimization, enforcing hard constraints using the \ell_0 pseudo-norm offers advantages like controlled sparsity compared to convex relaxations. However, many real-world applications demand not only sparsity constraints but also some extra constraints. While prior algorithms have been developed to address this complex scenario with mixed combinatorial and convex constraints, they typically require the closed-form projection onto the mixed constraints, which might not exist, and/or only provide local guarantees of convergence, which is different from the global guarantees commonly sought in sparse optimization. To fill this gap, in this paper, we study the problem of sparse optimization with extra support-preserving constraints commonly encountered in the literature. We present a new variant of the iterative hard-thresholding algorithm equipped with a two-step consecutive projection operator customized for these mixed constraints, serving as a simple alternative to the Euclidean projection onto the mixed constraint. By introducing a novel trade-off between sparsity relaxation and sub-optimality, we provide global guarantees in objective value for the output of our algorithm, in the deterministic, stochastic, and zeroth-order settings, under the conventional restricted strong-convexity/smoothness assumptions. As a fundamental contribution in proof techniques, we develop a novel extension of the classic three-point lemma to the considered two-step non-convex projection operator, which allows us to analyze the convergence in objective value in an elegant way that has not been possible with existing techniques. In the zeroth-order case, such technique also improves upon the state-of-the-art result from de Vazelhes et al. (2022), even in the case without additional constraints, by allowing us to remove a non-vanishing system error present in their work.
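A schematic of the two-step consecutive projection inside iterative hard thresholding, under the assumption that the extra projection preserves the chosen support (e.g. clipping to a box); the paper's stochastic and zeroth-order variants and their guarantees are not captured here:

```python
import numpy as np

def two_step_iht(grad, proj_extra, x0, k, lr=0.1, iters=200):
    x = x0.copy()
    for _ in range(iters):
        x = x - lr * grad(x)
        keep = np.argsort(np.abs(x))[-k:]   # step 1: keep k largest entries
        mask = np.zeros_like(x)
        mask[keep] = 1.0
        x = proj_extra(x * mask)            # step 2: support-preserving projection
    return x

# Example extra constraint: a box [0, 1], projected by clipping.
# x_hat = two_step_iht(grad, lambda v: np.clip(v, 0.0, 1.0), x0, k=10)
```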
[LG-82] Asymptotic Normality of Infinite Centered Random Forests - Application to Imbalanced Classification
链接: https://arxiv.org/abs/2506.08548
作者: Moria Mayala(LPSM (UMR_8001)),Erwan Scornet(LPSM (UMR_8001)),Charles Tillier(LMV),Olivier Wintenberger(LPSM (UMR_8001))
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Many classification tasks involve imbalanced data, in which a class is largely underrepresented. Several techniques consist of creating a rebalanced dataset on which a classifier is trained. In this paper, we study such a procedure theoretically, when the classifier is a Centered Random Forest (CRF). We establish a Central Limit Theorem (CLT) on the infinite CRF with explicit rates and exact constant. We then prove that the CRF trained on the rebalanced dataset exhibits a bias, which can be removed with appropriate techniques. Based on an importance sampling (IS) approach, the resulting debiased estimator, called IS-ICRF, satisfies a CLT centered at the prediction function value. For high imbalance settings, we prove that the IS-ICRF estimator enjoys a variance reduction compared to the ICRF trained on the original data. Therefore, our theoretical analysis highlights the benefits of training random forests on a rebalanced dataset (followed by a debiasing procedure) compared to using the original data. Our theoretical results, especially the variance rates and the variance reduction, appear to be valid for Breiman’s random forests in our experiments.
[LG-83] The interplay of robustness and generalization in quantum machine learning
链接: https://arxiv.org/abs/2506.08455
作者: Julian Berberich,Tobias Fellner,Christian Holm
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:While adversarial robustness and generalization have individually received substantial attention in the recent literature on quantum machine learning, their interplay is much less explored. In this chapter, we address this interplay for variational quantum models, which were recently proposed as function approximators in supervised learning. We discuss recent results quantifying both robustness and generalization via Lipschitz bounds, which explicitly depend on model parameters. Thus, they give rise to a regularization-based training approach for robust and generalizable quantum models, highlighting the importance of trainable data encoding strategies. The practical implications of the theoretical results are demonstrated with an application to time series analysis.
[LG-84] Systematic and Efficient Construction of Quadratic Unconstrained Binary Optimization Forms for High-order and Dense Interactions
链接: https://arxiv.org/abs/2506.08448
作者: Hyakka Nakada,Shu Tanaka
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Quantum Annealing (QA) can efficiently solve combinatorial optimization problems whose objective functions are represented by Quadratic Unconstrained Binary Optimization (QUBO) formulations. For broader applicability of QA, quadratization methods are used to transform higher-order problems into QUBOs. However, quadratization methods for complex problems involving Machine Learning (ML) remain largely unknown. In these problems, strong nonlinearity and dense interactions prevent conventional methods from being applied. Therefore, we model target functions by the sum of rectified linear unit bases, which not only have the ability of universal approximation, but also have an equivalent quadratic-polynomial representation. In this study, the proof of concept is verified both numerically and analytically. In addition, by combining QA with the proposed quadratization, we design a new black-box optimization scheme, in which ML surrogate regressors are inputted to QA after the quadratization process.
[LG-85] Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings
链接: https://arxiv.org/abs/2506.08428
作者: Evan Markou,Thalaiyasingam Ajanthan,Stephen Gould
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 37 pages, 5 figures
点击查看摘要
Abstract:Many high-dimensional optimisation problems exhibit rich geometric structures in their set of minimisers, often forming smooth manifolds due to over-parametrisation or symmetries. When this structure is known, at least locally, it can be exploited through reduction mappings that reparametrise part of the parameter space to lie on the solution manifold. These reductions naturally arise from inner optimisation problems and effectively remove redundant directions, yielding a lower-dimensional objective. In this work, we introduce a general framework to understand how such reductions influence the optimisation landscape. We show that well-designed reduction mappings improve curvature properties of the objective, leading to better-conditioned problems and theoretically faster convergence for gradient-based methods. Our analysis unifies a range of scenarios where structural information at optimality is leveraged to accelerate convergence, offering a principled explanation for the empirical gains observed in such optimisation algorithms.
[LG-86] Mic-hackathon 2024: Hackathon on Machine Learning for Electron and Scanning Probe Microscopy
链接: https://arxiv.org/abs/2506.08423
作者: Utkarsh Pratiush,Austin Houston,Kamyar Barakati,Aditya Raghavan,Dasol Yoon,Harikrishnan KP,Zhaslan Baraissov,Desheng Ma,Samuel S. Welborn,Mikolaj Jakowski,Shawn-Patrick Barhorst,Alexander J. Pattison,Panayotis Manganaris,Sita Sirisha Madugula,Sai Venkata Gayathri Ayyagari,Vishal Kennedy,Ralph Bulanadi,Michelle Wang,Kieran J. Pang,Ian Addison-Smith,Willy Menacho,Horacio V. Guzman,Alexander Kiefer,Nicholas Furth,Nikola L. Kolev,Mikhail Petrov,Viktoriia Liu,Sergey Ilyev,Srikar Rairao,Tommaso Rodani,Ivan Pinto-Huguet,Xuli Chen,Josep Cruañes,Marta Torrens,Jovan Pomar,Fanzhi Su,Pawan Vedanti,Zhiheng Lyu,Xingzhi Wang,Lehan Yao,Amir Taqieddin,Forrest Laskowski,Xiangyu Yin,Yu-Tsun Shao,Benjamin Fein-Ashley,Yi Jiang,Vineet Kumar,Himanshu Mishra,Yogesh Paul,Adib Bazgir,Rama chandra Praneeth Madugula,Yuwen Zhang,Pravan Omprakash,Jian Huang,Eric Montufar-Morales,Vivek Chawla,Harshit Sethi,Jie Huang,Lauri Kurki,Grace Guinan,Addison Salvador,Arman Ter-Petrosyan,Madeline Van Winkle,Steven R. Spurgeon,Ganesh Narasimha,Zijie Wu,Richard Liu,Yongtao Liu,Boris Slautin,Andrew R Lupini,Rama Vasudevan,Gerd Duscher,Sergei V. Kalinin
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注:
点击查看摘要
Abstract:Microscopy is a primary source of information on materials structure and functionality at nanometer and atomic scales. The data generated is often well-structured, enriched with metadata and sample histories, though not always consistent in detail or format. The adoption of Data Management Plans (DMPs) by major funding agencies promotes preservation and access. However, deriving insights remains difficult due to the lack of standardized code ecosystems, benchmarks, and integration strategies. As a result, data usage is inefficient and analysis time is extensive. In addition to post-acquisition analysis, new APIs from major microscope manufacturers enable real-time, ML-based analytics for automated decision-making and ML-agent-controlled microscope operation. Yet, a gap remains between the ML and microscopy communities, limiting the impact of these methods on physics, materials discovery, and optimization. Hackathons help bridge this divide by fostering collaboration between ML researchers and microscopy experts. They encourage the development of novel solutions that apply ML to microscopy, while preparing a future workforce for instrumentation, materials science, and applied ML. This hackathon produced benchmark datasets and digital twins of microscopes to support community growth and standardized workflows. All related code is available at GitHub: this https URL
[LG-87] TS-PIELM: Time-Stepping Physics-Informed Extreme Learning Machine Facilitates Soil Consolidation Analyses
链接: https://arxiv.org/abs/2506.08381
作者: He Yang,Fei Ren,Hai-Sui Yu,Xueyu Geng,Pei-Zhi Zhuang
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accuracy and efficiency of the conventional physics-informed neural network (PINN) need to be improved before it can be a competitive alternative for soil consolidation analyses. This paper aims to overcome these limitations by proposing a highly accurate and efficient physics-informed machine learning (PIML) approach, termed time-stepping physics-informed extreme learning machine (TS-PIELM). In the TS-PIELM framework the consolidation process is divided into numerous time intervals, which helps overcome the limitation of PIELM in solving differential equations with sharp gradients. To accelerate network training, the solution is approximated by a single-layer feedforward extreme learning machine (ELM), rather than using a fully connected neural network in PINN. The input layer weights of the ELM network are generated randomly and fixed during the training process. Subsequently, the output layer weights are directly computed by solving a system of linear equations, which significantly enhances the training efficiency compared to the time-consuming gradient descent method in PINN. Finally, the superior performance of TS-PIELM is demonstrated by solving three typical Terzaghi consolidation problems. Compared to PINN, results show that the computational efficiency and accuracy of the novel TS-PIELM framework are improved by more than 1000 times and 100 times for one-dimensional cases, respectively. This paper provides compelling evidence that PIML can be a powerful tool for computational geotechnics.
[LG-88] Solving Convex-Concave Problems with \tilde{\mathcal{O}}(\epsilon^{-4/7}) Second-Order Oracle Complexity COLT2025
链接: https://arxiv.org/abs/2506.08362
作者: Lesi Chen,Chengchang Liu,Luo Luo,Jingzhao Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: COLT 2025
点击查看摘要
Abstract:Previous algorithms can solve convex-concave minimax problems \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y) with \mathcal{O}(\epsilon^{-2/3}) second-order oracle calls using Newton-type methods. This result has been speculated to be optimal because the upper bound is achieved by a natural generalization of the optimal first-order method. In this work, we show an improved upper bound of \tilde{\mathcal{O}}(\epsilon^{-4/7}) by generalizing the optimal second-order method for convex optimization to solve the convex-concave minimax problem. We further apply a similar technique to lazy Hessian algorithms and show that our proposed algorithm can also be seen as a second-order “Catalyst” framework (Lin et al., JMLR 2018) that could accelerate any globally convergent algorithms for solving minimax problems.
[LG-89] midr: Learning from Black-Box Models by Maximum Interpretation Decomposition
链接: https://arxiv.org/abs/2506.08338
作者: Ryoichi Asashiba,Reiji Kozuma,Hirokazu Iwasawa
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 20 pages, 10 figures
点击查看摘要
Abstract:The use of appropriate methods of Interpretable Machine Learning (IML) and eXplainable Artificial Intelligence (XAI) is essential for adopting black-box predictive models in fields where model and prediction explainability is required. As a novel tool for interpreting black-box models, we introduce the R package midr, which implements Maximum Interpretation Decomposition (MID). MID is a functional decomposition approach that derives a low-order additive representation of a black-box model by minimizing the squared error between the model’s prediction function and this additive representation. midr enables learning from black-box models by constructing a global surrogate model with advanced analytical capabilities. After reviewing related work and the theoretical foundation of MID, we demonstrate the package’s usage and discuss some of its key features.
[LG-90] Model-Free Kernel Conformal Depth Measures Algorithm for Uncertainty Quantification in Regression Models in Separable Hilbert Spaces
链接: https://arxiv.org/abs/2506.08325
作者: Marcos Matabuena,Rahul Ghosal,Pavlo Mozharovskyi,Oscar Hernan Madrid Padilla,Jukka-Pekka Onnela
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.13970
点击查看摘要
Abstract:Depth measures are powerful tools for defining level sets in emerging, non-standard, and complex random objects such as high-dimensional multivariate data, functional data, and random graphs. Despite their favorable theoretical properties, the integration of depth measures into regression modeling to provide prediction regions remains a largely underexplored area of research. To address this gap, we propose a novel, model-free uncertainty quantification algorithm based on conditional depth measures, specifically conditional kernel mean embeddings and an integrated depth measure. These new algorithms can be used to define prediction and tolerance regions when predictors and responses are defined in separable Hilbert spaces. The use of kernel mean embeddings ensures faster convergence rates in prediction region estimation. To enhance the practical utility of the algorithms with finite samples, we also introduce a conformal prediction variant that provides marginal, non-asymptotic guarantees for the derived prediction regions. Additionally, we establish both conditional and unconditional consistency results, as well as fast convergence rates in certain homoscedastic settings. We evaluate the finite-sample performance of our model in extensive simulation studies involving various types of functional data and traditional Euclidean scenarios. Finally, we demonstrate the practical relevance of our approach through a digital health application related to physical activity, aiming to provide personalized recommendations.
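For intuition on the marginal, non-asymptotic guarantee, here is the plain split-conformal construction for a scalar response with absolute-residual scores; the paper's depth- and kernel-embedding-based regions generalize this idea to Hilbert-space responses:

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
    scores = np.abs(y_cal - predict(X_cal))        # calibration scores
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n     # finite-sample correction
    q = np.quantile(scores, min(level, 1.0), method="higher")
    mu = predict(X_test)
    return mu - q, mu + q   # marginal coverage >= 1 - alpha
```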
[LG-91] Constrained Pareto Set Identification with Bandit Feedback ICML2025
链接: https://arxiv.org/abs/2506.08127
作者: Cyrille Kone,Emilie Kaufmann,Laura Richert
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: To appear in Proceedings of ICML2025
点击查看摘要
Abstract:In this paper, we address the problem of identifying the Pareto Set under feasibility constraints in a multivariate bandit setting. Specifically, given a K-armed bandit with unknown means \mu_1, \dots, \mu_K \in \mathbb{R}^d , the goal is to identify the set of arms whose mean is not uniformly worse than that of another arm (i.e., not smaller for all objectives), while satisfying some known set of linear constraints, expressing, for example, some minimal performance on each objective. Our focus lies in fixed-confidence identification, for which we introduce an algorithm that significantly outperforms racing-like algorithms and the intuitive two-stage approach that first identifies feasible arms and then their Pareto Set. We further prove an information-theoretic lower bound on the sample complexity of any algorithm for constrained Pareto Set identification, showing that the sample complexity of our approach is near-optimal. Our theoretical results are supported by an extensive empirical evaluation on a series of benchmarks.
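The target set itself (as opposed to the fixed-confidence sampling algorithm) is straightforward to state; the sketch below assumes dominance is checked among feasible arms only, which is one natural reading:

```python
import numpy as np

def constrained_pareto_set(means, A, b):
    """means: (K, d) arm means; constraints A @ mu >= b (A: (m, d))."""
    feasible = np.where(np.all(means @ A.T >= b, axis=1))[0]
    pareto = []
    for i in feasible:
        dominated = any(
            np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
            for j in feasible if j != i)
        if not dominated:
            pareto.append(i)
    return pareto
```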
[LG-92] Continuous Policy and Value Iteration for Stochastic Control Problems and Its Convergence
链接: https://arxiv.org/abs/2506.08121
作者: Qi Feng,Gu Wang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 37 pages
点击查看摘要
Abstract:We introduce a continuous policy-value iteration algorithm where the approximations of the value function of a stochastic control problem and the optimal control are simultaneously updated through Langevin-type dynamics. This framework applies to both the entropy-regularized relaxed control problems and the classical control problems, with infinite horizon. We establish policy improvement and demonstrate convergence to the optimal control under the monotonicity condition of the Hamiltonian. By utilizing Langevin-type stochastic differential equations for continuous updates along the policy iteration direction, our approach enables the use of distribution sampling and non-convex learning techniques in machine learning to optimize the value function and identify the optimal control simultaneously.
[LG-93] Dynamic Diffusion Schrödinger Bridge in Astrophysical Observational Inversions
链接: https://arxiv.org/abs/2506.08065
作者: Ye Zhu,Duo Xu,Zhiwei Deng,Jonathon C. Tan,Olga Russakovsky
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Preprint. Code will be available at this https URL
点击查看摘要
Abstract:We study Diffusion Schrödinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models’ learning abilities beyond pure data statistics, paving a path for future physics-aware generative models which can align dynamics between machine learning and real (astro)physical systems.
[LG-94] MOSS: Multi-Objective Optimization for Stable Rule Sets
链接: https://arxiv.org/abs/2506.08030
作者: Brian Liu,Rahul Mazumder
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We present MOSS, a multi-objective optimization framework for constructing stable sets of decision rules. MOSS incorporates three important criteria for interpretability: sparsity, accuracy, and stability, into a single multi-objective optimization framework. Importantly, MOSS allows a practitioner to rapidly evaluate the trade-off between accuracy and stability in sparse rule sets in order to select an appropriate model. We develop a specialized cutting plane algorithm in our framework to rapidly compute the Pareto frontier between these two objectives, and our algorithm scales to problem instances beyond the capabilities of commercial optimization solvers. Our experiments show that MOSS outperforms state-of-the-art rule ensembles in terms of both predictive performance and stability.
信息检索
[IR-0] Leveraging LLMs to Evaluate Usefulness of Document
链接: https://arxiv.org/abs/2506.08626
作者: Xingzhu Wang,Erhan Zhang,Yiqun Chen,Jinghan Xuan,Yucheng Hou,Yitong Xu,Ying Nie,Shuaiqiang Wang,Dawei Yin,Jiaxin Mao
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:The conventional Cranfield paradigm struggles to effectively capture user satisfaction due to its weak correlation between relevance and satisfaction, alongside the high costs of relevance annotation in building test collections. To tackle these issues, our research explores the potential of leveraging large language models (LLMs) to generate multilevel usefulness labels for evaluation. We introduce a new user-centric evaluation framework that integrates users’ search context and behavioral data into LLMs. This framework uses a cascading judgment structure designed for multilevel usefulness assessments, drawing inspiration from ordinal regression techniques. Our study demonstrates that when well-guided with context and behavioral information, LLMs can accurately evaluate usefulness, allowing our approach to surpass third-party labeling methods. Furthermore, we conduct ablation studies to investigate the influence of key components within the framework. We also apply the labels produced by our method to predict user satisfaction, with real-world experiments indicating that these labels substantially improve the performance of satisfaction prediction models.
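The cascading, ordinal-regression-inspired judgment can be pictured as a sequence of binary "useful at level l or higher?" queries that stops at the first rejection. In the sketch, `llm` is a hypothetical callable and the prompt wording is illustrative, not taken from the paper:

```python
def cascaded_usefulness_label(llm, context, behavior, document,
                              levels=(1, 2, 3)):
    """Returns a multilevel usefulness label; `llm` is a hypothetical
    callable mapping a prompt string to a 'yes'/'no' answer."""
    label = 0
    for level in levels:
        prompt = (f"Search context: {context}\nUser behavior: {behavior}\n"
                  f"Document: {document}\n"
                  f"Is this document useful at level {level} or higher? "
                  f"Answer yes or no.")
        if not llm(prompt).strip().lower().startswith("yes"):
            break                 # cascade stops at the first rejection
        label = level
    return label
```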
[IR-1] TSRec: Enhancing Repeat-Aware Recommendation from a Temporal-Sequential Perspective
Link: https://arxiv.org/abs/2506.08531
Authors: Shigang Quan, Shui Liu, Zhenzhe Zheng, Fan Wu
Categories: Information Retrieval (cs.IR)
Notes:
Click to view abstract
Abstract:Repeat consumption, such as repurchasing items and relistening to songs, is a common scenario in daily life. To model repeat consumption, repeat-aware recommendation has been proposed to predict which item will be re-interacted with, based on the user-item interactions. In this paper, we investigate various inherent characteristics to enhance repeat-aware recommendation. Specifically, we explore these characteristics from two aspects: one is the temporal aspect, where we consider the time-interval relationship in the user behavior sequence; the other is the sequential aspect, where we consider the sequence-level relationship in the user behavior sequence. Our intuition is that both the temporal pattern and the sequential pattern reflect users' intentions of repeat consumption. By utilizing these two patterns, a novel model called Temporal and Sequential repeat-aware Recommendation (TSRec for short) is proposed to enhance repeat-aware recommendation. TSRec has three main components: 1) User-specific Temporal Representation Module (UTRM), which encodes and extracts user historical repeat temporal information; 2) Item-specific Temporal Representation Module (ITRM), which incorporates item time-interval information as side information to alleviate the data sparsity problem of user repeat behavior sequences; and 3) Sequential Repeat-Aware Module (SRAM), which represents the similarity between the user's current and last repeat sequences. Extensive experimental results on three public benchmarks demonstrate the superiority of TSRec over state-of-the-art methods. The implementation code is available at this https URL.
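A hedged sketch of the temporal signal TSRec exploits: extracting the gaps between a user's repeat interactions with the same item and bucketizing them for an embedding table. The bucket edges and data are illustrative assumptions, not the paper's exact design:

```python
# Repeat time-interval features from a behavior sequence (sketch).
from collections import defaultdict
import bisect

BUCKETS = [3600, 86400, 7 * 86400, 30 * 86400]  # 1h, 1d, 1w, 1m

def repeat_interval_buckets(events):
    """events: list of (item_id, unix_timestamp), time-ordered.
    Returns, per item, a bucket id for each repeat gap."""
    last_seen, features = {}, defaultdict(list)
    for item, ts in events:
        if item in last_seen:
            gap = ts - last_seen[item]
            features[item].append(bisect.bisect_left(BUCKETS, gap))
        last_seen[item] = ts
    return dict(features)

events = [("song_a", 0), ("song_b", 100), ("song_a", 4000),
          ("song_a", 200000)]
print(repeat_interval_buckets(events))  # {'song_a': [1, 2]}
```

The bucket ids would index an embedding table inside a module like UTRM; the same bucketization applied per item gives the ITRM-style side information.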
[IR-2] MERIT: A Merchant Incentive Ranking Model for Hotel Search Ranking
Link: https://arxiv.org/abs/2506.08442
Authors: Shigang Quan, Hailong Tan, Shui Liu, Zhenzhe Zheng, Ruihao Zhu, Liangyue Li, Quan Lu, Fan Wu
Categories: Information Retrieval (cs.IR)
Notes:
Click to view abstract
Abstract:Online Travel Platforms (OTPs) have been working on improving their hotel Search Ranking (SR) systems, which facilitate efficient matching between consumers and hotels. Existing OTPs focus almost exclusively on improving platform revenue. In this work, we take a first step in incorporating hotel merchants' objectives into the design of hotel SR systems to achieve an incentive loop: the OTP tilts impressions and better-ranked positions toward merchants with high quality, and in return, the merchants provide better service to consumers. Three critical design challenges need to be resolved to achieve this incentive loop: the Matthew Effect in the consumer feedback loop, the unclear relation between hotel quality and performance, and conflicts between short-term and long-term revenue. To address these challenges, we propose MERIT, a MERchant IncenTive ranking model, which can simultaneously take the interests of merchants and consumers into account. We define a new Merchant Competitiveness Index (MCI) to represent hotel merchant quality and propose a new Merchant Tower to model the relation between MCI and ranking scores. We also design a monotonic structure for the Merchant Tower to provide a clear relation between hotel quality and performance. Finally, we propose a Multi-objective Stratified Pairwise Loss, which can mitigate the conflicts between the OTP's short-term and long-term revenue. Offline experiment results indicate that MERIT outperforms existing baseline methods in optimizing the demands of both consumers and merchants. Furthermore, we conducted an online A/B test and obtained an improvement of 3.02% in the MCI score.
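One standard way to realize a monotonic tower is to constrain all weights to be positive, which guarantees the ranking score never decreases as MCI grows. The sketch below shows this construction under assumed layer sizes; it is an illustration of the general technique, not the paper's exact architecture:

```python
# A monotonic "Merchant Tower" sketch: softplus keeps every weight
# positive, and tanh is increasing, so score is monotone in MCI.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicTower(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(1, hidden))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(hidden, 1))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, mci):  # mci: (batch, 1)
        h = torch.tanh(mci @ F.softplus(self.w1) + self.b1)
        return h @ F.softplus(self.w2) + self.b2

tower = MonotonicTower()
mci = torch.linspace(0, 1, 5).unsqueeze(1)
scores = tower(mci).squeeze(1)
assert torch.all(scores[1:] >= scores[:-1])  # monotone in MCI
print(scores)
```

The positivity constraint is what makes the quality-to-rank relation auditable: a merchant who raises their MCI can never be penalized by the tower.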
[IR-3] NAM: A Normalization Attention Model for Personalized Product Search In Fliggy
Link: https://arxiv.org/abs/2506.08382
Authors: Shui Liu, Mingyuan Tao, Maofei Que, Pan Li, Dong Li, Shenghua Ni, Zhuoran Zhuang
Categories: Information Retrieval (cs.IR)
Notes:
Click to view abstract
Abstract:Personalized product search provides significant benefits to e-commerce platforms by extracting more accurate user preferences from historical behaviors. Previous studies largely focused on the user factors when personalizing the search query, while ignoring the item perspective, which leads to the following two challenges that we summarize in this paper: First, previous approaches relying only on co-occurrence frequency tend to overestimate the conversion rates for popular items and underestimate those for long-tail items, resulting in inaccurate item similarities; Second, user purchasing propensity is highly heterogeneous according to the popularity of the target item: it is less correlated with the user's historical behavior for a popular item and more correlated for a long-tail item. To address these challenges, in this paper we propose NAM, a Normalization Attention Model, which optimizes "when to personalize" by utilizing Inverse Item Frequency (IIF) and employing a gating mechanism, as well as optimizes "how to personalize" by normalizing the attention mechanism from a global perspective. Through comprehensive experiments, we demonstrate that our proposed NAM model significantly outperforms state-of-the-art baseline models. Furthermore, we conducted an online A/B test at Fliggy, and obtained a significant improvement of 0.8% over the latest production system in conversion rate.
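A toy sketch of the "when to personalize" idea: the target item's Inverse Item Frequency opens a sigmoid gate that shifts weight from a popularity-based score to a personalized score, so long-tail targets lean more on user history. The gate parameters and the exact IIF formula are assumptions, not the paper's:

```python
# IIF-gated blending of popularity and personalized scores (sketch).
import math

def iif(item_freq: int, total_interactions: int) -> float:
    """Inverse Item Frequency: large for long-tail items."""
    return math.log(total_interactions / (1 + item_freq))

def gated_score(pop_score, personal_score, item_freq, total,
                w=0.5, b=-2.0):
    g = 1 / (1 + math.exp(-(w * iif(item_freq, total) + b)))  # gate
    return (1 - g) * pop_score + g * personal_score

# popular target: gate stays nearly closed, popularity dominates
print(gated_score(0.8, 0.3, item_freq=90_000, total=100_000))
# long-tail target: gate opens, personalization dominates
print(gated_score(0.2, 0.7, item_freq=10, total=100_000))
```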
[IR-4] Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models
Link: https://arxiv.org/abs/2506.08352
Authors: Wentao Shi, Yiqing Shen
Categories: Information Retrieval (cs.IR)
Notes:
Click to view abstract
Abstract:Large language models (LLMs) can face factual limitations when responding to time-sensitive queries about recent events that arise after the knowledge thresholds of their training corpus. Existing search-augmented approaches fall into two categories, each with distinct limitations: multi-agent search frameworks incur substantial computational overhead by separating search planning and response synthesis across multiple LLMs, while single-LLM tool-calling methods restrict themselves to sequentially planned, single-query searches over a single search source. We present Reasoning-Search (R-Search), a single-LLM search framework that unifies multi-step planning, multi-source search execution, and answer synthesis within one coherent inference process. Innovatively, it structures the output into four explicitly defined components: reasoning steps that guide the search process (think), a natural-language directed acyclic graph that represents the search plans over diverse sources (search), retrieved results from executing the search plans (result), and the synthesized final answer (answer). To enable effective generation of these structured outputs, we propose a specialized Reinforcement Fine-Tuning (ReFT) method based on GRPO, together with a multi-component reward function that optimizes the LLM's answer correctness, the structural validity of the generated DAG, and adherence to the defined output format. Experimental evaluation on FinSearchBench-24, SearchExpertBench-25, and seven Q&A benchmarks demonstrates that R-Search outperforms state-of-the-art methods, while achieving substantial efficiency gains through a 70% reduction in context token usage and an approximately 50% decrease in execution latency. Code is available at this https URL.
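A minimal sketch of what such a multi-component reward could look like: answer correctness plus DAG validity (checked with Kahn's algorithm) plus format adherence. The tag names, weights, and exact-match correctness check are assumptions for illustration, not the paper's reward:

```python
# Multi-component reward for structured search outputs (sketch).

TAGS = ["think", "search", "result", "answer"]

def format_reward(text: str) -> float:
    """1.0 if all four structural tags appear, else 0.0."""
    return float(all(f"<{t}>" in text and f"</{t}>" in text for t in TAGS))

def dag_reward(edges) -> float:
    """1.0 if the search plan is acyclic (Kahn's algorithm)."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return float(seen == len(nodes))

def total_reward(text, edges, answer, gold, w=(1.0, 0.3, 0.2)):
    correct = float(answer.strip().lower() == gold.strip().lower())
    return (w[0] * correct + w[1] * dag_reward(edges)
            + w[2] * format_reward(text))

out = ("<think>...</think><search>...</search>"
       "<result>...</result><answer>42</answer>")
print(total_reward(out, [("web", "news"), ("news", "synth")], "42", "42"))
```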
[IR-5] Rule-Assisted Attribute Embedding ECML-PKDD2025
Link: https://arxiv.org/abs/2506.08314
Authors: Sibo Zhao, Michael Bewong, Selasi Kwashie, Junwei Hu, Zaiwen Feng
Categories: Information Retrieval (cs.IR)
Notes: ECML-PKDD 2025
Click to view abstract
Abstract:Recommendation systems often overlook the rich attribute information embedded in property graphs, limiting their effectiveness. Existing graph convolutional network (GCN) models either ignore attributes or rely on simplistic (user, item, attribute) triples, failing to capture deeper semantic structures. We propose RAE (Rule-Assisted Approach for Attribute Embedding), a novel method that improves recommendations by mining semantic rules from property graphs to guide attribute embedding. RAE performs rule-based random walks to generate enriched attribute representations, which are integrated into GCNs. Experiments on real-world datasets (BlogCatalog, Flickr) show that RAE outperforms state-of-the-art baselines by 10.6% on average in Recall@20 and NDCG@20. RAE also demonstrates greater robustness to sparse data and missing attributes, highlighting the value of leveraging structured attribute information in recommendation tasks.
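A hedged sketch of a rule-assisted random walk: transitions whose edge-type pattern matches a mined rule are up-weighted, so walks tend to collect semantically related attributes. The toy graph, rule, and boost factor are assumptions, not RAE's actual rule mining:

```python
# Rule-biased random walk over a property graph (sketch).
import random

# adjacency: node -> list of (neighbor, edge_type)
graph = {
    "user1": [("item1", "bought"), ("item2", "viewed")],
    "item1": [("attr:red", "has_color"), ("user1", "bought_by")],
    "item2": [("attr:blue", "has_color")],
    "attr:red": [("item1", "color_of")],
    "attr:blue": [("item2", "color_of")],
}
rules = {("bought", "has_color")}  # mined rule: purchase -> attribute

def rule_walk(start, length=4, boost=3.0, seed=0):
    rng = random.Random(seed)
    walk, node, prev_type = [start], start, None
    for _ in range(length):
        nbrs = graph.get(node, [])
        if not nbrs:
            break
        weights = [boost if (prev_type, et) in rules else 1.0
                   for _, et in nbrs]
        node, prev_type = rng.choices(nbrs, weights=weights)[0]
        walk.append(node)
    return walk

# walks like this would feed a skip-gram-style attribute embedder
print(rule_walk("user1"))
```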
[IR-6] Serendipitous Recommendation with Multimodal LLM
Link: https://arxiv.org/abs/2506.08283
Authors: Haoting Wang, Jianling Wang, Hao Li, Fangjun Yi, Mengyu Fu, Youwei Zhang, Yifan Liu, Liang Liu, Minmin Chen, Ed H. Chi, Lichan Hong, Haokai Lu
Categories: Information Retrieval (cs.IR)
Notes:
Click to view abstract
Abstract:Conventional recommendation systems succeed in identifying relevant content but often fail to provide users with surprising or novel items. Multimodal Large Language Models (MLLMs) possess the world knowledge and multimodal understanding needed for serendipity, but their integration into billion-item-scale platforms presents significant challenges. In this paper, we propose a novel hierarchical framework where fine-tuned MLLMs provide high-level guidance to conventional recommendation models, steering them towards more serendipitous suggestions. This approach leverages MLLM strengths in understanding multimodal content and user interests while retaining the efficiency of traditional models for item-level recommendation. This mitigates the complexity of applying MLLMs directly to vast action spaces. We also demonstrate a chain-of-thought strategy enabling MLLMs to discover novel user interests by first understanding video content and then identifying relevant yet unexplored interest clusters. Through live experiments within a commercial short-form video platform serving billions of users, we show that our MLLM-powered approach significantly improves both recommendation serendipity and user satisfaction.
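The hierarchical pattern can be sketched as a two-stage pipeline: a (hypothetical) MLLM call proposes a relevant-but-unexplored interest cluster via a chain-of-thought prompt, and a conventional retriever then ranks items only within that cluster. `call_mllm`, the prompt, and the cluster catalog are illustrative stand-ins, not the production system:

```python
# Hierarchical MLLM-guided serendipitous recommendation (sketch).

def call_mllm(prompt: str) -> str:
    # placeholder for a fine-tuned MLLM; returns a cluster name
    return "urban gardening"

def recommend_serendipitous(user_history, clusters, retriever, k=10):
    prompt = (
        "Watched videos: " + "; ".join(user_history) + "\n"
        "Step 1: summarize what these videos are about.\n"
        "Step 2: name one related interest cluster the user has NOT "
        f"explored, chosen from: {', '.join(clusters)}."
    )
    target_cluster = call_mllm(prompt)          # high-level guidance
    return retriever(target_cluster, k)         # efficient item ranking

clusters = ["home cooking", "urban gardening", "woodworking"]
retriever = lambda cluster, k: [f"{cluster} video #{i}" for i in range(k)]
print(recommend_serendipitous(["knife skills", "sourdough basics"],
                              clusters, retriever, k=3))
```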
[IR-7] No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation
Link: https://arxiv.org/abs/2506.08196
Authors: Brian D. Zimmerman, Julien Aubert-Béduchaud, Florian Boudin, Akiko Aizawa, Olga Vechtomova
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Notes: 6 pages, 5 figures, 2 tables
Click to view abstract
Abstract:Existing techniques for citation recommendation are constrained by their adherence to article contents and metadata. We leverage GPT-4o-mini’s latent expertise as an inquisitive assistant by instructing it to ask questions which, when answered, could expose new insights about an excerpt from a scientific article. We evaluate the utility of these questions as retrieval queries, measuring their effectiveness in retrieving and ranking masked target documents. In some cases, generated questions ended up being better queries than extractive keyword queries generated by the same model. We additionally propose MMR-RBO, a variation of Maximal Marginal Relevance (MMR) using Rank-Biased Overlap (RBO) to identify which questions will perform competitively with the keyword baseline. As all question queries yield unique result sets, we contend that there are no stupid questions.
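A hedged sketch of the MMR-RBO idea: standard MMR selection where the redundancy term is the Rank-Biased Overlap between the result lists retrieved by candidate question queries. The truncated RBO form, weights, and toy data are illustrative assumptions:

```python
# MMR selection with a Rank-Biased Overlap redundancy term (sketch).

def rbo(list_a, list_b, p=0.9):
    """Truncated Rank-Biased Overlap between two ranked lists."""
    score, depth = 0.0, min(len(list_a), len(list_b))
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

def mmr_rbo_select(candidates, results, relevance, k=2, lam=0.7):
    """candidates: question strings; results[q]: ranked doc ids;
    relevance[q]: retrieval quality score for question q."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(q):
            redundancy = max(
                (rbo(results[q], results[s]) for s in selected),
                default=0.0)
            return lam * relevance[q] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

results = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2", "d4"],
           "q3": ["d7", "d8", "d9"]}
relevance = {"q1": 0.9, "q2": 0.85, "q3": 0.6}
print(mmr_rbo_select(["q1", "q2", "q3"], results, relevance))
```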
Attachment Download
Click here to download today's full paper list